Code Clone Analysis


Book Description

This is the first book organized around code clone analysis. To cover the broad studies of code clone analysis, this book selects past research results that are important to the progress of the field and updates them with new results and future directions. The first chapter provides an introduction for readers who are inexperienced in the foundation of code clone analysis, defines clones and related terms, and discusses the classification of clones. The chapters that follow are categorized into three main parts to present 1) major tools for code clone analysis, 2) fundamental topics such as evaluation benchmarks, clone visualization, code clone searches, and code similarities, and 3) applications to actual problems. Each chapter includes a valuable reference list that will help readers to achieve a comprehensive understanding of this diverse field and to catch up with the latest research results. Code clone analysis relies heavily on computer science theories such as pattern matching algorithms, computer language, and software metrics. Consequently, code clone analysis can be applied to a variety of real-world tasks in software development and maintenance such as bug finding and program refactoring. This book will also be useful in designing an effective curriculum that combines theory and application of code clone analysis in university software engineering courses.




Clone Evolution


Book Description

Duplicated passages of source code - code clones - are a common property of software systems. While clones are beneficial in some situations, their presence causes various problems for software maintenance. Most of these problems are strongly related to change and include, for example, the need to propagate changes across duplicated code fragments and the risk of inconsistent changes to clones that are meant to evolve identically. Hence, we need a sophisticated analysis of clone evolution to better understand, assess, and manage duplication in practice. This thesis introduces Clone Evolution Graphs as a technique to model clone relations and their evolution within the history of a system. We present our incremental algorithm for efficient and automated extraction of Clone Evolution Graphs from a system's history. The approach is shown to scale even for large systems with long histories making it applicable to retroactive analysis ofclone evolution as well as live tracking of clones during software maintenance.We have used Clone Evolution Graphs in several studies to analyze versatile aspects of clone evolution in open-source as well as industrial systems. Our results show that the characteristics of clone evolution are quite different between systems, highlighting the need for a sophisticated technique like Clone Evolution Graphs to track clones and analyze their evolution on a per-system basis. We have also shown that Clone Evolution Graphs are well-suited to analyze the change behavior of individual clones and can be used to identify problematic clones within a system. In general, the results of our studies provide new insights into how clones evolve, how they are changed, and how they are removed.




Empirical Research towards a Relevance Assessment of Software Clones


Book Description

Redundancies in program source code - software clones - are a common phenomenon. Although it is often claimed that software clones decrease the maintainability of software systems and need to be managed, research in the last couple of years showed that not all clones can be considered harmful. A sophisticated assessment of the relevance of software clones and a cost-benefit analysis of clone management is needed to gain a better understanding of cloning and whether it is truly a harmful phenomenon. This thesis introduces techniques to model, analyze, and evaluate versatile aspects of software clone evolution within the history of a system. We present a mapping of non-identical clones across multiple versions of a system, that avoids possible ambiguities of previous approaches. Though processing more data to determine the context of each clone to avoid an ambiguous mapping, the approach is shown to be efficient and applicable to large systems for a retrospective analysis of software clone evolution. The approach has been used in several studies to gain insights into the phenomenon of cloning in open-source as well as industrial software systems. Our results show that non-identical clones require more attention regarding clone management compared to identical clones as they are the dominating clone type for the main share of our subject systems. Using the evolution model to investigate costs and benefits of refactorings that remove clones, we conclude that clone removals could not reduce maintenance costs for most systems under study.




Toward an Understanding of Software Code Cloning as a Development Practice


Book Description

Code cloning is the practice of duplicating existing source code for use elsewhere within a software system. Within the research community, conventional wisdom has asserted that code cloning is generally a bad practice, and that code clones should be removed or refactored where possible. While there is significant anecdotal evidence that code cloning can lead to a variety of maintenance headaches -- such as code bloat, duplication of bugs, and inconsistent bug fixing -- there has been little empirical study on the frequency, severity, and costs of code cloning with respect to software maintenance. This dissertation seeks to improve our understanding of code cloning as a common development practice through the study of several widely adopted, medium-sized open source software systems. We have explored the motivations behind the use of code cloning as a development practice by addressing several fundamental questions: For what reasons do developers choose to clone code? Are there distinct identifiable patterns of cloning? What are the possible short- and long-term term risks of cloning? What management strategies are appropriate for the maintenance and evolution of clones? When is the ``cure'' (refactoring) likely to cause more harm than the ``disease'' (cloning)? There are three major research contributions of this dissertation. First, we propose a set of requirements for an effective clone analysis tool based on our experiences in clone analysis of large software systems. These requirements are demonstrated in an example implementation which we used to perform the case studies prior to and included in this thesis. Second, we present an annotated catalogue of common code cloning patterns that we observed in our studies. Third, we present an empirical study of the relative frequencies and likely harmfulness of instances of these cloning patterns as observed in two medium-sized open source software systems, the Apache web server and the Gnumeric spreadsheet application. In summary, it appears that code cloning is often used as a principled engineering technique for a variety of reasons, and that as many as 71% of the clones in our study could be considered to have a positive impact on the maintainability of the software system. These results suggest that the conventional wisdom that code clones are generally harmful to the quality of a software system has been proven wrong.







An Empirical Study for the Impact of Maintenance Activities in Clone Evolution


Book Description

Code clones are duplicated code fragments that are copied to re-use functionality and speed up development. However, due to the duplicate nature of code clones, inconsistent updates can lead to bugs in the software system. Existing research investigates the inconsistent updates through analysis of the updates to code clones and the bug fixes used to fix the inconsistent updates. We extend the work by investigating other factors that affect clone evolution, such as the number of developers. On two levels of analysis, the method and clone class level, we conduct an empirical study on clone evolution. We analyze the factors affecting bug fixes and co-change (i.e. update cloned methods at the same time) using our new metrics. Our metrics are related to the developers, code complexity, and stages of development. We use these metrics to find ways to improve the maintenance of cloned code. We discover that one way to improve maintenance of code clones is the decrease of code complexity. We find that increased code complexity leads to a decrease in co-change, which can lead to bugs in the software. We perform our study on 6 applications. To maximize the number of clones detected, we use two existing code clone detection tools: SimScan and Simian. SimScan was used to find clones in 5 of the applications due to its versatility in finding code clones. Simian was used to detect clones due to its reliability to find code clones regardless of language or compilation problems. To analyze and determine the significance of the metrics, we use the R Statistical Toolkit.




Software Clones - Guilty Until Proven Innocent?


Book Description

Software systems contain redundant code that originated from the use of copy and paste. While such cloning may be beneficial in the short term as it accelerates development, it is frequently despised as a risk to maintainability and quality in the long term. Code clones are said to cause extra change effort, because changes have to be propagated to all copies. They are also suspected to cause bugs when the copied code fragments are changed inconsistently. These accusations may be plausible but are not based on empirical facts. Indeed, they are prejudice. In the recent past, science has started the endeavor to find empirical evidence to support the alleged effects of clones. In this thesis, we analyze the effects of clones from three different perspectives. First, we investigate whether clones do indeed increase the maintenance effort in real and long lived software systems. Second, we analyze potential reasons for the cases where clones do cause bugs. Third, we take a new perspective to the problem by measuring the effects of clones in a controlled experiment. This allows us to gather new insights by observing software developers during their work, whereas previous studies were based on historical data. With our work we aim to empirically find advice for practitioners how to deal with clones and, if necessary, to provide an empirical basis for tools that help developers to manage clones.




Improving the Unification of Software Clones Using Tree and Graph Matching Algorithms


Book Description

Code duplication is common in all kind of software systems and is one of the most troublesome hurdles in software maintenance and evolution activities. Even though these code clones are created for the reuse of some functionality, they usually go through several modifications after their initial introduction. This has a serious negative impact on the maintainability, comprehensibility, and evolution of software systems. Existing code duplication can be eliminated by extracting the common functionality into a single module. In the past, several techniques have been developed for the detection and management of software clones. However, the unification and refactoring of software clones is still a challenging problem, since the existing tools are mostly focused on clone detection and there is no tool to find particularly refactoring-oriented clones. The programmers need to manually understand the clones returned by the clone detection tools, decide whether they should be refactored, and finally perform their refactoring. This obvious gap between the clone detection tools and the clone analysis tools, makes the refactoring tedious and the programmers reluctant towards refactoring duplicate codes. In this thesis, an approach for the unification and refactoring of software clones that overcomes the limitations of previous approaches is presented. More specifically, the proposed technique is able to detect and parameterize non-trivial differences between the clones. Moreover, it can find a mapping between the statements of the clones that minimizes the number of differences. We have also defined preconditions in order to determine whether the duplicated code can be safely refactored to preserve the behavior of the existing code. We compared the proposed technique with a competitive clone refactoring tool and concluded that our approach is able to find a significantly larger number of refactorable clones.




Toward Improved Understanding and Management of Software Clones


Book Description

The cloning of code is controversial as a development practice. Empirical studies on the long-term effects of cloning on software quality and maintainability have produced mixed results. Some studies have found that cloning has a negative impact on code readability, bug propagation, and the presence of cloning may indicate wider problems in software design and management. At the same time, other studies have found that cloned code is less likely to have defects, and thus is arguably more stable, better designed, and better maintained. These results suggest that the effect of cloning on software quality and maintainability may be determinable only on a case-by-case basis, and this only aggravates the challenge of establishing a principled framework of clone management and understanding. This thesis aims to improve the understanding and management of clones within software systems.