Publications
Code Clone Analysis: BigCloneBench
A chapter of the book Code Clone Analysis: Research, Tools and Practices, published by Springer (2021).
Jeffrey Svajlenko, Chanchal Kumar Roy
Many clone detection tools and techniques have been created to tackle various use-cases, including syntactical clone detection, semantic clone detection, inter-project clone detection, large-scale clone detection and search, and so on. While a few clone benchmarks are available, none target this breadth of usage. BigCloneBench is a clone benchmark designed to evaluate clone detection tools across a variety of use-cases. It was built by mining a large inter-project source repository for functions implementing known functionalities. This produced a large benchmark of inter-project and intra-project semantic clones across the full spectrum of syntactical similarity. The benchmark is augmented with an evaluation framework named BigCloneEval, which simplifies tool evaluation studies and allows the user to slice the benchmark based on clone properties in order to evaluate for a particular use-case. We summarize a number of our own studies that demonstrate BigCloneBench's value, and show where it has been used by the research community. In this chapter, we discuss clone benchmarking theory and the existing benchmarks, describe the BigCloneBench creation process, and overview the BigCloneEval evaluation procedure. We conclude by summarizing BigCloneBench's usage in the literature, and present ideas for future improvements and expansion of the benchmark.
DOI (Chapter): 10.1007/978-981-16-1927-4_7
DOI (Book): 10.1007/978-981-16-1927-4
CloneCognition: Machine Learning Based Code Clone Validation Tool
ESEC/FSE 2019: The 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Golam Mostaeen, Jeffrey Svajlenko, Banani Roy, Chanchal Kumar Roy and Kevin A. Schneider
A code clone is a pair of similar code fragments within or between software systems. To detect every possible clone pair in a software system while handling complex code structures, clone detection tools apply a great deal of generalization to the original source code. This generalization often causes tools to return code fragments that are only coincidentally similar and not considered clones by users, so the reported candidate clones require manual validation, which is both time-consuming and challenging. In this paper, we propose a machine-learning-based tool, CloneCognition, to automate this laborious manual validation process. The tool runs on top of any code clone detection tool to facilitate the clone validation process. The tool shows promising clone classification performance with an accuracy of up to 87.4%. The tool also exhibits significant improvement in the results when compared with state-of-the-art techniques for code clone validation.
DOI: 10.1145/3338906.3341182
The Mutation and Injection Framework: Evaluating Clone Detection Tools with Mutation Analysis
IEEE Transactions on Software Engineering
Jeffrey Svajlenko and Chanchal Kumar Roy
An abundant number of clone detection tools have been proposed in the literature due to the many applications and benefits of clone detection. However, the performance evaluation and comparison of these clone detectors has been difficult, due to a lack of reliable benchmarks and the manual effort required to validate a large number of candidate clones. In particular, there has been a lack of a synthetic benchmark that can precisely and comprehensively measure clone-detection recall. In this paper, we present a mutation-analysis based benchmarking framework that can be used to evaluate the recall of clone detection tools, not only for different types of clones but also for specific kinds of clone edits, without any manual effort. The framework uses an editing taxonomy of clone synthesis to generate thousands of artificial clones, injects them into code bases, and automatically evaluates the subject clone detection tools following the mutation analysis approach. Additionally, the framework allows custom clone pairs to be used for evaluating the subject tools. This makes it possible to evaluate specialized tools in specialized contexts, such as evaluating a tool's capability for the detection of complex Type-4 clones or real-world clones, without writing complex mutation operators for them. We demonstrate this framework by evaluating the performance of ten modern clone detection tools across two clone granularities (function and block) and three programming languages (Java, C and C#). Furthermore, we provide a variant of the framework that can be used to evaluate specialized tools such as large-gap clone detectors. Our experiments demonstrate confidence in the accuracy of our Mutation and Injection Framework when comparing against the expected results of the corresponding tools, and against widely used real-world benchmarks such as Bellon's benchmark and BigCloneBench.
DOI: 10.1109/TSE.2019.2912962
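To illustrate the mutation-analysis idea above, here is a minimal, hypothetical sketch of one editing-taxonomy operator: it synthesizes a Type-2 clone variant by consistently renaming identifiers. The framework's real operators are implemented with source transformations and cover a full editing taxonomy; the class and names below are our own illustration.

    // Illustrative sketch only: a toy "mutation operator" in the spirit of
    // the framework's editing taxonomy. It synthesizes a Type-2 clone of a
    // code fragment by systematically renaming identifiers. This class and
    // its names are hypothetical, not the framework's actual code.
    import java.util.*;
    import java.util.regex.*;

    public class RenameMutationOperator {
        // Applies a consistent identifier renaming, one classic Type-2 edit.
        static String mutate(String fragment, Map<String, String> renames) {
            String out = fragment;
            for (Map.Entry<String, String> e : renames.entrySet()) {
                // \b ensures whole-identifier matches only.
                out = out.replaceAll("\\b" + Pattern.quote(e.getKey()) + "\\b",
                                     Matcher.quoteReplacement(e.getValue()));
            }
            return out;
        }

        public static void main(String[] args) {
            String original = "int total = 0; for (int i = 0; i < n; i++) total += data[i];";
            String clone = mutate(original, Map.of("total", "sum", "data", "values"));
            // The (original, clone) pair is then injected into a code base as
            // a known reference clone for measuring a tool's recall.
            System.out.println(clone);
        }
    }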
On the Use of Machine Learning Techniques Towards the Design of Cloud Based Automatic Code Clone Validation Tools
2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM)
Golam Mostaeen, Jeffrey Svajlenko, Banani Roy, Chanchal K. Roy and Kevin A. Schneider
A code clone is a pair of similar code fragments within or between software systems. Since code clones often negatively impact the maintainability of a software system, a great number of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, clone detection tools work at the syntax level (texts, tokens, ASTs and so on) and lack user-specific preferences. This often means the reported clones must be manually validated prior to any analysis, in order to filter the true positive clones according to task- or user-specific considerations. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with existing approaches for automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms and open source software systems.
DOI: 10.1109/SCAM.2018.00025
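As a rough illustration of the approach above (not the paper's actual features or model), clone validation can be framed as extracting numeric features from a candidate pair and scoring them with a trained classifier. The features and coefficients below are invented placeholders:

    // Minimal sketch of ML-based clone validation, under our own assumptions.
    import java.util.*;

    public class CloneValidatorSketch {
        // Example features: token-set similarity and size ratio.
        static double[] features(String f1, String f2) {
            Set<String> s1 = new HashSet<>(Arrays.asList(f1.split("\\W+")));
            Set<String> s2 = new HashSet<>(Arrays.asList(f2.split("\\W+")));
            Set<String> inter = new HashSet<>(s1); inter.retainAll(s2);
            Set<String> union = new HashSet<>(s1); union.addAll(s2);
            double jaccard = union.isEmpty() ? 0 : (double) inter.size() / union.size();
            double sizeRatio = (double) Math.min(s1.size(), s2.size())
                             / Math.max(1, Math.max(s1.size(), s2.size()));
            return new double[] { jaccard, sizeRatio };
        }

        // A logistic model with invented weights stands in for the trained classifier.
        static boolean isTruePositive(String f1, String f2) {
            double[] x = features(f1, f2);
            double z = -2.0 + 4.5 * x[0] + 1.5 * x[1]; // hypothetical coefficients
            return 1.0 / (1.0 + Math.exp(-z)) > 0.5;   // probability of "true clone"
        }
    }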
CCAligner: A Token Based Large-Gap Clone Detector
2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE)
Pengcheng Wang, Jeffrey Svajlenko, Yanzhao Wu, Yun Xu and Chanchal K. Roy
Copying code and then pasting it with a large number of edits is a common activity in software development, and the pasted code is a kind of complicated Type-3 clone. Because of the large number of edits, we consider such a clone a large-gap clone. Large-gap clones can reflect the extension of code, such as changes and improvements. Existing state-of-the-art clone detectors suffer from several limitations in detecting large-gap clones. In this paper, we propose a tool, CCAligner, which uses a code window with e-mismatch tolerance (edit distance) for matching in order to detect large-gap clones. In our approach, a novel e-mismatch index is designed and an asymmetric similarity coefficient is used as the similarity measure. We thoroughly evaluate CCAligner both for large-gap clone detection and for general Type-1, Type-2 and Type-3 clone detection. The results show that CCAligner performs better than other competing tools in large-gap clone detection, and has the best execution time for 10 MLOC inputs with good precision and recall in general Type-1 to Type-3 clone detection. Compared with existing state-of-the-art tools, CCAligner is the best performing large-gap clone detection tool, and remains competitive with the best clone detectors in general Type-1, Type-2 and Type-3 clone detection.
DOI: 10.1145/3180155.3180179
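The asymmetric similarity coefficient mentioned above can be illustrated as follows: unlike Jaccard similarity, the overlap is normalized by each fragment's own size, so a fragment that survives inside a much larger, heavily edited copy can still score highly. This is a simplified sketch (token multisets reduced to sets; the e-mismatch index is omitted):

    import java.util.*;

    public class AsymmetricSimilarity {
        // Fraction of a's tokens that also appear in b (containment).
        static double containment(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            return a.isEmpty() ? 0 : (double) inter.size() / a.size();
        }

        // A candidate pair is reported if either direction meets the threshold,
        // which tolerates the large gaps that symmetric measures penalize.
        static boolean isClone(Set<String> a, Set<String> b, double theta) {
            return Math.max(containment(a, b), containment(b, a)) >= theta;
        }
    }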
Fast, Scalable and User-Guided Clone Detection
ICSE '18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings
Jeffrey Svajlenko and Chanchal K. Roy
Despite the great number of clone detection approaches proposed in the literature, few have the scalability and speed to analyze large inter-project source datasets, where clone detection has many potential applications. Furthermore, because of the many uses of clone detection, an approach is needed that can adapt to the needs of the user to detect any kind of clone. We propose an approach designed for user-guided clone detection that exploits the power of source transformation in a plugin-based source processing pipeline. Clones are detected using a simple Jaccard-based clone similarity metric, and users customize the representation of their source code as sets of terms to target particular types or kinds of clones. Fast and scalable clone detection is achieved with indexing, sub-block filtering and input partitioning.
DOI: 10.1145/3183440.3195005
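A minimal sketch of the Jaccard-based clone similarity metric named in the abstract, assuming fragments have already been transformed into sets of terms by the processing pipeline (terms and threshold below are illustrative):

    import java.util.*;

    public class JaccardClones {
        // Jaccard similarity: shared terms over total distinct terms.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> f1 = Set.of("for", "i", "n", "sum", "+=", "data");
            Set<String> f2 = Set.of("for", "j", "n", "total", "+=", "data");
            // Report the pair as a clone if similarity meets a tunable threshold.
            System.out.printf("similarity: %.2f%n", jaccard(f1, f2));
        }
    }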
CloneWorks: A Fast and Flexible Large-Scale Near-Miss Clone Detection Tool
2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)
Jeffrey Svajlenko and Chanchal K. Roy
Clone detection within large inter-project source-code repositories has numerous rich applications. CloneWorks is a fast and flexible clone detector for large-scale near-miss clone detection experiments. CloneWorks gives the user full control over the processing of the source code before clone detection, enabling the user to target any clone type or perform custom clone detection experiments. Scalable clone detection is achieved, even on commodity hardware, using our partitioned partial indexes approach. CloneWorks scales to 250MLOC in just four hours on an average workstation with good recall and precision.
DOI: 10.1109/ICSE-C.2017.78
Fast and Flexible Large-Scale Clone Detection with CloneWorks
2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)
Jeffrey Svajlenko and Chanchal K. Roy
Clone detection in very-large inter-project repositories has numerous applications in software research and development. However, existing tools do not provide the flexibility researchers need to explore this emerging domain. We introduce CloneWorks, a fast and flexible clone detector for large-scale clone detection experiments. CloneWorks gives the user full control over the representation of the source code before clone detection, including easy plug-in of custom source transformation, normalization and filtering logic. The user can then perform targeted clone detection for any type or kind of clone of interest. CloneWorks uses our fast and scalable partitioned partial indexes approach, which can handle any input size on an average workstation using input partitioning. CloneWorks can detect Type-3 clones in an input as large as 250 million lines of code in just four hours on an average workstation, with good recall and precision as measured by our BigCloneBench.
DOI: 10.1109/ICSE-C.2017.3
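A rough sketch, under our own simplifying assumptions, of the partitioned partial indexes strategy described above: the input is split into partitions, a partial inverted index is built for one partition at a time, every fragment is queried against it, and the index is discarded before the next partition is loaded, capping peak memory at one partition.

    import java.util.*;

    public class PartitionedDetectionSketch {
        static void detect(List<Set<String>> fragments, int partitionSize) {
            for (int start = 0; start < fragments.size(); start += partitionSize) {
                int end = Math.min(start + partitionSize, fragments.size());
                // Partial inverted index: term -> fragment ids in this partition.
                Map<String, List<Integer>> index = new HashMap<>();
                for (int i = start; i < end; i++)
                    for (String term : fragments.get(i))
                        index.computeIfAbsent(term, k -> new ArrayList<>()).add(i);
                // Every fragment in the whole input probes the partial index;
                // candidates sharing terms would then be verified by similarity.
                for (int q = 0; q < fragments.size(); q++)
                    for (String term : fragments.get(q))
                        for (int cand : index.getOrDefault(term, List.of()))
                            if (cand > q) { /* verify pair (q, cand) with a similarity check */ }
                // The index goes out of scope here, freeing memory for the next partition.
            }
        }
    }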
BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench
2016 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Jeffrey Svajlenko and Chanchal K. Roy
Many clone detection tools have been proposed in the literature. However, our knowledge of their performance in real software systems is limited, particularly their recall. We previously introduced our BigCloneBench, a big clone benchmark of over 8 million clones within a large inter-project Java repository containing 25,000 open-source Java systems. In this paper we present BigCloneEval, a framework for evaluating clone detection tools with BigCloneBench. BigCloneEval makes it very easy for clone detection researchers to evaluate and compare clone detection tools. It automates the execution and evaluation of clone detection tools against the reference clones of BigCloneBench, and summarizes recall performance from a variety of perspectives, including per clone type, and per syntactical similarity regions.
DOI: 10.1109/ICSME.2016.62
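At its core, this style of evaluation computes recall per benchmark slice: the fraction of reference clones the subject tool reported. The sketch below reduces clone matching to exact identity for brevity; BigCloneEval itself uses more forgiving clone-matching metrics, and the clone encodings here are placeholders:

    import java.util.*;

    public class RecallSketch {
        static double recall(Set<String> referenceClones, Set<String> detectedClones) {
            if (referenceClones.isEmpty()) return 1.0;
            long found = referenceClones.stream().filter(detectedClones::contains).count();
            return (double) found / referenceClones.size();
        }

        public static void main(String[] args) {
            // Reference clones sliced by clone type (toy data).
            Map<String, Set<String>> byType = Map.of(
                "Type-1", Set.of("a:b", "c:d"),
                "Type-3", Set.of("e:f", "g:h", "i:j"));
            Set<String> detected = Set.of("a:b", "c:d", "e:f");
            byType.forEach((type, ref) ->
                System.out.printf("%s recall: %.2f%n", type, recall(ref, detected)));
        }
    }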
A Machine Learning Based Approach for Evaluating Clone Detection Tools for a Generalized and Accurate Precision
International Journal of Software Engineering and Knowledge Engineering, Vol. 26, No. 09n10, 2016.
Jeffrey Svajlenko and Chanchal K. Roy
An important measure of clone detection performance is precision. However, there has been a marked lack of research into methods for efficiently and accurately measuring the precision of a clone detection tool. Instead, tool authors simply validate a small random sample of the clones their tools detected in a subject software system. Since there could be many thousands of clones reported by the tool, such a small random sample cannot guarantee an accurate and generalized measure of the tool’s precision for all the varieties of clones that can occur in any arbitrary software system. In this paper, we propose a machine-learning-based approach that can cluster similar clones together, and which can be used to maximize the variety of clones examined when measuring precision, while significantly reducing the biases a specific subject system has on the generality of the precision measured. Our technique reduces the efforts in measuring precision, while doubling the variety of clones validated and reducing biases that harm the generality of the measure by up to an order of magnitude. Our case study with the NiCad clone detector and the Java class library shows that our approach is effective in efficiently measuring an accurate and generalized precision of a subject clone detection tool.
Efficiently Measuring an Accurate and Generalized Clone Detection Precision using Clone Clustering
Proceedings of the 28th International Conference on Software Engineering and Knowledge Engineering
Jeffrey Svajlenko and Chanchal K. Roy
An important measure of clone detection performance is precision. However, there has been a marked lack of research into methods of efficiently and accurately measuring the precision of a clone detection tool. Instead, tool authors simply validate a small random sample of the clones their tools detected in a subject software system. Since there could be many thousands of clones reported by the tool, such a small random sample cannot guarantee an accurate and generalized measure of the tool's precision for all the varieties of clones that can occur in any arbitrary software system. In this paper, we propose a machine-learning based approach that can cluster similar clones together, and which can be used to maximize the variety of clones examined when measuring precision, while significantly reducing the biases a specific subject system has on the generality of the precision measured. Our technique reduces the efforts in measuring precision, while doubling the variety of clones validated and reducing biases that harm the generality of the measure by up to an order of magnitude. Our case study with the NiCad clone detector and the Java class library shows that our approach is effective in efficiently measuring an accurate and generalized precision of a subject clone detection tool.
DOI: 10.18293/SEKE2016-150
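The sampling idea can be sketched as follows (the paper's machine-learning clustering step is abstracted away here): drawing the validation sample across clusters of similar clones, rather than uniformly at random, maximizes the variety of clones a judge inspects for a fixed validation budget.

    import java.util.*;

    public class ClusterSampling {
        // Draw up to one clone from each cluster until the budget is exhausted.
        static List<String> sample(List<List<String>> clusters, int budget, long seed) {
            Random rng = new Random(seed);
            List<String> picked = new ArrayList<>();
            for (List<String> cluster : clusters) {
                if (picked.size() >= budget) break;
                picked.add(cluster.get(rng.nextInt(cluster.size())));
            }
            // Judges validate these; precision = validated true positives / picked.
            return picked;
        }
    }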
SourcererCC: Scaling Code Clone Detection to Big-Code
2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)
Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy and Cristina V. Lopes
Despite a decade of active research, there has been a marked lack in clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.
DOI: 10.1145/2884781.2884877
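The token-ordering filtering heuristic can be illustrated with a prefix-filter sketch (a simplification of the actual sub-block filtering): with tokens in a canonical global order, only a short prefix of each block needs to be indexed, since two blocks cannot reach the similarity threshold unless their prefixes share a token. Data and threshold below are illustrative:

    import java.util.*;

    public class PrefixFilterSketch {
        // For threshold t, a block of n ordered tokens needs only its first
        // n - ceil(t * n) + 1 tokens indexed to guarantee no candidate is missed.
        static List<String> prefix(List<String> sortedTokens, double t) {
            int n = sortedTokens.size();
            int len = n - (int) Math.ceil(t * n) + 1;
            return sortedTokens.subList(0, Math.min(len, n));
        }

        public static void main(String[] args) {
            Map<String, List<Integer>> index = new HashMap<>();
            List<List<String>> blocks = List.of(
                List.of("a", "b", "c", "d", "e"),
                List.of("a", "b", "c", "d", "f"));
            for (int id = 0; id < blocks.size(); id++)
                for (String tok : prefix(blocks.get(id), 0.8))
                    index.computeIfAbsent(tok, k -> new ArrayList<>()).add(id);
            System.out.println(index); // blocks collide on "a": candidate pair (0, 1)
        }
    }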
Evaluating clone detection tools with BigCloneBench
2015 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Jeffrey Svajlenko and Chanchal K. Roy
Many clone detection tools have been proposed in the literature. However, our knowledge of their performance in real software systems is limited, particularly their recall. In this paper, we use our big data clone benchmark, BigCloneBench, to evaluate the recall of ten clone detection tools. BigCloneBench is a collection of eight million validated clones within IJaDataset-2.0, a big data software repository containing 25,000 open-source Java systems. BigCloneBench contains both intra-project and inter-project clones of the four primary clone types. We use this benchmark to evaluate the recall of the tools per clone type and across the entire range of clone syntactical similarity. We evaluate the tools for both single-system and cross-project detection scenarios. Using multiple clone-matching metrics, we evaluate the quality of the tools' reporting of the benchmark clones with respect to refactoring and automatic clone analysis use-cases. We compare these real-world results against our Mutation and Injection Framework, a synthetic benchmark, to reveal deeper understanding of the tools. We found that the tools have strong recall for Type-1 and Type-2 clones, as well as Type-3 clones with high syntactical similarity. The tools have weaker detection of clones with lower syntactical similarity.
Evaluating Modern Clone Detection Tools
2014 IEEE International Conference on Software Maintenance and Evolution
Jeffrey Svajlenko and Chanchal K. Roy
Many clone detection tools and techniques have been introduced in the literature, and these tools have been used to manage clones and study their effects on software maintenance and evolution. However, the performance of these modern tools is not well known, especially recall. In this paper, we evaluate and compare the recall of eleven modern clone detection tools using four benchmark frameworks, including: (1) Bellon's Framework, (2) our modification to Bellon's Framework to improve the accuracy of its clone matching metrics, (3) Murakami et al.'s extension of Bellon's Framework which adds Type-3 gap awareness to the framework, and (4) our Mutation and Injection Framework. Bellon's Framework uses a curated corpus of manually validated clones detected by tools contemporary to 2002. In contrast, our Mutation and Injection Framework synthesizes a corpus of artificial clones using a cloning taxonomy produced in 2009. While still very popular in the clone community, there is some concern that Bellon's corpus may not be accurate for modern clone detection tools. We investigate the accuracy of the frameworks by (1) checking for anomalies in their results, (2) checking for agreement between the frameworks, and (3) checking for agreement with our expectations of these tools. Our expectations are researched and flexible. While expectations may contain inaccuracies, they are valuable for identifying possible inaccuracies in a benchmark. We find anomalies in the results of Bellon's Framework, and disagreement with both our expectations and the Mutation Framework. We conclude that Bellon's Framework may not be accurate for modern tools, and that an update of its corpus with clones detected by the modern tools is warranted. The results of the Mutation Framework agree with our expectations in most cases. We suggest that it is a good solution for evaluating modern tools.
DOI: 10.1109/ICSME.2014.54
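For context, Bellon-style benchmarks count a reference clone as detected using overlap-based clone-matching metrics. The sketch below shows our reading of the "ok" match at the commonly used 0.7 threshold; exact definitions and thresholds vary across the frameworks compared above:

    public class OkMatchSketch {
        // Fraction of line range [s1, e1] covered by line range [s2, e2].
        static double contained(int s1, int e1, int s2, int e2) {
            int overlap = Math.min(e1, e2) - Math.max(s1, s2) + 1;
            return Math.max(0, overlap) / (double) (e1 - s1 + 1);
        }

        // "ok" match: each fragment of the reference clone overlaps the
        // candidate's corresponding fragment, in at least one direction, >= 0.7.
        static boolean okMatch(int[] ref1, int[] ref2, int[] cand1, int[] cand2) {
            double m1 = Math.max(contained(ref1[0], ref1[1], cand1[0], cand1[1]),
                                 contained(cand1[0], cand1[1], ref1[0], ref1[1]));
            double m2 = Math.max(contained(ref2[0], ref2[1], cand2[0], cand2[1]),
                                 contained(cand2[0], cand2[1], ref2[0], ref2[1]));
            return Math.min(m1, m2) >= 0.7;
        }
    }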
Towards a Big Data Curated Benchmark of Inter-project Code Clones
2014 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy, Mohammad Mamun Mia
Recently, new applications of code clone detection and search have emerged that rely upon clones detected across thousands of software systems. Big data clone detection and search algorithms have been proposed as an embedded part of these new applications. However, there exists no previous benchmark data for evaluating the recall and precision of these emerging techniques. In this paper, we present a Big Data clone detection benchmark that consists of known true and false positive clones in a Big Data inter-project Java repository. The benchmark was built by mining and then manually checking clones of ten common functionalities. The benchmark contains six million true positive clones of different clone types: Type-1, Type-2, Type-3 and Type-4, including various strengths of Type-3 similarity (strong, moderate, weak). These clones were found by three judges over 216 hours of manual validation efforts. We show how the benchmark can be used to measure the recall and precision of clone detection techniques.
DOI: 10.1109/ICSME.2014.77
Big Data Clone Detection Using Classical Detectors: An Exploratory Study
Journal of Software: Evolution and Process, Volume 27, Issue 6, June 2015.
Jeffrey Svajlenko, Iman Keivanloo, Chanchal K. Roy
Big data analysis is an emerging research topic in various domains, and clone detection is no exception. The goal is to create big data inter-project clone corpora across open-source or corporate-source code repositories. Such corpora can be used to study developer behavior and to reduce engineering costs by extracting globally duplicated efforts into new APIs and as a basis for code completion and API usage support. However, building scalable clone detection tools is challenging. It is often impractical to use existing state-of-the-art tools to analyze big data because the memory and execution time required exceed the average user's resources. Some tools have inherent limitations in their data structures and algorithms that prevent the analysis of big data even when extraordinary resources are available. These limitations are impossible to overcome if the source code of the tool is unavailable or if the user lacks the time or expertise to modify the tool without harming its performance or accuracy. In this research, we have investigated the use of our shuffling framework for scaling classical clone detection tools to big data. The framework achieves scalability on commodity hardware by partitioning the input dataset into subsets manageable by the tool and computing resources. A non-deterministic process is used to randomly "shuffle" the contents of the dataset into a series of subsets. The tool is executed for each subset, and its output for each is merged into a single report. This approach does not require modification to the subject tools, allowing their individual strengths and precision to be captured at an acceptable loss of recall. In our study, we explored the performance and applicability of the framework for the big data dataset, IJaDataset 2.0, which consists of 356 million lines of code from 25,000 open-source Java projects. We begin with a computationally inexpensive version of our framework based on pure random shuffling. This version was successful at scaling the tools to IJaDataset but required many subsets to achieve a desirable recall. Using our findings, we incrementally improved the framework to achieve a satisfactory recall using fewer resources. We investigated the use of efficient file tracking and file-similarity heuristics to bias the shuffling algorithm toward subsets of the dataset that contain undetected clone pairs. These changes were successful in improving the recall performance of the framework. Our study shows that the framework is able to achieve up to 90-95% of a tool's native recall using standard hardware.
DOI: 10.1002/smr.1662
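The core shuffling step can be sketched as follows (the file-tracking and file-similarity heuristics from the study are omitted): each round randomly redistributes the dataset into subsets small enough for a classical tool, so that over enough rounds most clone pairs eventually co-occur in some subset.

    import java.util.*;

    public class ShufflingSketch {
        static List<List<String>> shuffleIntoSubsets(List<String> files, int subsetSize, Random rng) {
            List<String> copy = new ArrayList<>(files);
            Collections.shuffle(copy, rng);
            List<List<String>> subsets = new ArrayList<>();
            for (int i = 0; i < copy.size(); i += subsetSize)
                subsets.add(copy.subList(i, Math.min(i + subsetSize, copy.size())));
            return subsets;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            List<String> files = List.of("A.java", "B.java", "C.java", "D.java");
            for (int round = 0; round < 3; round++) {
                for (List<String> subset : shuffleIntoSubsets(files, 2, rng)) {
                    // Run the unmodified clone detector on `subset`,
                    // then merge its report into the global result.
                }
            }
        }
    }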
ForkSim: Generating Software Forks for Evaluating Cross-Project Similarity Analysis Tools
2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM)
Jeffrey Svajlenko, Chanchal K. Roy and Slawomir Duszynski
Software project forking, that is copying an existing project and developing a new independent project from the copy, occurs frequently in software development. Analysing the code similarities between such software projects is useful as developers can use similarity information to merge the forked systems or migrate them towards a reuse approach. Several techniques for detecting cross-project similarities have been proposed. However, no good benchmark to measure their performance is available. We developed ForkSim, a tool for generating datasets of synthetic software forks with known similarities and differences. This allows the performance of cross-project similarity tools to be measured in terms of recall and precision by comparing their output to the known properties of the generated dataset. These datasets can also be used in controlled experiments to evaluate further aspects of the tools, such as usability or visualization concepts. As a demonstration of our tool, we evaluated the performance of the clone detector NiCad for similarity detection across software forks, which showed the high potential of ForkSim.
DOI: 10.1109/SCAM.2013.6648182
A Mutation Analysis Based Benchmarking Framework for Clone Detectors
2013 7th International Workshop on Software Clones (IWSC)
Jeffrey Svajlenko, Chanchal K. Roy and James R. Cordy
In recent years, an abundant number of clone detectors have been proposed in the literature. However, most of the tool papers have lacked a solid performance evaluation of the subject tools, due both to the lack of an available and reliable benchmark, and to the manual effort required to hand-check a large number of candidate clones. In this tool demonstration paper, we show how a mutation analysis based benchmarking framework can be used by developers and researchers to evaluate clone detection tools at a fine granularity with minimal effort.
DOI: 10.1109/IWSC.2013.6613033
Scaling classical clone detection tools for ultra-large datasets: An exploratory study
2013 7th International Workshop on Software Clones (IWSC)
Jeffrey Svajlenko, Iman Keivanloo and Chanchal K. Roy
Detecting clones in large datasets is an interesting research topic for a number of reasons. However, building scalable clone detection tools is challenging, and it is often impossible to use existing state-of-the-art tools on such large datasets. In this research, we investigated the use of our Shuffling Framework for scaling classical clone detection tools to ultra-large datasets. This framework achieves scalability on standard hardware by partitioning the dataset and shuffling the partitions over a number of detection rounds. This approach does not require modification to the subject tools, which allows their individual strengths and precision to be captured at an acceptable loss of recall. In our study, we explored the performance and applicability of our framework for six clone detection tools. The clones found during our experiment were used to comment on the cloning habits of the global Java open-source development community.
DOI: 10.1109/IWSC.2013.6613037
On the Value of a Prioritization Scheme for Resolving Self-Admitted Technical Debt
Journal of Systems and Software, Volume 135, January 2018
Solomon Mensah, Jacky Keung, Jeffrey Svajlenko, Kwabena Ebo Bennin and Qing Mi
In software development, programmers tend to leave incomplete or temporary workarounds and buggy code that require rework; this pitfall is referred to as Self-admitted Technical Debt (SATD). Previous studies have shown that SATD negatively affects software projects and incurs high maintenance overheads. In this study, we introduce a prioritization scheme comprising identification, examination and rework effort estimation of prioritized tasks, in order to make a final decision prior to software release. Using the proposed prioritization scheme, we perform an exploratory analysis on four open source projects to investigate how SATD can be minimized. Four prominent causes of SATD are identified, namely code smells (23.2%), complicated and complex tasks (22.0%), inadequate code testing (21.2%) and unexpected code performance (17.4%). Results show that, among all the types of SATD, design debts on average are highly prone to software bugs across the four projects analysed. Our findings show that a rework effort of approximately 10 to 25 commented LOC per SATD source file is needed to address the highly prioritized SATD (vital few) tasks. The proposed prioritization scheme is a novel technique that will aid decision making prior to software release in an attempt to minimize high maintenance overheads.
DOI: 10.1016/j.jss.2017.09.026
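SATD identification is commonly done by matching debt-admitting patterns in source comments; the patterns below are generic examples in the spirit of prior SATD work, not the paper's exact identification step:

    import java.util.regex.*;

    public class SatdDetectorSketch {
        // Common debt-admitting keywords; real pattern sets are much larger.
        private static final Pattern SATD = Pattern.compile(
            "\\b(TODO|FIXME|HACK|XXX|workaround|temporary)\\b",
            Pattern.CASE_INSENSITIVE);

        static boolean isSatd(String comment) {
            return SATD.matcher(comment).find();
        }

        public static void main(String[] args) {
            System.out.println(isSatd("// TODO: this is a temporary hack"));  // true
            System.out.println(isSatd("// computes the weighted average"));   // false
        }
    }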