Using suffix trees and suffix arrays for duplicate code detection
LE3 .A278 2008
Master of Science
Code duplication occurs frequently during software development and maintenance. Copy and paste makes a programmer's job easier and faster; however, duplicate code can cause serious problems. Code duplication makes maintenance more complicated because when a "bug" is found in a part of the duplicate code, all possible duplications must be checked. Even if a changed part of the duplicate code is bug-free, all the duplicates require changes, which are easily forgotten by programmers. In order to assist in the reduction of code duplication, a number of techniques for detecting code duplication have been developed. Some algorithms have exponential running time, which is impractical for detecting code duplication in large code bases. Therefore, it is necessary to find an algorithm that can detect duplicate code within a practical time frame, e.g. linear time. This thesis discusses using a suffix tree as a data structure in algorithms to check for duplicate source code. Moreover, it also discusses a suffix array as an alternative data structure to suffix tree for space improvement. During the development stage of software, it is quite common for a programmer to make some changes after code is copied and pasted. Detecting algorithms should be able to find such code duplication. There are few algorithms that can identify the duplicate code with code re-ordering. This thesis proposes a methodology to handle such challenges and implements proposed algorithms as a tool, DupDetector, which can detect syntax similarities in a set of Java source code files more efficiently than the existing algorithms.
The author grants permission to the University Librarian at Acadia University to reproduce, loan or distribute copies of my thesis in microform, paper or electronic formats on a non-profit basis. The author retains the copyright of the thesis.