Using suffix trees and suffix arrays for duplicate code detection

Men, Huaying

Men, H. (2008). Using suffix trees and suffix arrays for duplicate code detection. https://scholar.acadiau.ca/islandora/object/theses:2756

Search for this publication on Google Scholar

Details:

Title: Using suffix trees and suffix arrays for duplicate code detection
Author: Men, Huaying
Call Number: LE3 .A278 2008
Date: 2008
Supervisor: Giles, Rick
Degree Grantor: Acadia University
Degree Name: Master of Science
Degree Level: Masters
Discipline: Computer Science
Affiliation: Computer Science
Abstract: Code duplication occurs frequently during software development and maintenance. Copy and paste makes a programmer's job easier and faster; however, duplicate code can cause serious problems. Code duplication makes maintenance more complicated because when a "bug" is found in a part of the duplicate code, all possible duplications must be checked. Even if a changed part of the duplicate code is bug-free, all the duplicates require changes, which are easily forgotten by programmers. In order to assist in the reduction of code duplication, a number of techniques for detecting code duplication have been developed. Some algorithms have exponential running time, which is impractical for detecting code duplication in large code bases. Therefore, it is necessary to find an algorithm that can detect duplicate code within a practical time frame, e.g. linear time. This thesis discusses using a suffix tree as a data structure in algorithms to check for duplicate source code. Moreover, it also discusses a suffix array as an alternative data structure to suffix tree for space improvement. During the development stage of software, it is quite common for a programmer to make some changes after code is copied and pasted. Detecting algorithms should be able to find such code duplication. There are few algorithms that can identify the duplicate code with code re-ordering. This thesis proposes a methodology to handle such challenges and implements proposed algorithms as a tool, DupDetector, which can detect syntax similarities in a set of Java source code files more efficiently than the existing algorithms.
Rights: The author retains copyright in this thesis. Any substantial copying or any other actions that exceed fair dealing or other exceptions in the Copyright Act require the permission of the author.
Permanent Link: https://scholar.acadiau.ca/islandora/object/theses:2756

Acadia Scholar

Using suffix trees and suffix arrays for duplicate code detection

Details:

Bookmarks: