The term code clone refer to the duplicated code in a source code file. Its detection can lead to debug, to refactoring. It can be also used in order to avoid plagiarism or copyright violation.
This project was a part of an intership I’ve made during Summer 2019, in AIST Kansai. The purpose of this subject was to create a tool in order to detect the duplicated code in a source code. The first step was to detect the exact matching pairs. The structure I’m using is a suffix tree, a compressed suffix tree. The first algorithm I have implemented is drawn from a lecture note (p. 18-20) who introduce and explain the role of suffix trees in matching pairs detection. The second one is drawn from the same algorithm but with a different data structure. Finally, the last one is using another solution because the first two weren’t lived up to our expectations.
If you want to use the program or modify the code, you may need :
It’s used for only one of the implemented methods.
Just cloning the repo will do the work :
git clone https://github.com/adrien-gide/Duplifinder.git
After cloning the repo, you may want to test the program. You can create an executable who will run a few examples, according to the files in Test files. To create and run all the tests :
cd Duplifinder
make test.exe
./test.exe
The framework used for the Unit tests is Catch2. It’s simply a header and the usage in this project is very basic but you can see how it works on the git.
You can call specific sections in order to test one method in particular. There are 3 Test cases : “CST”, “ST” and “BWT”. You can call one of them like this, and it will launch only the chosen test :
./test.exe CST
There are major sections in each test case which are “1” and “MULT”. Each one of them has sub-sections : “Simple”, “Complicated”, “Lines” and “Code”.
To call the section “1” of “CST” for example :
./test.exe CST -c 1
To call the sub-section “Code” of “MULT” of “ST” for example :
./test.exe ST -c MULT -c Code
For any more advanced use of this framework or more information, check the original git.
First, you need to create the executables with :
cd Duplifinder
make all
It will create the 3 main program : cst.exe , st.exe and bwt.exe. The first two implement the same algorithm, but the first is using compressed suffix tree and the other normal suffix tree. The implementation of compressed suffix tree is the one from the SDSL-lite library and the simple suffix tree implementation is the one from adamserafini’s github. The third one is another method because the previous ones weren’t working as well as we expected. It’s using wavelet tree and Burrow’s and Wheeler’s Transform (BWT)
You can create a specific executable with this line (for example st.exe:
make st.exe
In order to use the programs, you need to be aware of the different identifier available. You need to use one of them in order to specify the kind of operation you want to do.
Hence, you have multiple uses of the executables :
.exe -s file_name (lower_bound) (upper_bound)
.exe -m nb_files file_name1 file_name2 ... (lower_bound) (upper_bound)
.exe -r directory_name (lower_bound) (upper_bound)
The attributes “lower_bound” and “upper_bound” are only working for cst.exe and st.exe. For bwt.exe, it will be “min_lgth” and “min_occ”. If you use “min_occ”, you have to also use “min_lgth” but the opposite is false.
However, the rest remain the same. For example :
.exe -m nb_files file_name1 file_name2 ... (min_lgth) (min_occ)
There are only 3 different parts :
The class Duplifinder inherit the sdsl-lite class cst_sada. It allows us to use the methods and the types provided by this class.
This project is not done and has some things who need to be finished :
The function are documented in a Doxygen generated doc (not updated).