Syntax tree fingerprinting: a foundation for source code similarity detection

Abstract : Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t. source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

Cited literature [36 references]

https://hal-upec-upem.archives-ouvertes.fr/hal-00627811
Contributor : Etienne Duris
Submitted on : Thursday, September 29, 2011 - 4:07:23 PM
Last modification on : Wednesday, July 4, 2018 - 4:37:54 PM
Long-term archiving on : Tuesday, November 13, 2012 - 2:50:40 PM

File

HAL.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00627811, version 1

Citation

Michel Chilowicz, Étienne Duris, Gilles Roussel. Syntax tree fingerprinting: a foundation for source code similarity detection. 2009. ⟨hal-00627811⟩
