Syntax tree fingerprinting: a foundation for source code similarity detection

Abstract: Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t. source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

Cited literature: 36 references
Contributor: Etienne Duris
Submitted on: Thursday, 29 September 2011 - 16:07:23
Last modified on: Wednesday, 4 July 2018 - 16:37:54
Document(s) archived on: Tuesday, 13 November 2012 - 14:50:40


Files produced by the author(s)


  • HAL Id: hal-00627811, version 1


Michel Chilowicz, Étienne Duris, Gilles Roussel. Syntax tree fingerprinting: a foundation for source code similarity detection. 2009. 〈hal-00627811〉


