Syntax tree fingerprinting: a foundation for source code similarity detection

Abstract: Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t. source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
Document type:
Report
2009
Cited literature [36 references]

https://hal-upec-upem.archives-ouvertes.fr/hal-00627811
Contributor: Etienne Duris
Submitted on: Thursday, September 29, 2011 - 16:07:23
Last modified on: Wednesday, July 4, 2018 - 16:37:54
Document(s) archived on: Tuesday, November 13, 2012 - 14:50:40

File

HAL.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00627811, version 1

Citation

Michel Chilowicz, Étienne Duris, Gilles Roussel. Syntax tree fingerprinting: a foundation for source code similarity detection. 2009. 〈hal-00627811〉

Metrics

Record views

437

File downloads

2505