Authors: | Frantzeskou G., Stamatatos E., Gritzalis S., Katsikas S. |
---|
Title: | Source Code Authorship Analysis using N-grams |
---|
Conference: | AIAI 2006 3rd IFIP Conference on Artificial Intelligence Applications and Innovations |
---|
Editors: | M. Bramer, I. Maglogiannis |
---|
Ed: | No |
---|
Eds: | Yes |
---|
Pages: | 508-515 |
---|
To appear: | No |
---|
Month: | June |
---|
Year: | 2006 |
---|
Location: | Athens, Greece |
---|
Publisher: | Springer |
---|
Link: | https://www.utica.edu/academic/institutes/ecii/publications/articles/B41158D1-C829-0387-009D214D2170C321.pdf |
---|
File name: | |
---|
Abstract: | Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited number of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases. |
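
The idea of a byte-level n-gram author profile can be illustrated with a minimal Python sketch. The abstract does not state how profiles are compared or which parameters are used, so the following are assumptions made for illustration only: the function names `byte_ngram_profile` and `most_likely_author`, the defaults `n=3` and `size=1500`, and scoring by the number of n-grams shared between two profiles.

```python
from collections import Counter


def byte_ngram_profile(source: bytes, n: int = 3, size: int = 1500) -> set[bytes]:
    """Return the `size` most frequent byte-level n-grams of `source`.

    `n` and `size` are illustrative defaults, not values taken from the abstract.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def most_likely_author(samples: dict[str, list[bytes]], disputed: bytes,
                       n: int = 3, size: int = 1500) -> str:
    """Pick the candidate whose profile shares the most n-grams with `disputed`.

    Assumption: similarity is the size of the intersection of the two profiles.
    Each author's known programs are joined into one training text.
    """
    target = byte_ngram_profile(disputed, n, size)
    scores = {
        author: len(target & byte_ngram_profile(b"\n".join(programs), n, size))
        for author, programs in samples.items()
    }
    return max(scores, key=scores.get)


# Toy usage: two candidate authors with visibly different formatting habits.
samples = {
    "alice": [b"for (int i = 0; i < n; i++) { sum += a[i]; }"],
    "bob": [b"while(k<n)\n{\n\ttotal=total+arr[k];\n\tk=k+1;\n}"],
}
disputed = b"for (int j = 0; j < m; j++) { acc += b[j]; }"
guess = most_likely_author(samples, disputed)
```

Because the profile is built from raw bytes, the sketch needs no tokenizer or parser for Java or C++, which is consistent with the language independence claimed in the abstract.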