Octopus: University of The Aegean Research & Project Outputs System

Journal

Authors:	Frantzeskou G., Stamatatos E., Gritzalis S., Chaski C., Howald B.
Title:	Identifying Authorship by Byte Level n-grams: The Source Code Author Profile (SCAP) Method
Journal:	International Journal of Digital Evidence
Volume:	6
Number:	1
Pages:
Year:	2007
Publisher:
To appear:	No
Link:	http://www.utica.edu/academic/institutes/ecii/publications/articles/B41158D1-C829-0387-009D214D2170C321.pdf
ISI:	No
Impact Factor:
File name:
Abstract:	Source code author identification deals with identifying the most likely author of a computer program, given a set of predefined author candidates. There are several scenarios where digital evidence of this kind plays a role in investigation and adjudication, such as code authorship disputes, intellectual property infringement, tracing the source of code left in the system after a cyber attack, and so forth. As in any identification task, the disputed program is compared to undisputed, known programming samples by the predefined author candidates. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles representing the source code author’s style. The SCAP method extends a method originally applied to natural language text authorship attribution; we show that an n-gram approach also suits the characteristics of source code analysis. The methodological extension includes a simplified profile and a less complicated, but more effective, similarity measure. Experiments on data sets in different programming languages (Java or C++), with and without comments, demonstrate the effectiveness of these extensions. The SCAP approach is programming-language independent. Moreover, the SCAP approach deals surprisingly well with cases where only a limited number of very short programs per programmer are available for training. Finally, it is also demonstrated that SCAP effectiveness persists even in the absence of comments in the source code, a condition usually met in cyber-crime cases.
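
The abstract does not spell out the profile or the similarity measure, so the following is only a minimal sketch under stated assumptions: a profile is taken to be the set of the most frequent byte-level n-grams in an author's known samples, and similarity is taken to be the number of n-grams two profiles share. The function names (`byte_ngrams`, `build_profile`, `attribute`) and the default values of `n` and `size` are hypothetical and chosen for illustration; they are not taken from this record.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte slice of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n=3, size=1500):
    """Return the set of the `size` most frequent byte n-grams in samples.

    Assumption: a profile keeps only the n-grams, not their frequencies.
    """
    counts = Counter()
    for text in samples:
        counts.update(byte_ngrams(text.encode("utf-8"), n))
    return {gram for gram, _ in counts.most_common(size)}


def attribute(disputed, known_samples, n=3, size=1500):
    """Score each candidate by profile overlap and return (best, scores).

    Assumption: similarity is the size of the intersection of two profiles.
    """
    target = build_profile([disputed], n, size)
    scores = {
        author: len(target & build_profile(samples, n, size))
        for author, samples in known_samples.items()
    }
    return max(scores, key=scores.get), scores


# Toy data, invented for illustration only.
known = {
    "alice": ["for (int i = 0; i < n; i++) { sum += a[i]; }"],
    "bob": ["int idx=0;\nwhile(idx<n)\n{\n\tsum+=a[idx];\n\tidx++;\n}"],
}
best, scores = attribute("for (int j = 0; j < m; j++) { total += b[j]; }", known)
print(best, scores)
```

Because the sketch works on raw bytes rather than language tokens, nothing in it depends on the programming language of the samples, which is consistent with the language independence claimed in the abstract. Layout habits such as spacing and brace placement contribute n-grams just as identifiers do.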