Abstract: | Source code author identification deals with identifying the most likely author of a computer
program, given a set of predefined author candidates. There are several scenarios where
digital evidence of this kind plays a role in investigation and adjudication, such as code
authorship disputes, intellectual property infringement, tracing the source of code left in the
system after a cyber attack, and so forth. As in any identification task, the disputed program is
compared to undisputed, known programming samples by the predefined author candidates.
We present a new approach, called the SCAP (Source Code Author Profiles) approach, based
on byte-level n-gram profiles representing the source code author’s style. The SCAP method
extends a method originally applied to natural language text authorship attribution; we show
that an n-gram approach also suits the characteristics of source code analysis. The
methodological extension includes a simplified profile and a less complicated, but more
effective, similarity measure. Experiments on data sets in different programming languages
(Java or C++), with and without comments, demonstrate the effectiveness of these
extensions. The SCAP approach is programming-language independent. Moreover, the SCAP
approach deals surprisingly well with cases where only a limited number of very short
programs per programmer are available for training. Finally, it is also demonstrated that SCAP
effectiveness persists even in the absence of comments in the source code, a condition
usually met in cyber-crime cases. |
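
The abstract does not spell out how a profile is built or how two profiles are compared, so the following Python sketch is only an illustration under stated assumptions: a profile is taken to be the set of the most frequent byte-level n-grams in an author's known code, and similarity is taken to be the number of n-grams two profiles share. The function names (`build_profile`, `similarity`, `attribute`) and the defaults for `n` and `profile_size` are hypothetical and not taken from the paper.

```python
from collections import Counter


def build_profile(source: bytes, n: int = 3, profile_size: int = 1500) -> set:
    """Return the set of the `profile_size` most frequent byte-level n-grams.

    Assumption: a "simplified profile" keeps only which n-grams occur most
    often and discards their frequencies.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_size)}


def similarity(profile_a: set, profile_b: set) -> int:
    """Assumed similarity measure: number of n-grams common to both profiles."""
    return len(profile_a & profile_b)


def attribute(disputed: bytes, known_samples: dict, n: int = 3,
              profile_size: int = 1500) -> str:
    """Return the candidate author whose profile best matches the disputed code.

    `known_samples` maps an author name to a list of that author's programs
    (as bytes); the programs are concatenated before profiling.
    """
    disputed_profile = build_profile(disputed, n, profile_size)
    scores = {
        author: similarity(
            build_profile(b"\n".join(samples), n, profile_size),
            disputed_profile,
        )
        for author, samples in known_samples.items()
    }
    return max(scores, key=scores.get)
```

Because the sketch operates on raw bytes and never parses the code, it is indifferent to the programming language, which matches the language independence claimed above. A call might look like `attribute(disputed_bytes, {"alice": [prog1, prog2], "bob": [prog3]})`.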