Abstract: | A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast
majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, it is doubtful whether such a simple, low-level representation is equally effective under realistic conditions where some of these factors cannot be kept stable. In this study, the robustness of
authorship attribution based on character n-gram features is tested under cross-genre and cross-topic conditions. In addition, the distribution of texts over the candidate authors differs between the
training and test corpora to imitate real cases. Comparative results against a competitive alternative text representation based on very frequent words show that character n-grams are better able to capture the stylistic properties of text when there are significant differences between the training and test corpora. Moreover, a set of guidelines for tuning an authorship attribution model according to the properties of the training and test corpora is
provided. |