Authors: | Houvardas J., Stamatatos E. |
---|
Title: | N-gram Feature Selection for Authorship Identification |
---|
Conference: | 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications (AIMSA) |
---|
Editors: | J. Euzenat and J. Domingue |
---|
Ed: | No |
---|
Eds: | Yes |
---|
Pages: | 77-86 |
---|
To appear: | No |
---|
Month: | |
---|
Year: | 2006 |
---|
Place: | |
---|
Publisher: | |
---|
Link: | |
---|
File name: | |
---|
Abstract: | Automatic authorship identification offers a valuable tool for
supporting crime investigation and security. It can be seen as a multi-class,
single-label text categorization task. Character n-grams are a very successful
approach to representing text for stylistic purposes since they are able to capture
nuances at the lexical, syntactic, and structural levels. So far, character n-grams of
fixed length have been used for authorship identification. In this paper, we
propose a variable-length n-gram approach inspired by previous work for
selecting variable-length word sequences. Using a subset of the new Reuters
corpus, consisting of texts on the same topic by 50 different authors, we show
that the proposed approach is at least as effective as information gain for
selecting the most significant n-grams although the feature sets produced by the
two methods have few common members. Moreover, we explore the
significance of digits for distinguishing between authors showing that an
increase in performance can be achieved using simple text pre-processing. |
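
As a rough illustration of the representation the abstract describes, the sketch below counts character n-grams over a range of lengths after an optional digit-masking step. It is not the paper's feature selection method: the function names `mask_digits` and `char_ngrams`, the `@` placeholder, and the 3 to 5 length range are all assumptions chosen for illustration, and the abstract itself only says that "simple text pre-processing" of digits was used.

```python
import re
from collections import Counter


def mask_digits(text, symbol="@"):
    """Replace every digit with one placeholder symbol.

    Assumed pre-processing step: the abstract does not say how digits
    were handled, so this is only one plausible choice.
    """
    return re.sub(r"\d", symbol, text)


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical sample sentence, not taken from the Reuters corpus.
sample = "Shares rose 12% in 1996."
profile = char_ngrams(mask_digits(sample))
```

With masking, different numbers that share the same digit pattern map to the same n-grams, so the profile reflects how an author formats numbers rather than which numbers appear. Selecting the most significant n-grams (for example by information gain, as the abstract mentions) would be a separate step applied to profiles like this one.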