Abstract: | This paper deals with the problem of identifying the
most likely author of a text. Several thousand character n-grams,
rather than lexical or syntactic information, are used to represent the
style of a text. Thus, the author identification task can be viewed as
a single-label multiclass classification problem with a high-dimensional
feature space and sparse data. To cope with these properties,
we propose a suitable learning ensemble based on feature set
subspacing. Performance results on two well-tested benchmark text
corpora for author identification show that this classification
scheme is quite effective, significantly improving on the best results
reported so far. Additionally, this approach is shown to be quite
stable in comparison with support vector machines when only a
limited number of training texts is available, a condition commonly
met in this kind of problem. |
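
As a rough illustration of the idea summarized above, the following Python sketch represents texts by character n-gram frequencies, partitions the feature set into subspaces, trains one base classifier per subspace, and combines their outputs. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the abstract does not specify the base learner, the subspace construction, or the combination rule, so the nearest-centroid learner, the disjoint equal-size split, the averaged scores, and every name used here (`char_ngrams`, `split_subspaces`, `CentroidClassifier`, `SubspaceEnsemble`) are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_vocabulary(texts, n=3, size=1000):
    """Keep the `size` most frequent n-grams of the training texts."""
    total = Counter()
    for text in texts:
        total.update(char_ngrams(text, n))
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def vectorize(text, vocab, n=3):
    """Relative frequency of each vocabulary n-gram in the text."""
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1
    return [counts[gram] / total for gram in vocab]


def split_subspaces(num_features, m):
    """Partition feature indices into disjoint chunks of at most m (assumed scheme)."""
    idx = list(range(num_features))
    return [idx[i:i + m] for i in range(0, num_features, m)]


class CentroidClassifier:
    """Toy base learner (an assumption): nearest class centroid on a feature subset."""

    def __init__(self, feature_idx):
        self.idx = feature_idx
        self.centroids = {}

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(self.idx))
            for j, f in enumerate(self.idx):
                acc[j] += x[f]
        self.centroids = {c: [v / counts[c] for v in acc] for c, acc in sums.items()}
        return self

    def scores(self, x):
        """Per-class scores in (0, 1] that sum to 1; higher means closer."""
        raw = {}
        for label, centroid in self.centroids.items():
            dist = math.sqrt(sum((x[f] - c) ** 2 for f, c in zip(self.idx, centroid)))
            raw[label] = math.exp(-dist)
        z = sum(raw.values())
        return {label: v / z for label, v in raw.items()}


class SubspaceEnsemble:
    """One base classifier per feature subspace; scores are summed (mean rule)."""

    def __init__(self, m):
        self.m = m
        self.members = []

    def fit(self, X, y):
        self.members = [
            CentroidClassifier(idx).fit(X, y)
            for idx in split_subspaces(len(X[0]), self.m)
        ]
        return self

    def predict(self, x):
        totals = Counter()
        for member in self.members:
            for label, score in member.scores(x).items():
                totals[label] += score
        # Iterate labels in sorted order so ties break deterministically.
        return max(sorted(totals), key=totals.get)
```

To use the sketch, build a vocabulary from the training texts, convert each text with `vectorize`, fit a `SubspaceEnsemble(m)` on the resulting vectors and author labels, and call `predict` on the vector of an unseen text. The subspace size `m` controls how many features each base classifier sees, which is the knob that keeps each learner's input low-dimensional even when the full feature set has thousands of n-grams.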