Octopus: University of The Aegean Research & Project Outputs System

Conference

Authors:	Stamatatos E.
Title:	Ensemble-based Author Identification Using Character N-grams
Conference:	3rd Int. Workshop on Text-based Information Retrieval (TIR)
Editors:
Ed:	No
Eds:	No
Pages:	41-46
To appear:	No
Month:
Year:	2006
Place:
Publisher:
Link:
File name:
Abstract:	This paper deals with the problem of identifying the most likely author of a text. Several thousand character n-grams, rather than lexical or syntactic information, are used to represent the style of a text. Thus, the author identification task can be viewed as a single-label multiclass classification problem with a high-dimensional feature space and sparse data. To cope with these properties, we propose a suitable learning ensemble based on feature set subspacing. Performance results on two well-tested benchmark text corpora for author identification show that this classification scheme is quite effective, significantly improving on the best results reported so far. Additionally, this approach is shown to be quite stable in comparison with support vector machines when only a limited number of training texts is available, a condition usually met in this kind of problem.
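
Illustration:	The general idea in the abstract (character n-gram features, split into feature subsets, one base classifier per subset, predictions combined) can be sketched as follows. This is only a minimal sketch and not the paper's implementation: the abstract does not name the base learner, the subset size, or the combination rule, so the nearest-centroid classifier, the disjoint equal-size split, the majority vote, and all function names below are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_vocabulary(texts, n=3, size=1000):
    """Keep the `size` most frequent n-grams over all training texts."""
    totals = Counter()
    for text in texts:
        totals.update(char_ngrams(text, n))
    return [gram for gram, _ in totals.most_common(size)]


def vectorize(text, vocab, n=3):
    """Relative frequency of each vocabulary n-gram in the text."""
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return [counts[gram] / total for gram in vocab]


def split_features(num_features, subset_size):
    """Partition feature indices into disjoint subsets (assumed scheme)."""
    idx = list(range(num_features))
    return [idx[i:i + subset_size] for i in range(0, num_features, subset_size)]


def train_centroids(vectors, labels, subset):
    """Base learner (assumed): per-author mean vector on one feature subset."""
    sums, counts = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(subset))
        for j, f in enumerate(subset):
            acc[j] += vec[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def train_ensemble(texts, labels, n=3, vocab_size=1000, subset_size=50):
    """Train one nearest-centroid model per disjoint feature subset."""
    vocab = build_vocabulary(texts, n, vocab_size)
    vectors = [vectorize(text, vocab, n) for text in texts]
    subsets = split_features(len(vocab), subset_size)
    models = [(s, train_centroids(vectors, labels, s)) for s in subsets]
    return vocab, models


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(text, vocab, models, n=3):
    """Combine base predictions by majority vote (assumed combination rule)."""
    vec = vectorize(text, vocab, n)
    votes = Counter()
    for subset, centroids in models:
        x = [vec[f] for f in subset]
        votes[min(centroids, key=lambda a: sq_dist(x, centroids[a]))] += 1
    return votes.most_common(1)[0][0]
```

Each base model sees only a small slice of the n-gram vocabulary, which keeps its input dimensionality low relative to the number of training texts; that is the motivation for subspacing suggested by the abstract. A call such as train_ensemble(texts, authors) followed by predict(new_text, vocab, models) would return the author label with the most votes.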