Octopus: University of The Aegean Research & Project Outputs System

Conference

Authors:	Stamatatos E.
Title:	Ensemble-based Author Identification Using Character N-grams
Conference:	3rd Int. Workshop on Text-based Information Retrieval (TIR)
Editors:
Ed:	No
Eds:	No
Pages:	41-46
To appear:	No
Month:
Year:	2006
Location:
Publisher:
Link:
File name:
Abstract:	This paper addresses the problem of identifying the most likely author of a text. Several thousand character n-grams, rather than lexical or syntactic information, are used to represent the style of a text. The author identification task can thus be viewed as a single-label multiclass classification problem with a high-dimensional feature space and sparse data. To cope with these properties, we propose a learning ensemble based on feature set subspacing. Performance results on two well-tested benchmark text corpora for author identification show that this classification scheme is quite effective, significantly improving on the best results reported so far. Additionally, the approach proves to be quite stable in comparison with support vector machines when a limited number of training texts is used, a condition usually met in this kind of problem.
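
The general idea of feature set subspacing over character n-grams can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the base learner, the way the feature set is partitioned, or the combination rule. The sketch assumes a random disjoint partition of the n-gram vocabulary, a nearest-centroid base classifier with cosine similarity, and majority voting. All function names, parameters, and default values below are hypothetical.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_vocabulary(texts, n=3, size=200):
    """Keep the `size` most frequent n-grams over the training texts."""
    total = Counter()
    for text in texts:
        total.update(char_ngrams(text, n))
    return [gram for gram, _ in total.most_common(size)]


def vectorize(text, features, n=3):
    """Relative frequency of each selected n-gram in the text."""
    counts = char_ngrams(text, n)
    length = max(sum(counts.values()), 1)
    return [counts[gram] / length for gram in features]


def split_features(vocab, n_subspaces, seed=0):
    """Randomly partition the feature set into disjoint subsets (assumption)."""
    shuffled = list(vocab)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subspaces] for i in range(n_subspaces)]


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidClassifier:
    """Stand-in base learner that sees only one feature subspace."""

    def __init__(self, features, n=3):
        self.features = features
        self.n = n
        self.centroids = {}

    def fit(self, texts, labels):
        sums, counts = {}, Counter()
        for text, label in zip(texts, labels):
            vec = vectorize(text, self.features, self.n)
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [value / counts[label] for value in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, text):
        vec = vectorize(text, self.features, self.n)
        # Ties are broken by label order, since max() keeps the first maximum.
        return max(sorted(self.centroids),
                   key=lambda label: cosine(vec, self.centroids[label]))


def train_ensemble(texts, labels, n=3, vocab_size=200, n_subspaces=4, seed=0):
    """Train one base classifier per disjoint feature subspace."""
    vocab = build_vocabulary(texts, n, vocab_size)
    subspaces = split_features(vocab, n_subspaces, seed)
    return [CentroidClassifier(sub, n).fit(texts, labels) for sub in subspaces]


def predict_author(ensemble, text):
    """Combine the base predictions by majority vote (assumption)."""
    votes = Counter(clf.predict(text) for clf in ensemble)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train_texts = ["aaaa aaaa aaaa", "aaaa aaaa", "zzzz zzzz zzzz", "zzzz zzzz"]
    train_labels = ["A", "A", "B", "B"]
    ensemble = train_ensemble(train_texts, train_labels)
    print(predict_author(ensemble, "aaaa aaaa"))
```

Each base classifier works in a much lower-dimensional space than the full n-gram vocabulary, which is the motivation for subspacing when training texts are few. A real experiment would replace the toy texts, the centroid learner, and the voting rule with choices validated on actual corpora.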