Abstract: | Authorship identification can be seen as a single-label
multi-class text categorization problem. Very often, extremely few
training texts are available for at least some of the candidate
authors. In this paper, we present methods to handle imbalanced
multi-class textual datasets. The main idea is to segment the
training texts into sub-samples according to the size of the class.
Hence, minority classes can be segmented into many short samples
and majority classes into fewer but longer samples. Moreover, we
explore text re-sampling in order to construct a training set
according to a desirable distribution over the classes. Essentially,
text re-sampling can be viewed as providing new synthetic data that
increase the training size of a class. Based on a corpus of English
newswire stories, we present authorship identification experiments
on various multi-class imbalanced cases. |
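
The segmentation and re-sampling ideas summarized above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function names (`segment_class`, `balanced_segments`, `resample`), the word-level tokenization by whitespace, the choice of giving every class the same number of samples, and the random-window re-sampling are all assumptions made for illustration.

```python
import random


def segment_class(words, n_samples):
    """Split one class's word list into n_samples contiguous, near-equal chunks."""
    # Assumption: never create more chunks than there are words.
    n_samples = max(1, min(n_samples, len(words)))
    base, extra = divmod(len(words), n_samples)
    chunks, start = [], 0
    for i in range(n_samples):
        # The first `extra` chunks get one additional word each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks


def balanced_segments(texts_by_author, n_samples):
    """Give every author the same number of samples (illustrative choice).

    Sample length then scales with the amount of training text per author:
    short samples for minority classes, longer ones for majority classes.
    """
    segments = {}
    for author, texts in texts_by_author.items():
        words = " ".join(texts).split()  # naive whitespace tokenization
        segments[author] = segment_class(words, n_samples)
    return segments


def resample(words, n_samples, sample_len, seed=0):
    """Draw possibly overlapping windows of text to enlarge a class (hypothetical scheme)."""
    rng = random.Random(seed)
    sample_len = min(sample_len, len(words))
    samples = []
    for _ in range(n_samples):
        start = rng.randrange(len(words) - sample_len + 1)
        samples.append(words[start:start + sample_len])
    return samples


# Toy corpus: author "A" has twice as much training text as author "B".
corpus = {
    "A": ["one two three four", "five six seven eight"],
    "B": ["alpha beta gamma delta"],
}
segs = balanced_segments(corpus, n_samples=4)
extra_b = resample(" ".join(corpus["B"]).split(), n_samples=3, sample_len=2)
```

With this toy corpus, both authors end up with four samples, but the samples of "A" are two words long and those of "B" one word long, while `extra_b` holds three additional two-word windows drawn from the text of "B". Other ways of tying the number and length of samples to class size are equally compatible with the description above.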