Abstract: | Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be
seen as a single-label multi-class text categorization problem. Very often, there are extremely few training texts for at least
some of the candidate authors, or the text length varies significantly among the available training texts of the
candidate authors. Moreover, in this task there is usually no similarity between the distributions of training and test texts
over the classes, that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle
imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to
the size of the class, thus producing a fairer classification model. Hence, minority classes can be segmented into many short
samples and majority classes into fewer, longer samples. We explore text sampling methods in order to construct a training
set according to a desirable distribution over the classes. Essentially, by text sampling we provide new synthetic data
that artificially increase the training size of a class. Based on two text corpora of two languages, namely, newswire stories in
English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multi-class
imbalanced cases that reveal the properties of the presented methods. |
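
The segmentation idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `segment_by_class_size`, the parameter `k`, and the choice of whitespace-delimited words as the unit of length are all assumptions made for illustration. The sketch shows only the simplest possible rule, in which every author's concatenated training text is cut into the same number of contiguous samples, so that sample length grows with class size and every class contributes equally many training samples.

```python
from typing import Dict, List


def segment_by_class_size(
    texts_by_author: Dict[str, List[str]], k: int
) -> Dict[str, List[str]]:
    """Split each author's concatenated text into k contiguous samples.

    Hypothetical helper: an author with little text yields k short samples,
    an author with a lot of text yields k long samples.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    samples: Dict[str, List[str]] = {}
    for author, texts in texts_by_author.items():
        # Assumption: length is measured in whitespace-delimited words.
        words = " ".join(texts).split()
        n = len(words)
        if n < k:
            raise ValueError(f"author {author!r} has fewer than {k} words")

        # Integer boundaries i * n // k spread any remainder across samples.
        bounds = [i * n // k for i in range(k + 1)]
        samples[author] = [
            " ".join(words[bounds[i]:bounds[i + 1]]) for i in range(k)
        ]
    return samples
```

Under this rule the class distribution of the resulting training set is uniform regardless of how much text each author originally had. Other rules, such as fixed-length or overlapping samples, would give different trade-offs between the number of samples and their length.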