Abstract: | Automated Genre Identification (AGI) of web pages is a
problem of increasing importance, since web genre information (e.g. blog, news,
e-shop) can enhance modern Information Retrieval (IR)
systems. The state of the art in this field treats AGI as a closed-set
classification problem, for which a variety of web page representations and machine
learning models have been intensively studied. In this paper, we study
AGI as an open-set classification problem, which better reflects the
real-world conditions under which AGI is applied. Focusing on the use
of content information, different text representation methods (words and
character n-grams) are tested. Moreover, two classification methods are
examined: one-class SVM learners, used as a baseline, and an ensemble
of classifiers based on random feature subspacing, originally proposed for
author identification. It is demonstrated that very high precision can be
achieved in open-set AGI while recall remains relatively high. |
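
To make the open-set setting concrete, the following is a minimal sketch and not the method evaluated in the paper. It swaps in a much simpler technique: each genre gets a character n-gram profile, and a page is labelled "unknown" when its best cosine similarity falls below a threshold. The function names, the toy pages, the trigram size, and the threshold of 0.5 are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(pages_by_genre, n=3):
    """Build one n-gram profile per known genre by summing page counts."""
    profiles = {}
    for genre, pages in pages_by_genre.items():
        profile = Counter()
        for page in pages:
            profile.update(char_ngrams(page, n))
        profiles[genre] = profile
    return profiles


def predict_open_set(profiles, page, threshold=0.5, n=3):
    """Return the best-matching genre, or "unknown" if no genre is close enough.

    A closed-set classifier would always return the best genre; the
    threshold is what lets the model abstain on pages of unseen genres.
    """
    grams = char_ngrams(page, n)
    scores = {genre: cosine(grams, profile) for genre, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


# Hypothetical toy data, for illustration only.
profiles = train({
    "blog": ["posted by admin leave a comment"],
    "eshop": ["add to cart checkout price"],
})
print(predict_open_set(profiles, "posted by admin leave a comment"))
print(predict_open_set(profiles, "zzzz qqqq xxxx"))
```

Raising the threshold makes the model abstain more often, which tends to trade recall for precision, the same trade-off the abstract reports on.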