Conference

Authors: Pritsos D., Stamatatos E.
Title: Open-Set Classification for Automated Genre Identification
Conference: Advances in Information Retrieval - 35th European Conference on IR Research (ECIR 2013)
Editors:
Ed: No
Eds: No
Pages: 207-217
To appear: No
Month:
Year: 2013
Place:
Publisher: Springer LNCS
Link:
File name:
Abstract: Automated Genre Identification (AGI) of web pages is a problem of increasing importance, since web genre information (e.g., blog, news, e-shop) can enhance modern Information Retrieval (IR) systems. The state of the art in this field treats AGI as a closed-set classification problem, for which a variety of web page representations and machine learning models have been intensively studied. In this paper, we study AGI as an open-set classification problem, which better reflects the real-world conditions of exploiting AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined: one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.