Octopus: University of The Aegean Research & Project Outputs System

Conference

Authors:	Pritsos D., Stamatatos E.
Title:	Open-Set Classification for Automated Genre Identification
Conference:	Advances in Information Retrieval - 35th European Conference on IR Research (ECIR 2013)
Editors:
Ed:	No
Eds:	No
Pages:	207-217
To appear:	No
Month:
Year:	2013
Place:
Publisher:	Springer LNCS
Link:
File name:
Abstract:	Automated Genre Identification (AGI) of web pages is a problem of increasing importance, since web genre information (e.g. blogs, news, e-shops) can enhance modern Information Retrieval (IR) systems. The state of the art in this field treats AGI as a closed-set classification problem, for which a variety of web page representations and machine learning models have been intensively studied. In this paper, we study AGI as an open-set classification problem, which better reflects the real-world conditions of exploiting AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined: one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.