Octopus: University of The Aegean Research & Project Outputs System

Conference

Authors:	Kanaris I., Stamatatos E.
Title:	Webpage Genre Identification Using Variable-length Character n-grams
Conference:	19th IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI
Editors:
Ed:	No
Eds:	No
Pages:
To appear:	No
Month:
Year:	2007
Location:
Publisher:
Link:
File name:
Abstract:	An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers, etc.). Webpage genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages, including word and HTML-tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve the best reported results in both cases.
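
Illustration:	The representation described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`char_ngrams`, `html_tag_counts`, `page_features`), the n-gram length range, and the cut-off values are all hypothetical, and plain frequency ranking stands in for the feature selection approach the paper relies on, which this record does not name.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def html_tag_counts(html):
    """Count opening HTML tags by (lower-cased) tag name."""
    # Assumption: a simple regex is enough for illustration;
    # closing tags such as </p> are deliberately not matched.
    names = re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)
    return Counter(name.lower() for name in names)


def page_features(html, n_min=3, n_max=5, top_ngrams=1000, top_tags=20):
    """Combine the most frequent n-grams and tags into one feature dict.

    Assumption: frequency ranking replaces the paper's (unnamed here)
    selection method for variable-length n-grams.
    """
    features = {}
    for gram, count in char_ngrams(html, n_min, n_max).most_common(top_ngrams):
        features["ng:" + gram] = count
    for tag, count in html_tag_counts(html).most_common(top_tags):
        features["tag:" + tag] = count
    return features
```

The resulting dictionaries could then be passed to any standard classifier; the "ng:" and "tag:" prefixes only keep the two feature families from colliding.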