Abstract: | An important factor for discriminating between
webpages is their genre (e.g., blogs, personal homepages,
e-shops, online newspapers, etc.). Webpage genre
identification has great potential in information
retrieval since users of search engines can combine
genre-based and traditional topic-based queries to
improve the quality of the results. So far, various features
have been proposed to quantify the style of webpages
including word and HTML-tag frequencies. In this paper,
we propose a low-level representation for this problem
based on character n-grams. Using an existing approach,
we produce feature sets of variable-length character n-grams
and combine this representation with information
about the most frequent HTML tags. Based on two
benchmark corpora, we present webpage genre
identification experiments and improve the best reported
results in both cases. |
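
As a rough illustration of the representation described in the abstract, the sketch below counts character n-grams over a page's visible text and combines them with counts of its most frequent HTML tags. It is not the paper's method: the abstract does not name the existing approach used to select variable-length n-grams, so this sketch simply takes every n-gram of length 3 to 5 and keeps the most frequent ones per page. The function names (`char_ngrams`, `tag_counts`, `webpage_features`), the length range, the cut-offs, and the regex-based tag handling are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical helper pattern: matches the name of an opening tag such as <p> or <div class="x">.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram of length n_min..n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def tag_counts(html):
    """Count opening HTML tags, case-insensitively."""
    return Counter(tag.lower() for tag in TAG_RE.findall(html))


def strip_tags(html):
    """Crude tag removal; a real pipeline would likely use an HTML parser."""
    return re.sub(r"<[^>]*>", " ", html)


def webpage_features(html, top_ngrams=1000, top_tags=50):
    """Combine frequent character n-grams with frequent HTML tags.

    Assumption: frequency cut-offs are applied per page for brevity;
    selecting features over a whole corpus would be the usual choice.
    """
    features = {}
    for gram, count in char_ngrams(strip_tags(html)).most_common(top_ngrams):
        features["ng:" + gram] = count
    for tag, count in tag_counts(html).most_common(top_tags):
        features["tag:" + tag] = count
    return features
```

The resulting dictionary could be passed to any standard classifier after vectorization; the prefixes `ng:` and `tag:` only keep the two feature families from colliding.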