Abstract: | The vast amount of user-generated content on the Web has
increased the need for automatic processing of web
page content. The segmentation of web
pages and noise (non-informative segment) removal are important
pre-processing steps in a variety of applications such
as sentiment analysis, text summarization and information
retrieval. Currently, these two tasks tend to be handled either
separately or jointly, without accounting for the diversity
of web corpora or for web page type detection.
We present a unified approach that is able to provide robust
identification of informative textual parts in web pages
along with accurate type detection. The proposed algorithm
takes into account visual and non-visual characteristics of a
web page and is able to remove noisy parts from three major
categories of pages which contain user-generated content
(News, Blogs, Discussions). Based on a human annotated
corpus consisting of diverse topics, domains and templates,
we demonstrate the learning abilities of our algorithm and
examine its effectiveness in extracting the informative textual
parts, as well as its use as a rule-based classifier for web
page type detection in a realistic web setting. |