Authors: | Kanaris I., Kanaris K., Houvardas J., Stamatatos E. |
---|
Title: | Words vs. Character N-grams for Anti-spam Filtering |
---|
Journal: | Int. Journal on Artificial Intelligence Tools |
---|
Volume: | 16 |
---|
Number: | 6 |
---|
Pages: | 1047-1067 |
---|
Year: | 2007 |
---|
Publisher: | World Scientific |
---|
To appear: | No |
---|
Link: | http://dx.doi.org/10.1142/S0218213007003692 |
---|
ISI: | No |
---|
Abstract: | The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation. |
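The contrast the abstract draws between word tokens and character n-grams can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the fixed n-gram length of 3, the `overlap` measure, and the example strings are all assumptions chosen for illustration, and the paper's variable-length n-gram extraction method is not reproduced here.

```python
from collections import Counter


def word_tokens(text):
    """Bag of words from a naive whitespace tokenizer (an assumption, not the paper's tokenizer)."""
    return Counter(text.lower().split())


def char_ngrams(text, n=3):
    """Bag of overlapping character n-grams; needs no tokenizer at all."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def overlap(a, b):
    """Fraction of the distinct features in `a` that also occur in `b` (hypothetical helper)."""
    if not a:
        return 0.0
    return sum(1 for feature in a if feature in b) / len(a)


# Hypothetical example: the same phrase, plain and with inserted punctuation.
clean = "cheap viagra online"
obfuscated = "che.ap v-iagra onl!ne"

word_overlap = overlap(word_tokens(clean), word_tokens(obfuscated))
char_overlap = overlap(char_ngrams(clean), char_ngrams(obfuscated))

print(f"word-token overlap:    {word_overlap:.2f}")
print(f"character 3-gram overlap: {char_overlap:.2f}")
```

In this toy case the inserted punctuation changes every word token, so none of the original words survive, while more than half of the character 3-grams of the clean phrase still appear in the obfuscated one. The price is a larger feature space, which is the dimensionality increase the abstract mentions.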