Authors: | Kanaris I., Kanaris K., Stamatatos E. |
---|
Title: | Spam Detection Using Character N-grams |
---|
Conference: | 4th Hellenic Conference on AI (SETN 2006): Advances in Artificial Intelligence |
---|
Editors: | G. Antoniou, G. Potamias, C. Spyropoulos, D. Plexousakis |
---|
Ed: | No |
---|
Eds: | Yes |
---|
Pages: | 95–104 |
---|
To appear: | No |
---|
Month: | |
---|
Year: | 2006 |
---|
Place: | |
---|
Publisher: | |
---|
Link: | |
---|
File name: | |
---|
Abstract: | This paper presents a content-based approach to spam detection
based on low-level information. Instead of the traditional 'bag of words' representation,
we use a 'bag of character n-grams' representation, which avoids the
sparse data problem that arises with word-level n-grams. Moreover, it is
language-independent and does not require any lemmatizer or 'deep' text preprocessing.
We evaluate the proposed representation in combination with support vector
machines through experiments on the Ling-Spam corpus. Both binary
and term-frequency representations achieve high precision while maintaining
recall at an equally high level, which is a crucial factor for anti-spam filters, a
cost-sensitive application. |
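
The 'bag of character n-grams' idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the default n-gram length of 3, and the use of raw counts as the term-frequency weights are all assumptions made for illustration. The support vector machine step is omitted; the resulting vectors would be passed to any SVM library.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping character n-gram of `text`, in order.

    No tokenization or lemmatization is applied; spaces and punctuation
    are treated like any other character. n=3 is an assumed default.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_vector(text, n=3):
    """Term-frequency representation: n-gram -> raw count (assumed weighting)."""
    return dict(Counter(char_ngrams(text, n)))


def binary_vector(text, n=3):
    """Binary representation: n-gram -> 1 if it occurs at least once."""
    return {gram: 1 for gram in set(char_ngrams(text, n))}


if __name__ == "__main__":
    message = "free free"  # hypothetical message text
    print(char_ngrams(message))
    print(tf_vector(message))
    print(binary_vector(message))
```

Because the features are drawn from raw character sequences, the same code applies unchanged to text in any language, which is the language-independence property the abstract describes.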