Authors: | Kanaris I., Kanaris K., Stamatatos E. |
---|
Title: | Spam Detection Using Character N-grams |
---|
Conference: | 4th Hellenic Conference on AI (SETN 2006): Advances in Artificial Intelligence |
---|
Editors: | G. Antoniou, G. Potamias, C. Spyropoulos, D. Plexousakis |
---|
Ed: | No |
---|
Eds: | Yes |
---|
Pages: | 95–104 |
---|
To appear: | No |
---|
Month: | |
---|
Year: | 2006 |
---|
Location: | |
---|
Publisher: | |
---|
Link: | |
---|
File name: | |
---|
Abstract: | This paper presents a content-based approach to spam detection
based on low-level information. Instead of the traditional 'bag of words' representation,
we use a 'bag of character n-grams' representation, which avoids the
sparse data problem that arises with word-level n-grams. Moreover, it is
language-independent and does not require a lemmatizer or any 'deep' text preprocessing.
Based on experiments on the Ling-Spam corpus, we evaluate the proposed
representation in combination with support vector machines. Both binary
and term-frequency representations achieve high precision while maintaining
recall at an equally high level, which is a crucial factor for anti-spam filters, a
cost-sensitive application. |
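
The 'bag of character n-grams' representation named in the abstract can be illustrated with a minimal sketch. The function names `char_ngrams` and `bag_of_ngrams`, the choice of n = 3, and the sample string are assumptions made for illustration only; they are not taken from the paper, and the support vector machine step is omitted.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order.

    Hypothetical helper: n = 3 is an illustrative default, not a value
    reported in the abstract.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_ngrams(text, n=3, binary=False):
    """Build a bag of character n-grams for one message.

    binary=False gives a term-frequency style representation (raw counts);
    binary=True records only presence (1) of each n-gram.
    """
    counts = Counter(char_ngrams(text, n))
    if binary:
        return {gram: 1 for gram in counts}
    return dict(counts)


# Example message (made up for illustration).
message = "free money, free offer"
tf_features = bag_of_ngrams(message, n=3)
binary_features = bag_of_ngrams(message, n=3, binary=True)
```

Because the n-grams are taken directly from the raw character sequence, no tokenizer or lemmatizer is involved, which is the property the abstract describes as language independence. The resulting dictionaries would still need to be mapped to fixed-length vectors before being passed to a classifier.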