Conference

Authors: Kanaris I., Kanaris K., Stamatatos E.
Title: Spam Detection Using Character N-grams
Conference: 4th Hellenic Conference on AI (SETN 2006): Advances in Artificial Intelligence
Editors: G. Antoniou, G. Potamias, C. Spyropoulos, D. Plexousakis
Ed: No
Eds: Yes
Pages: 95–104
To appear: No
Month:
Year: 2006
Location:
Publisher:
Link:
File name:
Abstract: This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both binary and term-frequency representations achieve high precision while maintaining recall at an equally high level, which is a crucial factor for anti-spam filters, a cost-sensitive application.
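
The 'bag of character n-grams' representation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`char_ngrams`, `build_vocabulary`, `vectorize`), the choice of n = 3, and the use of raw counts as the term-frequency weight are all assumptions made for illustration, and the support vector machine training step is not shown.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping character n-gram of `text` (n=3 is an assumed default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(documents, n=3):
    """Map each distinct n-gram seen in `documents` to a column index."""
    grams = sorted({g for doc in documents for g in char_ngrams(doc, n)})
    return {g: idx for idx, g in enumerate(grams)}


def vectorize(text, vocab, n=3, binary=False):
    """Turn `text` into a feature vector over `vocab`.

    binary=True  -> 1 if the n-gram occurs, else 0
    binary=False -> raw occurrence count (an assumed form of term frequency)
    N-grams not in the vocabulary are ignored.
    """
    counts = Counter(g for g in char_ngrams(text, n) if g in vocab)
    vector = [0] * len(vocab)
    for gram, count in counts.items():
        vector[vocab[gram]] = 1 if binary else count
    return vector


# Hypothetical usage: build the vocabulary from training messages,
# then encode a new message in both representations.
vocab = build_vocabulary(["free offer", "meeting notes"])
tf_vector = vectorize("free free", vocab)
binary_vector = vectorize("free free", vocab, binary=True)
```

Because the features are plain character windows, no tokenizer or lemmatizer is involved, which is what makes the representation language-independent. The resulting vectors could then be passed to any SVM implementation.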