Abstract: | This paper presents a novel method for detecting plagiarized
passages in document collections. In contrast to previous work in
this field, which mainly uses content terms to represent documents,
the proposed method is based on structural information provided
by occurrences of a small list of stopwords (i.e., very frequent
words). We show that stopword n-grams are able to capture local
syntactic similarities between suspicious and original documents.
Moreover, an algorithm for detecting the exact boundaries of
plagiarized and source passages is proposed. Experimental results
on a publicly available corpus demonstrate that the proposed
approach performs competitively with the best reported results.
More importantly, it achieves significantly
better results on difficult plagiarism cases, in which the
plagiarized passages are heavily modified by replacing most of
the words or phrases with synonyms to hide their similarity to the
source documents. |
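
The idea of stopword n-grams summarized in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the stopword list, the n-gram length n=3, and the function names stopword_ngrams and shared_ngrams are all assumptions chosen for illustration, and the passage boundary detection algorithm is not covered. The sketch discards every token that is not a stopword, builds n-grams over the remaining stopword sequence, and intersects the n-gram sets of two texts.

```python
import re

# Hypothetical short stopword list (an assumption; the abstract only says
# "a small list of stopwords" and does not enumerate it).
STOPWORDS = {
    "the", "of", "and", "a", "in", "to", "is", "was", "it", "for",
    "with", "he", "be", "on", "i", "that", "by", "at", "you", "his",
}


def stopword_ngrams(text, n=3):
    """Return the set of n-grams over the stopword-only token sequence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Keep only stopwords, preserving their order of occurrence.
    stopword_sequence = [t for t in tokens if t in STOPWORDS]
    return {
        tuple(stopword_sequence[i:i + n])
        for i in range(len(stopword_sequence) - n + 1)
    }


def shared_ngrams(suspicious, source, n=3):
    """Return the stopword n-grams that occur in both texts."""
    return stopword_ngrams(suspicious, n) & stopword_ngrams(source, n)


if __name__ == "__main__":
    source = "The author of the report was pleased with the results of the study."
    # Same sentence with every content word replaced by a synonym.
    suspicious = ("The writer of the document was happy with "
                  "the outcomes of the investigation.")
    for ngram in sorted(shared_ngrams(suspicious, source)):
        print(" ".join(ngram))
```

In this toy example the two sentences share no content words apart from stopwords, yet their stopword sequences are identical, so every stopword 3-gram of the source also occurs in the suspicious sentence. This is the kind of structural match that survives synonym replacement.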