Spam detector

DROPPY HOOVES

Droppy Hooves is a spam detector. It works by extracting unigrams, bigrams and trigrams from plain text and HTML e-mails and comparing them with its trained database.

Spammicity

An item's spammicity (I'm using 'item' as a generalization of unigrams, bigrams and trigrams) is the number of its occurrences in spam mails divided by its total number of occurrences. An e-mail's total spammicity is the geometric mean of its items' spammicities: their product, raised to the power 1/n, where n is the number of items.
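To make the arithmetic concrete, here is a minimal sketch of the calculation described above. It is not the code from this repository: the Counts type, the function names and the use of plain strings as items are assumptions made for illustration only.

```haskell
-- | Hypothetical per-item counters: occurrences in spam and in legitimate mail.
data Counts = Counts { spamCount :: Int, hamCount :: Int }

-- | Spammicity of one item: spam occurrences divided by total occurrences.
-- Nothing if the item has never been counted.
itemSpammicity :: Counts -> Maybe Double
itemSpammicity (Counts s h)
  | total <= 0 = Nothing
  | otherwise  = Just (fromIntegral s / fromIntegral total)
  where
    total = s + h

-- | Geometric mean of the spammicities of all items found by the lookup
-- function. Items the lookup does not know are skipped; Nothing if none
-- of the items is known.
totalSpammicity :: (String -> Maybe Counts) -> [String] -> Maybe Double
totalSpammicity lookupCounts items
  | null known = Nothing
  | otherwise  = Just (product known ** (1 / fromIntegral (length known)))
  where
    known =
      [ p
      | i <- items
      , Just c <- [lookupCounts i]
      , Just p <- [itemSpammicity c]
      ]
```

For example, with one item counted 9 times in spam and once in legitimate mail (0.9) and another counted once in each (0.5), the result is the square root of 0.45, roughly 0.67. Note that in this sketch a single item with spammicity 0 pulls the whole product to 0; whether and how the real implementation guards against that is not covered here.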

Tokenization

In HTML e-mails, one tag is one token. So even though "<font face=Verdana>" contains a space, the tokenizer does not split it. The end tag "</font>", however, is a token of its own. In plain text, < and > are treated as individual tokens. After tokenization, a set of stop words such as "a", "the" and "is" is stripped, since they do not help distinguish junk from funk.
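The rules above can be sketched roughly as follows. This is an illustration, not the tokenizer from this repository: all function names, the tiny stop word list, the exact-match stop word comparison and the choice to strip stop words before building n-grams are assumptions.

```haskell
import Data.Char (isSpace)
import Data.List (tails)

-- | HTML mode: everything from '<' up to the next '>' is one token,
-- all other text is split on whitespace.
tokenizeHtml :: String -> [String]
tokenizeHtml [] = []
tokenizeHtml s@(c:cs)
  | isSpace c = tokenizeHtml cs
  | c == '<'  = case break (== '>') s of
      (tag, '>':rest) -> (tag ++ ">") : tokenizeHtml rest
      (junk, rest)    -> junk : tokenizeHtml rest  -- unterminated tag: keep as is
  | otherwise =
      let (word, rest) = break (\x -> isSpace x || x == '<') s
      in word : tokenizeHtml rest

-- | Plain text mode: '<' and '>' are tokens of their own.
tokenizePlain :: String -> [String]
tokenizePlain [] = []
tokenizePlain s@(c:cs)
  | isSpace c            = tokenizePlain cs
  | c == '<' || c == '>' = [c] : tokenizePlain cs
  | otherwise =
      let (word, rest) = break (\x -> isSpace x || x `elem` "<>") s
      in word : tokenizePlain rest

-- | Illustrative stop word list; the real list is certainly longer.
stopWords :: [String]
stopWords = ["a", "the", "is"]

-- | Drop tokens that match a stop word exactly.
stripStopWords :: [String] -> [String]
stripStopWords = filter (`notElem` stopWords)

-- | All contiguous runs of exactly n tokens.
ngrams :: Int -> [a] -> [[a]]
ngrams n xs = [take n t | t <- tails xs, length (take n t) == n]

-- | Unigrams, bigrams and trigrams of a token list.
items :: [String] -> [[String]]
items ts = concatMap (`ngrams` ts) [1, 2, 3]
```

Here tokenizeHtml "<font face=Verdana>Buy now</font>" yields four tokens: the opening tag, "Buy", "now" and the closing tag. Feeding three tokens into items gives three unigrams, two bigrams and one trigram.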

Database

At the moment, Droppy Hooves supports SQLite and Postgres databases. As there are no configuration files yet, you need to adjust Main.hs manually (sorry for that; it's on the to-do list!). If you are using Postgres, please make sure that it's version 9.5 or later, as earlier versions don't support ON CONFLICT clauses and training will fail.

If you're sure you want to use an earlier version, edit Train.hs and comment out the postgresql line in insertIgnore, so that the fallback case is used. But don't blame me for the lame performance then!

If you want to use MySQL/MariaDB, you'll need to import the persistent-mysql package, and adjust Main.hs. You'll probably also want to write a MySQL-specific implementation of insertIgnore (see Train.hs; I think the right syntax should be INSERT IGNORE ...). That's optional, but training performance will be poor with the fallback case.
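For orientation, this is roughly what an "insert unless the row already exists" statement looks like in each SQL dialect. The sketch only builds SQL strings and is not taken from Train.hs: the Backend type, the insertIgnoreSql function and the column name phrase are hypothetical, and the real insertIgnore may be structured quite differently.

```haskell
-- | Database backends mentioned in this README (hypothetical type).
data Backend = SQLite | Postgres | MySQL
  deriving (Eq, Show)

-- | Dialect-specific "insert or skip" statement for the spam_phrases table.
-- The column name "phrase" is a placeholder, not the real schema.
insertIgnoreSql :: Backend -> String
insertIgnoreSql backend = case backend of
  SQLite   -> "INSERT OR IGNORE INTO " ++ target
  Postgres -> "INSERT INTO " ++ target ++ " ON CONFLICT DO NOTHING" -- Postgres 9.5+
  MySQL    -> "INSERT IGNORE INTO " ++ target
  where
    target = "spam_phrases (phrase) VALUES (?)"
```

A single statement like this lets the database skip duplicates itself, which is presumably why it beats a generic fallback that has to handle existing rows one by one.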

The table is called 'spam_phrases'. If you want to change that, edit Persistable.hs.

Training

Droppy can be trained as follows:

droppyhooves train junk <SpamMail.eml
droppyhooves train funk <LegitimateMail.eml

Evaluation

An e-mail can be evaluated as follows:

droppyhooves eval <YourMail.eml

The output is a percentage, or 'Unknown' if none of the e-mail's items are in the database or the e-mail contains no text at all.

Bugs

At the moment, Droppy still has problems parsing some mails. I'll fix that.

Future

I'm thinking about analyzing the Subject header as well. Another idea would be to build statistics about the servers in Received headers. Also, binary attachments could be split into 1 kB chunks and their hashes stored in a database table to recognize common spam attachments. For more, see the issue tracker.