Dec. 23, 2002
Notice: this code is alpha quality, and has so far been tested
only under Linux.
It should, however, also function under FreeBSD,
Solaris, and (probably) Mac OS X.
The expaminator is an SMTP server-side spam filter based on
Paul Graham's article,
"A Plan for Spam".
Currently, filters for sendmail and
postfix have been written. Others,
possibly starting with qmail, will follow.
Command-line tools for spam-database management and testing are included.
The actual filters
Copies of the expaminator are available at:
sourceforge.net/projects/expaminator.
Untar it and enter the directory corresponding to the SMTP agent you are
using; for sendmail, 'cd expaminator/src/sendmail'; for postfix,
'cd expaminator/src/postfix', and so on.
If you can live with the binaries being installed in /usr/local/bin and
/usr/local/sbin, and the man pages under /usr/local/man, then enter:
./configure
make
make install
If you have other preferences, read expaminator/INSTALL and run "./configure --help" from the expaminator/src/ subdirectory for your SMTP agent; this explains the many options that 'configure' accepts.
First, catch some spam. :) This is usually not too difficult, but what you really need is a large body of recent, representative, and varied spam. There are several archives of spam on the web, such as Bruce Guenter's www.em.ca/~bruceg/spam, and the soon-to-be-opened SpamArchive.org.
You will also need a source of normal, non-spam email. This can be a problem. Most people dislike the notion of others getting hold of their personal mail, and rightly so. However, a large body of this hard-to-come-by commodity is necessary if you want to avoid high rates of spam leakage, or, =much worse=, false positives. A false positive means that you just bounced perfectly innocent email that someone just might badly want. You are bouncing, and not silently discarding, aren't you?
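To put numbers on those two kinds of error, here is a minimal sketch in Python. It is not part of the expaminator: the function name and the input format are made up for illustration, and the 0.9 cutoff is simply the one Graham's article uses. It assumes you already have a pile of messages that have been scored by a filter and labeled by hand.

```python
def error_rates(scored, threshold=0.9):
    """Return (false_positive_rate, leakage_rate).

    scored is a list of (score, is_spam) pairs, where score is the
    filter's spam probability and is_spam is the hand-assigned label.
    A message is treated as spam when its score exceeds the threshold.
    """
    n_spam = sum(1 for _, is_spam in scored if is_spam)
    n_normal = len(scored) - n_spam

    # Spam that scored at or below the threshold leaked through.
    leaked = sum(1 for score, is_spam in scored
                 if is_spam and score <= threshold)
    # Normal mail that scored above the threshold would be bounced.
    false_pos = sum(1 for score, is_spam in scored
                    if not is_spam and score > threshold)

    fp_rate = false_pos / n_normal if n_normal else 0.0
    leak_rate = leaked / n_spam if n_spam else 0.0
    return fp_rate, leak_rate
```

Raising the threshold trades false positives for leakage, which is usually the right direction to err in.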
Figuring out just how to get hold of a reasonably representative body of normal mail will probably be the trickiest part of running a Bayesian filter for any large group of users. I leave this problem... lessee... "as an exercise for the reader". Yeah, that's the ticket!
It is probably wise to give users the option of not having their mail filtered at all. Expaminator supports a simple user-address hash, and filters mail only for recipients found in it.
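The opt-in check can be pictured with a few lines of Python. This is only a sketch: the on-disk format of expaminator's user hash isn't described here, so a plain set of addresses stands in for it, and the function names, the one-address-per-line input, and the case-folding of addresses are all assumptions.

```python
def normalize(address):
    """Trim whitespace and fold case (an assumption; strictly speaking,
    the local part of an address may be case-sensitive)."""
    return address.strip().lower()

def load_user_hash(lines):
    """Build the opt-in set from lines holding one address each.
    Blank lines and lines starting with '#' are skipped."""
    return {normalize(line) for line in lines
            if line.strip() and not line.lstrip().startswith("#")}

def recipients_to_filter(recipients, user_hash):
    """Return only those recipients who have opted in to filtering."""
    return [r for r in recipients if normalize(r) in user_hash]
```

Mail for any recipient not in the returned list would be passed along untouched.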
For those that do opt in, it may be worth providing an address to which they can forward spam that has leaked through the filter. This can then be periodically added to the db hashes with the 'addspam' utility. Unfortunately, every such forward will need to be checked by hand to prevent malicious contamination of the spam hash.
Here's an attempt to show the overall workings of the expaminator; 'create_probability_hash' reads messages from a number of spam-directories and normal-message directories, creating a probability hash. A filter uses this hash. That fiddly bit at the lower left shows 'addspam' reading new spam and updating two hashes.
[ "normal" msg directory 1 ]        [ spam directory 1 ]
             :                               :
[ "normal" msg directory N ]        [ spam directory M ]
             |                               |
             `-----------.       ,-----------'
                         |       |
                         V       V
                  create_probability_hash
                             |
        ,--------------------+--------------------.
        |                    |                    |
        V                    V                    V
[ normalwordhash ]   [ spamwordhash ]   [ probability hash ]
                             ^                ^       |
                             |                |       |
[ new spam ] --> addspam ----+----------------'       |
                                                      V
                                                   filter

The general idea is this...
Use 'grade' and/or 'shovelmail' to test and fiddle with the various tuning parameters in the config file until you're happy with them.
Create a user-hash, and add the addresses of all users brave enough to trust their mail to a statistical test.
Make a forwarding account to which they can send leaked spam. Every few days, scan the spam to make sure it's genuine, and use 'addspam' to keep the database up-to-date. Shovel this new spam into one of the spam directories used by 'create_probability_hash', ready for the next cron-job run. Then thank Paul Graham for being a really smart guy.
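For the curious, here is roughly what the three hashes hold, sketched in Python along the lines of Graham's article. The constants (good counts doubled, words seen fewer than 5 times skipped, probabilities clamped to .01 and .99, unknown words scored .4, the 15 most interesting tokens combined) come from "A Plan for Spam". Whether 'create_probability_hash' and 'addspam' do exactly this is not stated here, so the function names and data layout below are assumptions for illustration only.

```python
import re
from collections import Counter

# Letters, digits, dashes, apostrophes and dollar signs make up a token.
TOKEN_RE = re.compile(r"[a-z0-9$'-]+")

def tokenize(text):
    """Split text into lowercase tokens, ignoring all-digit tokens."""
    return [t for t in TOKEN_RE.findall(text.lower()) if not t.isdigit()]

def count_words(messages):
    """Word hash: token -> number of occurrences across the messages."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts

def build_probability_hash(normal_counts, spam_counts, n_normal, n_spam):
    """Probability hash: token -> estimated chance that a message
    containing it is spam. n_normal and n_spam are message counts."""
    if n_normal <= 0 or n_spam <= 0:
        raise ValueError("need at least one normal and one spam message")
    probs = {}
    for word in set(normal_counts) | set(spam_counts):
        g = 2 * normal_counts.get(word, 0)  # doubled to bias against false positives
        b = spam_counts.get(word, 0)
        if g + b < 5:
            continue  # too rare to say anything useful
        good = min(1.0, g / n_normal)
        bad = min(1.0, b / n_spam)
        probs[word] = max(0.01, min(0.99, bad / (good + bad)))
    return probs

def spam_score(text, probs, unknown=0.4, top_n=15):
    """Combine the top_n tokens whose probabilities lie furthest from 0.5."""
    ps = [probs.get(w, unknown) for w in set(tokenize(text))]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod, inv = 1.0, 1.0
    for p in ps[:top_n]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

def add_spam(new_spam, normal_counts, spam_counts, n_normal, n_spam):
    """Fold new spam into the spam word hash (updated in place) and
    rebuild the probability hash. Returns (probs, new spam total)."""
    spam_counts.update(count_words(new_spam))
    n_spam += len(new_spam)
    probs = build_probability_hash(normal_counts, spam_counts, n_normal, n_spam)
    return probs, n_spam
```

In Graham's scheme a message is treated as spam when its score comes out above 0.9; that cutoff is presumably one of the tuning parameters you will want to fiddle with.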