The Expaminator

Dec. 23, 2002
Notice: this code is alpha quality, and has so far been tested only under Linux.
It should, however, also function under FreeBSD, Solaris, and (probably) Mac OS X.

The expaminator is an SMTP server-side spam filter based on Paul Graham's article, "A Plan for Spam".
Currently, filters for sendmail and postfix have been written. Others, possibly starting with qmail, will follow.

Command-line tools for spam-database management and testing are included.

All the bits and pieces, so far:

The actual filters

sendmail_bayes - a 'milter' for sendmail
postfix_bayes - a content-filter daemon for postfix

Spam database utilities

make_new_database - creates new Berkeley DB hash files from "normal" email and spam files, suitable for use by the filters. 'make_new_database' may be run as a (perhaps weekly) cron job to keep the spam database current; it is not necessary to stop the filters during database updates.
create_word_hash and create_probability_hash - intermediate Perl scripts executed by 'make_new_database'.
addspam - add newly-found spam to the existing database.
create_user_hash, addto_user_hash, and delfrom_user_hash - short Perl scripts to manage a simple 'opt-in' database for users.

Testing utilities

grade - a command-line script to "grade" email messages as to how much they look like spam. This permits experimentation with various parameters to fine-tune the filters.
peekhash - lets you "peek" at values in any of the database hashes
dumphash - dumps an entire database hash to standard output, sorted by keys or values.
shovelmail - "shovels" text files into an SMTP server. Useful for load testing, or just seeing if everything works as expected.
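
For the curious, here is roughly what a shovelmail-style tool does, as a minimal Python sketch. This is not the shipped 'shovelmail'; the function name 'shovel' and its arguments are made up for illustration, and each text file is assumed to hold one complete message, headers and all.

```python
from pathlib import Path

def shovel(paths, sender, recipient, smtp):
    """Send each file's raw contents as one message; return how many were sent.

    'smtp' is any object with a sendmail(from_addr, to_addrs, msg) method,
    for example an open smtplib.SMTP connection.
    """
    sent = 0
    for path in paths:
        body = Path(path).read_bytes()   # one file == one message (assumption)
        smtp.sendmail(sender, [recipient], body)
        sent += 1
    return sent

# Hypothetical usage against a test server on localhost:
#
#   import smtplib
#   with smtplib.SMTP("localhost", 25) as server:
#       shovel(["msg1.txt", "msg2.txt"], "me@example.org", "you@example.org", server)
```

Passing the connection in, instead of opening it inside the function, makes it easy to reuse one connection for a whole directory of files, which is what you want for load testing.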

Miscellaneous utilities

tidy_up - Clean up directories full of email, discarding spam. Probably not that useful, but what the heck...

Configuration

expaminator.conf - Various configurable parameters

Installing the Expaminator

Copies of the expaminator are available at: sourceforge.net/projects/expaminator.
Untar it and enter the directory corresponding to the SMTP agent you are using; for sendmail, 'cd expaminator/src/sendmail'; for postfix, 'cd expaminator/src/postfix', and so on.

If you can live with the binaries being installed in /usr/local/bin and /usr/local/sbin, and the man pages under /usr/local/man, then enter:
./configure
make
make install

If you have other preferences, read expaminator/INSTALL and do "./configure --help" while in expaminator/src/..whatever.. ; this will explain the many different options available with 'configure'.

Using the Expaminator

First, catch some spam. :) This is usually not too difficult, but what you really need is a large body of recent, representative, and varied spam. There are several archives of spam on the web, such as Bruce Guenter's www.em.ca/~bruceg/spam, and the soon-to-be-opened SpamArchive.org.

You will also need a source of normal, non-spam email. This can be a problem. Most people dislike the notion of others getting hold of their personal mail, and rightly so. However, a large body of this hard-to-come-by commodity is necessary if you want to avoid high rates of spam leakage, or, =much worse=, false positives. A false positive means that you just bounced perfectly innocent email that someone just might badly want. You are bouncing, and not silently discarding, aren't you?

Figuring out just how to get hold of a reasonably representative body of normal mail will probably be the trickiest part of running a Bayesian filter for any large group of users. I leave this problem.. lessee.. "as an exercise for the reader". Yeah, that's the ticket!

It is probably wise to give users the option of not having their mail filtered at all. Expaminator supports a simple user-address hash, and filters mail only for recipients found in it.
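
The opt-in test itself is simple. Here is a minimal Python sketch of the idea, using an in-memory set where the real filters use a user hash on disk. The function name and the lower-casing of addresses are assumptions made for this example, not a description of the shipped code.

```python
def recipients_to_filter(recipients, opted_in):
    """Return only those recipients whose address is in the opt-in set.

    'opted_in' stands in for the user hash; addresses are compared
    in lower case with surrounding whitespace removed (an assumption).
    """
    return [r for r in recipients if r.strip().lower() in opted_in]

# Mail for anyone not listed passes through unfiltered.
opted_in = {"alice@example.org"}
print(recipients_to_filter(["Alice@Example.org", "bob@example.org"], opted_in))
```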

For those that do opt in, it might be an idea to provide an address to which they can forward spam which has leaked through the filter. This can then be periodically added to the db hashes with the 'addspam' utility. Unfortunately, it will be necessary to visually check all such forwards to prevent malicious contamination of the spam hash.

Here's an attempt to show the overall workings of the expaminator: 'create_probability_hash' reads messages from a number of spam directories and normal-message directories, creating a probability hash. A filter uses this hash. That fiddly bit at the lower left shows 'addspam' reading new spam and updating two hashes.


      [ "normal" msg directory 1 ]   [ spam directory 1 ]
             :             :             :        :
      [ "normal" msg directory N ]   [ spam directory M ]
                     |                         |
                     |                         |
                     |                         |
                     `-----------.      ,------'
                                 |      |
                                 |      |
                                 V      V
                         create_probability_hash 
                                   |
                                   |
                ,------------------+----------------.
                |                  |                |
                |                  |                |
                V                  V                V
        [ normalwordhash ] [ spamwordhash ] [ probability hash ]
                                 |     ^      | ^   |
                ,-------------.  |  ,--|------' |   |
                |             |  |  |  |        |   |
           [ new spam ]       V  V  V  ^        |   |
                              addspam  |        ^   |
                                  |    |        |   |
                                  V____|________|   |
                                                    |
                                                    V
                                                  filter
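
As a rough illustration of what ends up in the probability hash, here is a minimal Python sketch of the per-word formula from "A Plan for Spam". It is not the code in 'create_probability_hash'; plain dicts stand in for the two word hashes, the function name is made up, and the constants (good counts doubled, at least 5 occurrences, results clamped to 0.01..0.99) are the ones given in Graham's article. The Expaminator's own values may differ.

```python
def build_probability_hash(good_counts, spam_counts, n_good, n_spam):
    """Map each word to an estimate of P(message is spam | word appears).

    good_counts / spam_counts: word -> number of occurrences in each corpus.
    n_good / n_spam: number of messages in each corpus (both must be > 0).
    """
    probs = {}
    for word in set(good_counts) | set(spam_counts):
        g = 2 * good_counts.get(word, 0)   # doubled to bias against false positives
        b = spam_counts.get(word, 0)
        if g + b < 5:                      # too rare to say anything useful
            continue
        good_ratio = min(1.0, g / n_good)
        spam_ratio = min(1.0, b / n_spam)
        p = spam_ratio / (good_ratio + spam_ratio)
        probs[word] = max(0.01, min(0.99, p))   # never fully certain either way
    return probs
```

Doubling the good counts is what makes the filter err on the side of letting spam leak instead of bouncing innocent mail.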

The general idea is this...
Configure your filter, and set up whatever MTA you have to match. Put a mess of spam in one or more directories, and a large number of typical non-spammy messages in another set of directories. Run 'create_probability_hash' once, and set it up as a cron job that runs maybe once a week or once a month.

Use 'grade' and/or 'shovelmail' to test and fiddle with the various tuning parameters in the config file until you're happy with them.
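
To give a feel for what the tuning parameters do, here is a minimal Python sketch of grading a message the way Graham's article describes: look up each token, keep the 15 whose probabilities are furthest from 0.5, and combine them. This is not the shipped 'grade' script. The parameter names 'unknown' and 'n_interesting' are hypothetical (they are not necessarily what expaminator.conf calls them), the tokenizer is a deliberately crude stand-in, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import re

def grade(text, probs, unknown=0.4, n_interesting=15):
    """Return a combined spam probability for 'text' in the range 0..1.

    probs: word -> spam probability, each strictly between 0 and 1.
    unknown: probability assigned to words never seen before.
    n_interesting: how many of the most extreme tokens to combine.
    """
    # Crude tokenizer (assumption): lower-cased runs of letters, digits, $, ' and -
    tokens = set(re.findall(r"[a-z0-9$'-]+", text.lower()))
    scores = [probs.get(t, unknown) for t in tokens]
    # Most interesting == furthest from the neutral 0.5
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod = 1.0
    inverse = 1.0
    for p in scores[:n_interesting]:
        prod *= p
        inverse *= 1.0 - p
    return prod / (prod + inverse)

probs = {"viagra": 0.99, "free": 0.9, "meeting": 0.01}
print(grade("Free VIAGRA", probs))        # close to 1: looks like spam
print(grade("meeting tomorrow", probs))   # close to 0: looks normal
```

Graham's article treats anything scoring above 0.9 as spam. Raising that threshold trades more leaked spam for fewer false positives.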

Create a user-hash, and add the addresses of all users brave enough to trust their mail to a statistical test.

Make a forwarding account to which they can send leaked spam. Every few days, scan the spam to make sure it's genuine, and use 'addspam' to keep the database up-to-date. Shovel this new spam into one of the spam directories used by 'create_probability_hash', ready for the next cron-job run. Then thank Paul Graham for being a really smart guy.