make_new_database(1) Expaminator make_new_database(1)
NAME
make_new_database - make a new probability database for the Bayesian filter(s)
SYNOPSIS
make_new_database [-h] [-v] [-c config-file]
DESCRIPTION
Make a new 'probability' database for the Bayesian filter(s).
make_new_database creates a dictionary of words found in
spam and normal messages, assigning a probability value to
each, based on the frequency of occurrence in spam versus
normal messages. make_new_database must be run at least
once before using the Bayesian filters, and should be run
at reasonable intervals thereafter. The length of a
"reasonable" interval depends upon the mutation rate of
incoming spam, but will probably fall somewhere between a
week and a month or so. It may be run as a cron job, and
runs to completion without interfering with the filters,
though it may impose a heavy CPU load.
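The idea of a per-word probability can be illustrated with a minimal Python sketch. The formula below (the spam rate's share of the combined rate, clamped to 0.01..0.99) is an assumption made for illustration; this page does not state the formula that make_new_database actually uses, and the function name and sample counts are hypothetical.

```python
from collections import Counter

def word_probabilities(spam_counts, normal_counts, n_spam, n_normal):
    """Assumed scoring: share of the spam rate in the combined rate."""
    probs = {}
    for word in set(spam_counts) | set(normal_counts):
        spam_rate = spam_counts.get(word, 0) / max(n_spam, 1)
        normal_rate = normal_counts.get(word, 0) / max(n_normal, 1)
        total = spam_rate + normal_rate
        if total == 0:
            continue  # word never actually seen; nothing to score
        # Clamp so that no single word is ever treated as certain.
        probs[word] = min(0.99, max(0.01, spam_rate / total))
    return probs

# Hypothetical counts from 10 spam and 10 normal messages.
spam = Counter({"viagra": 8, "meeting": 1})
normal = Counter({"meeting": 9, "lunch": 4})
probs = word_probabilities(spam, normal, n_spam=10, n_normal=10)
```

Words seen only in spam end up at the upper clamp, words seen only in normal mail at the lower clamp, and shared words fall in between.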
The newly-created files are made unique by appending the
current date and time to the name. After all else is
done, symbolic links used by the filters are deleted and
re-created, pointing to these new files. No attempt is
made to clean out old files; this can be done manually or
by a cron job running every few months.
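The file-naming and relinking step described above might look roughly like the following Python sketch. The timestamp format, the function name and the demo file name are assumptions; only the overall behaviour (a date-stamped file, then the symbolic link deleted and re-created) comes from this page.

```python
import os
import tempfile
import time

def publish(data_dir, link_name, payload):
    """Write a date-stamped file, then point the symbolic link at it."""
    stamp = time.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    target = os.path.join(data_dir, f"{link_name}.{stamp}")
    with open(target, "wb") as fh:
        fh.write(payload)
    link = os.path.join(data_dir, link_name)
    if os.path.lexists(link):  # also true for a dangling link
        os.remove(link)        # delete the old link ...
    os.symlink(os.path.basename(target), link)  # ... and re-create it
    return target

# Demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
target = publish(demo_dir, "probabilityhash", b"demo")
resolved = os.path.realpath(os.path.join(demo_dir, "probabilityhash"))
```

Older date-stamped files are left in place, which matches the note that cleanup is a separate, manual or cron-driven task.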
Most of the processing is done by two lower-level perl
scripts, 'create_word_hash' and 'create_probability_hash'.
[ "normal" msg directory 1 ]    [ spam directory 1 ]
:                          :    :                  :
[ "normal" msg directory N ]    [ spam directory M ]
             |                           |
             |                           |
             V                           V
     create_word_hash            create_word_hash
             |                           |
             |                           |
             V                           V
    [ normalwordhash ]           [ spamwordhash ]
             |                           |
             |                           |
             -----------------------------
                           |
                           |
                           V
                create_probability_hash
                           |
                           |
                           V
                 [ probability hash ]
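As a rough illustration of the first stage in the diagram, the Python sketch below walks a list of message directories recursively and counts words. It is not the create_word_hash script (which is written in perl); the tokenising rule, the function name and the demo files are assumptions.

```python
import os
import re
import tempfile
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")  # assumed, deliberately naive tokeniser

def count_words(directories):
    """Count words in every file beneath each of the given directories."""
    counts = Counter()
    for top in directories:
        for dirpath, _subdirs, files in os.walk(top):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, errors="replace") as fh:
                    counts.update(w.lower() for w in WORD.findall(fh.read()))
    return counts

# Demonstration: one message at the top level, one in a subdirectory.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "2002"))
with open(os.path.join(top, "msg1"), "w") as fh:
    fh.write("Buy now")
with open(os.path.join(top, "2002", "msg2"), "w") as fh:
    fh.write("buy cheap")
counts = count_words([top])
```

Running the same function once over the normal-message directories and once over the spam directories would give the two word hashes that feed the final stage.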
Command-line options:
-h Help; print the command-line options and exit.
-v Be verbose.
-c Specify a configuration file. If '-c config-file' is
omitted, the environment variable 'SPAMCONFIG' is used.
CONFIGURATION
make_new_database's configuration file is composed of
simple keyword-value pairs, one pair per line. Keywords are
not case-sensitive; keyword and value are separated by one
or more spaces or tabs. A comment symbol, '#', anywhere on
a line causes all following text to be ignored.
sendmail_bayes will stop scanning for a keyword at the
first occurrence in the file.
This configuration file is shared by other database
maintenance and testing utilities, and the spam-filters
themselves.
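The file format described above is simple enough to sketch a reader for. The Python below is an illustration under stated assumptions, not the parser used by the Expaminator scripts: it treats the two message-directory keywords as repeatable (as this page allows), keeps the first occurrence of any other keyword, and silently skips lines without a value. All paths in the sample are hypothetical.

```python
def parse_config(text, repeatable=("normal_messages_dir", "spam_messages_dir")):
    """Parse keyword-value lines; '#' starts a comment anywhere on a line."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)  # one or more spaces or tabs
        if len(parts) != 2:
            continue                 # keyword without a value: skipped here
        key, value = parts[0].lower(), parts[1].strip()
        if key in repeatable:
            config.setdefault(key, []).append(value)
        elif key not in config:      # first occurrence wins
            config[key] = value
    return config

sample = """\
# hypothetical paths, for illustration only
SpamDataDir   /var/spool/filter/data
normal_messages_dir /home/alice/mail/good   # trailing comment
normal_messages_dir /home/bob/mail/good
probabilityhash probhash
probabilityhash ignored-second-occurrence
"""
config = parse_config(sample)
```

Here 'SpamDataDir' is stored under the lower-cased key 'spamdatadir', and the second 'probabilityhash' line has no effect.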
normal_messages_dir <directory>
Required. The name of a directory containing "normal",
non-spam, messages.
There may be as many of these lines in the configuration
file as desired. Each directory is processed recursively,
however, so no specified directory should be beneath any
other specified.
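Because directories are processed recursively, a nested pair would presumably cause the same messages to be counted twice. A configuration could be checked for this beforehand with something like the following Python sketch; the function name and example paths are hypothetical, and make_new_database itself is not known to perform such a check.

```python
import os

def nested_pairs(directories):
    """Return (outer, inner) pairs where one directory lies beneath another."""
    paths = [os.path.abspath(d) for d in directories]
    clashes = []
    for inner in paths:
        for outer in paths:
            if inner != outer and os.path.commonpath([inner, outer]) == outer:
                clashes.append((outer, inner))
    return clashes

# Hypothetical directory list with one nested entry.
clashes = nested_pairs(["/mail/good", "/mail/good/2002", "/mail/lists"])
```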
normalwordhash <filename>
Required. This is the name of the symbolic link to the
actual normal-word hash.
probabilityhash <filename>
Required. This is the name of the symbolic link to the
actual probability hash.
spamwordhash <filename>
Required. This is the name of the symbolic link to the
actual spam-word hash.
spamdatadir <directory>
Required. The directory containing the normal-word hash,
spam-hash, probability hash, and the optional username
hash.
spam_messages_dir <directory>
Required. The name of a directory containing spam
messages.
There may be as many of these lines in the configuration
file as desired. Each directory is processed recursively,
however, so no specified directory should be beneath any
other specified.
updatelockfile <filename>
Required. This is the name of a file in 'spamdatadir'
which is briefly created by make_new_database before
re-creating the probability-hash symbolic link. Newly-created
filter processes will delay up to 5 seconds before
beginning to run if this file has somehow been left in the
directory; it should therefore be given an obviously "bad"
name.
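On the filter side, the bounded delay could be implemented along these lines. This Python sketch is only a guess at the mechanism: the page states the 5-second limit, but the polling loop, the poll interval, the function name and the lock-file name used in the demonstration are all assumptions.

```python
import os
import tempfile
import time

def wait_for_update(lock_path, timeout=5.0, poll=0.1):
    """Wait while the lock file exists, but never longer than 'timeout'.

    Returns True if the lock is gone, False if it was still present
    when the timeout expired (for example, a stale lock left behind).
    """
    deadline = time.monotonic() + timeout
    while os.path.exists(lock_path) and time.monotonic() < deadline:
        time.sleep(poll)
    return not os.path.exists(lock_path)

demo_dir = tempfile.mkdtemp()
lock = os.path.join(demo_dir, "UPDATE-IN-PROGRESS.lock")  # hypothetical name
clear_result = wait_for_update(lock)       # no lock file: returns at once
open(lock, "w").close()                    # simulate a lock left behind
stale_result = wait_for_update(lock, timeout=0.3)
```

A filter would proceed in either case; the timeout only limits how long a leftover lock file can hold it up.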
ENVIRONMENT
$SPAMCONFIG can be used to supply the full pathname of the
configuration file. (The '-c config-file' option will
override $SPAMCONFIG.)
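The precedence between the option and the environment variable can be sketched in Python as follows. The option letters come from this page; the function name and the error raised when neither source is available are assumptions, since the page does not say what make_new_database does in that case.

```python
import argparse

def resolve_config(argv, environ):
    """Return (config_path, verbose); '-c' takes priority over SPAMCONFIG."""
    parser = argparse.ArgumentParser(prog="make_new_database")
    parser.add_argument("-v", action="store_true", help="be verbose")
    parser.add_argument("-c", metavar="config-file",
                        help="configuration file")
    args = parser.parse_args(argv)
    path = args.c or environ.get("SPAMCONFIG")
    if not path:
        # Assumed behaviour: refuse to run without a configuration file.
        parser.error("no configuration file: use -c or set SPAMCONFIG")
    return path, args.v

# Hypothetical file names; a dict stands in for the real environment.
from_option = resolve_config(["-c", "a.conf"], {"SPAMCONFIG": "b.conf"})
from_env = resolve_config(["-v"], {"SPAMCONFIG": "b.conf"})
```

In real use the arguments would be sys.argv[1:] and os.environ.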
FILES
Required: a configuration file, at least one directory
containing "normal" messages, and at least one directory
containing spam messages.
SEE ALSO
create_word_hash, create_probability_hash
COPYRIGHT
Copyright (c) 2002, J.B.Ward
<bward2@users.sourceforge.net>
Expaminator Nov.28,2002 make_new_database(1)