make_new_database(1)          Expaminator         make_new_database(1)

NAME
       make_new_database - make a new probability database for the
       Bayesian filters

SYNOPSIS
       make_new_database [-h] [-v] [-c config-file]

DESCRIPTION
       Make a new 'probability' database for the Bayesian filter(s).

       make_new_database creates a dictionary of words found in spam
       and normal messages, assigning a probability value to each,
       based on the frequency of occurrence in spam versus normal
       messages.

       make_new_database must be run at least once before using the
       Bayesian filters, and should be run at reasonable intervals
       thereafter. The length of a "reasonable" interval depends upon
       the mutation rate of incoming spam, but will probably fall
       somewhere between a week and a month or so.

       It may be run as a cron job, and runs to completion without
       interfering with the filters, though it may impose a heavy CPU
       load.

       The newly-created files are made unique by appending the
       current date and time to the name. After all else is done, the
       symbolic links used by the filters are deleted and re-created,
       pointing to these new files. No attempt is made to clean out
       old files; this can be done manually or by a cron job running
       every few months.

       Most of the processing is done by two lower-level Perl
       scripts, 'create_word_hash' and 'create_probability_hash'.

       [ "normal" msg directory 1 ]   [ spam directory 1 ]
                   :                          :
                   :                          :
       [ "normal" msg directory N ]   [ spam directory M ]
                   |                          |
                   |                          |
                   V                          V
           create_word_hash            create_word_hash
                   |                          |
                   |                          |
                   V                          V
           [ normalwordhash ]          [ spamwordhash ]
                   |                          |
                   |                          |
                   ----------------------------
                                |
                                |
                                V
                     create_probability_hash
                                |
                                |
                                V
                      [ probability hash ]

       Command-line options:

       -h     Help; print the command-line options and exit.

       -v     Be verbose.

       -c config-file
              Specify a configuration file. If '-c config-file' is
              omitted, the environment variable 'SPAMCONFIG' is used.

CONFIGURATION
       make_new_database's configuration file is composed of simple
       keyword-value pairs, one pair per line. Keywords are not
       case-sensitive; keyword and value are separated by one or more
       spaces or tabs.
       A comment symbol, '#', anywhere on a line causes all following
       text on that line to be ignored. sendmail_bayes will stop
       scanning for a keyword at the first occurrence in the file.

       This configuration file is shared by other database
       maintenance and testing utilities, and by the spam filters
       themselves.

       normal_messages_dir <directory>
              Required. The name of a directory containing "normal"
              (non-spam) messages. There may be as many of these
              lines in the configuration file as desired. Each
              directory is processed recursively, so no listed
              directory should lie beneath another listed directory.

       normalwordhash <filename>
              Required. This is the name of the symbolic link to the
              actual normal-word hash.

       probabilityhash <filename>
              Required. This is the name of the symbolic link to the
              actual probability hash.

       spamwordhash <filename>
              Required. This is the name of the symbolic link to the
              actual spam-word hash.

       spamdatadir <directory>
              Required. The directory containing the normal-word
              hash, spam-word hash, probability hash, and the
              optional username hash.

       spam_messages_dir <directory>
              Required. The name of a directory containing spam
              messages. There may be as many of these lines in the
              configuration file as desired. Each directory is
              processed recursively, so no listed directory should
              lie beneath another listed directory.

       updatelockfile <filename>
              Required. This is the name of a file in 'spamdatadir'
              which is briefly created by make_new_database before
              re-creating the probability-hash symbolic link.
              Newly-created filter processes will delay up to 5
              seconds before beginning to run if this file has
              somehow been left in the directory; it should therefore
              be given an obviously "bad" name.

ENVIRONMENT
       $SPAMCONFIG can be used to supply the full pathname of the
       configuration file. (The '-c config-file' option will override
       $SPAMCONFIG.)

FILES
       Required: a configuration file, at least one directory
       containing "normal" messages, and at least one directory
       containing spam.
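
EXAMPLES
       The configuration rules described under CONFIGURATION can be
       sketched in Python as shown here. This is an illustration
       only, not the Expaminator code (the real tools are Perl
       scripts). The names parse_config and missing_keywords, the
       sample paths, and the lock-file name are all made up for the
       example. The sketch also assumes that the "first occurrence"
       rule applies to every keyword except the two directory
       keywords that may be repeated.

```python
# Illustrative sketch only; NOT the Expaminator implementation.
# All function names and sample paths are hypothetical.

# Keywords the man page says may appear on several lines.
REPEATABLE = {"normal_messages_dir", "spam_messages_dir"}

# Keywords the man page marks as "Required".
REQUIRED = REPEATABLE | {
    "normalwordhash", "spamwordhash", "probabilityhash",
    "spamdatadir", "updatelockfile",
}


def parse_config(text):
    """Parse keyword-value lines into a dict.

    Repeatable keywords collect every value, in file order. For any
    other keyword only the first occurrence is kept (an assumption
    about how the first-occurrence rule is applied).
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # '#' starts a comment anywhere
        if not line:
            continue
        parts = line.split(None, 1)          # first run of spaces or tabs
        if len(parts) != 2:
            continue                         # keyword without a value: skipped
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword in REPEATABLE:
            settings.setdefault(keyword, []).append(value)
        elif keyword not in settings:
            settings[keyword] = value
    return settings


def missing_keywords(settings):
    """Return the required keywords absent from a parsed configuration."""
    return sorted(REQUIRED - settings.keys())


# A made-up configuration file; every path and name is hypothetical.
SAMPLE = """\
# Paths below are invented for illustration.
SpamDataDir         /var/spool/expaminator
normal_messages_dir /home/alice/Mail/inbox
normal_messages_dir /home/alice/Mail/lists   # second "normal" directory
spam_messages_dir   /home/alice/Mail/spam
normalwordhash      normalwordhash
spamwordhash        spamwordhash
probabilityhash     probabilityhash
probabilityhash     ignored-duplicate
updatelockfile\tUPDATE_IN_PROGRESS_DO_NOT_KEEP
"""

config = parse_config(SAMPLE)
```

       With this sample text, parse_config keeps both
       normal_messages_dir values, matches 'SpamDataDir' without
       regard to case, and ignores the second probabilityhash line;
       missing_keywords(config) then returns an empty list.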

SEE ALSO
       create_word_hash, create_probability_hash

COPYRIGHT
       Copyright (c) 2002, J.B.Ward <bward2@users.sourceforge.net>

Expaminator                  Nov.28,2002          make_new_database(1)