make_new_database(1)       Expaminator       make_new_database(1)



NAME
       make_new_database - make a new 'probability' database for
       the Bayesian filter(s)

SYNOPSIS
       make_new_database [-h] [-v] [-c config-file]

DESCRIPTION
       Make  a  new  'probability' database for the Bayesian fil­
       ter(s).
       make_new_database creates a dictionary of words  found  in
       spam and normal messages, assigning a probability value to
       each, based on the frequency of occurrence in spam  versus
       normal  messages.   make_new_database must be run at least
       once before using the Bayesian filters, and should be  run
       at reasonable intervals thereafter.  The length of a "rea­
       sonable" interval depends upon the mutation rate of incom­
       ing  spam, but will probably fall somewhere from a week to
       a month or so.  It may be run as a cron job, and  runs  to
       completion without interfering with the filters, though it
       may impose a heavy CPU load.

       The newly-created files are made unique by  appending  the
       current  date  and  time  to  the name.  After all else is
       done, symbolic links used by the filters are  deleted  and
       re-created,  pointing  to  these new files.  No attempt is
       made to clean out old files; this can be done manually  or
       by a cron job running every few months.
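
       The naming and re-linking step described above can be
       sketched as follows.  This is an illustration only, not the
       script's own code: the timestamp format, the helper names
       and the file name 'probabilityhash' are all assumptions.

```python
import os
import tempfile
from datetime import datetime


def timestamped_name(base, now):
    # Append date and time to make the name unique.
    # The exact suffix format is an assumption; the man page does not give it.
    return f"{base}.{now.strftime('%Y%m%d-%H%M%S')}"


def repoint_link(link_path, target_path):
    # Delete the old symbolic link (if present), then re-create it
    # so that it points at the newly built file.
    if os.path.islink(link_path):
        os.remove(link_path)
    os.symlink(target_path, link_path)


def demo():
    # Simulate two database rebuilds a month apart in a scratch directory.
    data_dir = tempfile.mkdtemp()
    link = os.path.join(data_dir, "probabilityhash")  # hypothetical link name
    for stamp in (datetime(2002, 11, 28, 3, 0, 0), datetime(2002, 12, 28, 3, 0, 0)):
        new_file = timestamped_name(link, stamp)
        with open(new_file, "w") as fh:
            fh.write("placeholder\n")
        repoint_link(link, new_file)
    # Old files are left in place, as in the description above.
    return data_dir, os.path.basename(os.readlink(link))
```

       After the second run the link points at the newer file, and
       the older file remains until it is cleaned out separately.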

       Most of the processing is done by two lower-level Perl
       scripts, 'create_word_hash' and 'create_probability_hash'.



       [ "normal" msg directory 1 ]   [ spam directory 1 ]
              :             :             :        :
       [ "normal" msg directory N ]   [ spam directory M ]
                      |                         |
                      |                         |
                      V                         V
               create_word_hash           create_word_hash
                      |                         |
                      |                         |
                      V                         V
              [ normalwordhash ]         [ spamwordhash ]
                      |                         |
                      |                         |
                      ---------------------------
                                   |
                                   |
                                   V
                         create_probability_hash
                                   |
                                   |
                                   V
                          [ probability hash ]
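
       To make the data flow in the diagram concrete, here is a
       rough sketch of the two stages.  The formula actually used
       by create_probability_hash is not given in this page; the
       sketch assumes a simple ratio of per-message word rates,
       clamped to the range 0.01 to 0.99.  The function names and
       sample messages are hypothetical.

```python
from collections import Counter


def count_words(messages):
    # Stand-in for create_word_hash: map each word to its occurrence count.
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def probability_hash(normal_counts, spam_counts, n_normal, n_spam):
    # Stand-in for create_probability_hash (assumed formula, see lead-in).
    probs = {}
    for word in set(normal_counts) | set(spam_counts):
        spam_rate = spam_counts[word] / n_spam        # Counter yields 0 if absent
        normal_rate = normal_counts[word] / n_normal
        p = spam_rate / (spam_rate + normal_rate)
        probs[word] = min(0.99, max(0.01, p))         # avoid hard 0 and 1
    return probs


normal = ["lunch at noon", "meeting notes attached"]
spam = ["free money now", "free offer at noon"]
probs = probability_hash(count_words(normal), count_words(spam),
                         len(normal), len(spam))
```

       Under these assumptions a word seen only in spam ("free")
       scores 0.99, a word seen only in normal mail ("lunch")
       scores 0.01, and a word equally common in both ("noon")
       scores 0.5.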



       Command-line options:

       -h Help; print the command-line options and exit.

       -v Be verbose.

       -c Specify a configuration file.  If '-c config-file' is
          omitted, the environment variable 'SPAMCONFIG' is used.


CONFIGURATION
       make_new_database's configuration file is composed of sim­
       ple keyword-value pairs, one pair per line.  Keywords  are
       not case-sensitive; keyword and value are separated by one
       or more spaces or tabs.  A comment symbol, '#', anywhere on
       a line causes all following text on that line to be ignored.
       sendmail_bayes stops scanning for a keyword at its first
       occurrence in the file.
       This  configuration file is shared by other database main­
       tenance and testing utilities, and the spam-filters  them­
       selves.
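
       The file format described above is simple enough to parse
       in a few lines.  The following is a minimal sketch of such
       a reader, not the parser used by the scripts themselves;
       the function names and all paths in the sample are
       hypothetical.

```python
def parse_config(text):
    # Return a dict mapping lower-cased keyword -> list of values, in file order.
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # '#' starts a comment anywhere
        if not line:
            continue
        parts = line.split(None, 1)            # split on the first run of spaces/tabs
        if len(parts) != 2:
            continue                           # keyword without a value: skipped here
        keyword, value = parts
        settings.setdefault(keyword.lower(), []).append(value.strip())
    return settings


def first_value(settings, keyword):
    # For single-valued keywords, only the first occurrence counts.
    values = settings.get(keyword.lower())
    return values[0] if values else None


SAMPLE = """\
# Hypothetical paths, for illustration only.
SpamDataDir          /var/spool/spamdata
normal_messages_dir  /home/alice/mail/saved   # processed recursively
normal_messages_dir  /home/bob/mail/saved
spam_messages_dir\t/var/spool/spam
probabilityhash      probhash
probabilityhash      ignored-second-occurrence
"""

config = parse_config(SAMPLE)
```

       Keeping every value in a list lets repeatable keywords
       accumulate, while first_value() gives the first-occurrence
       behaviour for keywords that take a single value.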


       normal_messages_dir <directory>
       Required.   The  name  of a directory containing "normal",
       non-spam, messages.
       There may be as many of these lines in  the  configuration
       file as desired.  Each directory is processed recursively,
       however, so no specified directory should be  beneath  any
       other specified.

       normalwordhash <filename>
       Required.  This is the name of the symbolic link to the
       actual normal-word hash.

       probabilityhash <filename>
       Required.  This is the name of the symbolic  link  to  the
       actual probability hash.

       spamwordhash <filename>
       Required.  This is the name of the symbolic link to the
       actual spam-word hash.

       spamdatadir <directory>
       Required.  The directory containing the normal-word hash,
       spam-word hash, probability hash, and the optional username
       hash.

       spam_messages_dir <directory>
       Required.  The name of a directory  containing  spam  mes­
       sages.
       There  may  be as many of these lines in the configuration
       file as desired.  Each directory is processed recursively,
       however,  so  no specified directory should be beneath any
       other specified.

       updatelockfile <filename>
       Required.  This is the name of  a  file  in  'spamdatadir'
       which  is  briefly created by make_new_database before re-
       creating the probability-hash symbolic  link.   Newly-cre­
       ated  filter  processes  will delay up to 5 seconds before
       beginning to run if this file has somehow been left in the
       directory; it should therefore be given an obviously "bad"
       name.
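
       The lock-file handshake described under 'updatelockfile'
       might look roughly like this.  It is a sketch only: the
       polling loop, the function names and the lock-file name are
       assumptions; only the 5-second upper bound comes from the
       description above.

```python
import os
import tempfile
import time


def wait_for_update(lock_path, max_wait=5.0, poll=0.1):
    # Filter side: wait while the lock file exists, but never longer than
    # max_wait seconds, so a stale lock cannot block the filter forever.
    start = time.monotonic()
    while os.path.exists(lock_path) and time.monotonic() - start < max_wait:
        time.sleep(poll)
    return time.monotonic() - start


def with_update_lock(lock_path, action):
    # Updater side: hold the lock file only while the link is being swapped.
    with open(lock_path, "w"):
        pass
    try:
        action()
    finally:
        os.remove(lock_path)


data_dir = tempfile.mkdtemp()
lock = os.path.join(data_dir, "UPDATE-IN-PROGRESS.lock")  # hypothetical name
seen = []
with_update_lock(lock, lambda: seen.append(os.path.exists(lock)))
```

       The lock exists only while the action runs.  If it is left
       behind, wait_for_update() returns after max_wait seconds
       rather than blocking indefinitely.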



ENVIRONMENT
       $SPAMCONFIG can be used to supply the full pathname of the
       configuration file.  (The '-c config-file' option overrides
       $SPAMCONFIG.)
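
       The precedence between '-c' and $SPAMCONFIG can be sketched
       as below.  The function name is hypothetical, and the error
       raised when neither source is set is an assumption; this
       page does not say what the script does in that case.

```python
import argparse


def resolve_config_path(argv, environ):
    # argparse supplies -h automatically; -v and -c mirror the documented options.
    parser = argparse.ArgumentParser(prog="make_new_database")
    parser.add_argument("-v", action="store_true", help="be verbose")
    parser.add_argument("-c", metavar="config-file", help="configuration file")
    args = parser.parse_args(argv)
    # The command-line option wins; the environment variable is the fallback.
    path = args.c or environ.get("SPAMCONFIG")
    if not path:
        # Assumed behaviour: exit with a usage error.
        parser.error("no configuration file: use -c or set SPAMCONFIG")
    return path
```

       In a real script this would be called with sys.argv[1:] and
       os.environ.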


FILES
       Required: a configuration file, at least one directory
       containing "normal" messages, and at least one directory
       containing spam messages.
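
       These requirements, together with the rule that no
       configured directory should lie beneath another, can be
       checked before a run.  The sketch below is one possible
       check, not part of make_new_database; the function names
       and paths are hypothetical, and nesting is tested
       separately for the normal and spam lists.

```python
import os


def nested_pairs(directories):
    # Return (outer, inner) pairs where inner lies beneath, or equals, outer.
    resolved = [os.path.realpath(d) for d in directories]
    pairs = []
    for i, outer in enumerate(resolved):
        for j, inner in enumerate(resolved):
            if i != j and os.path.commonpath([outer, inner]) == outer:
                pairs.append((directories[i], directories[j]))
    return pairs


def check_inputs(normal_dirs, spam_dirs):
    # Collect human-readable problems; an empty list means the inputs look sane.
    problems = []
    if not normal_dirs:
        problems.append("no normal_messages_dir given")
    if not spam_dirs:
        problems.append("no spam_messages_dir given")
    for outer, inner in nested_pairs(normal_dirs) + nested_pairs(spam_dirs):
        problems.append(f"{inner} is beneath {outer}")
    return problems
```

       Because each directory is processed recursively, a nested
       directory would be scanned twice and its messages counted
       twice, which is why the check reports it.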


SEE ALSO
       create_word_hash, create_probability_hash


COPYRIGHT
       Copyright (c) 2002, J.B.Ward
       <bward2@users.sourceforge.net>




Expaminator                Nov.28,2002       make_new_database(1)