Re: BayesIt filtering: Spam error high

DZ-Jay Sun, 08 Aug 2004 06:13:43 -0700

Some time around 08/07/2004 18:21:10, I think I heard Peter Kerekes say:
<!SNIP!>
> I haven't much clue what the lines in advance.ini do and therefore do not
> want to experiment with it.


> Can anyone suggest a change in any settings to improve filtration?

> Advance.ini file:

>> working thread priority="2"
>> onexit thread priority="3"
>> selective download spam threshold="10"
>> export selective download="1"
>> simple digits spam marks="1"
>> no spaces spam marks="1"
>> limit size to hash="19"
>> limit size to hash header="96"
>> temporary dictionary="c:\\temp"
>> use expiration="0"
>> age to expirate="100"
>> learn from zero="1"
>> max size of log file="131072"
>> recalculating strategy="3"
>> regarding threshold="1.5"
>> use autotrain="1"
>> use degeneration="1"
>> number of exclamations="5"
<!SNIP!>

Hello:
        I've been looking for help tuning my BayesIt installation for a while, but I 
can't seem to find much, even though I have searched the archives of this list and the 
web.  I used to use PopFile and became pretty proficient at tuning it, even editing 
the corpus by hand, but even though I find BayesIt much more competent (and accurate), 
I don't seem to understand many of its advanced features.

        I've read from many that the new Advanced.ini file contains comments from the 
developer explaining the various options, but mine (v5.5) does not.  The only "help" 
available from the BayesIt site is outdated and refers to an updated version on the 
RitLabs page, but it's in Russian, which I cannot read.  So, with the help of the fish 
(the Babelfish, that is), an online Russian-English dictionary, and a bit of deductive 
reasoning, I was able to translate it as best as I could.  It helped me a bit, so I 
thought it might help others too.

        Still, some explanations are a bit too technical, and they could use some 
finessing, so if anybody can help further, I (and others, I'm sure) will appreciate it 
immensely.  Technical or not, it's still more understandable to us non-Russian-speaking 
people.

;working thread priority (2)
;    Determines the priority of the base retraining process.
;    Retraining is carried out by the filter in the background mode
;    and it is usually imperceptible to the user. By default, the
;    value of this parameter (2) corresponds to the system parameter
;    "THREAD_PRIORITY_LOWEST".

;onexit thread priority (3)
;    If the user clicks the exit button in The Bat! during
;    retraining, the retraining process will acquire the indicated
;    priority. Usually, it is higher than the working priority. This
;    is necessary so that the current retraining operation reaches,
;    as soon as possible, a point where it is safe to interrupt the
;    process without risk of losing important data. By default, the
;    value of this parameter (3) corresponds to the system parameter
;    "THREAD_PRIORITY_NORMAL".

;export selective download (1)
;    When defined, the filter will export the collection of trigger
;    lines for the selective download filter. If the parameter is set
;    to "1", then the filter will create the file selective.txt in
;    the working folder, which will contain the constantly updated
;    list of regular expressions encountered in the headers of
;    spam-messages. If this parameter is set to "0", then no lists of
;    lines will be exported.

;selective download spam threshold (10)
;    Determines how many times a token must appear in the headers of
;    spam messages in order for it to be included in the file
;    selective.txt (see the previous parameter). It is recommended
;    that this number be chosen so that the size of the file
;    selective.txt does not exceed 40-50 KB. With larger lists of
;    trigger lines, The Bat! becomes unstable. Words are selected
;    into the file selective.txt based on the following criteria: the
;    word must appear in the headers of the message, must never have
;    been encountered in the headers of non-spam messages, and must
;    have been encountered n times in the headers of spam messages,
;    where n is the value of this parameter.
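
As a rough illustration of the selection rule above, here is a minimal Python sketch. It is not BayesIt's code: the function name, the whitespace tokenization, and reading "n times" as "at least n times" are all assumptions made for the example.

```python
from collections import Counter

def select_trigger_tokens(spam_headers, ham_headers, threshold=10):
    """Hypothetical helper: pick header tokens for selective.txt.

    A token qualifies if it was seen at least `threshold` times in spam
    headers and never in non-spam headers. Splitting on whitespace is an
    assumption; the real tokenizer is not documented here.
    """
    spam_counts = Counter(tok for hdr in spam_headers for tok in hdr.split())
    ham_tokens = {tok for hdr in ham_headers for tok in hdr.split()}
    return sorted(
        tok for tok, count in spam_counts.items()
        if count >= threshold and tok not in ham_tokens
    )
```

Raising the threshold shrinks the resulting list, which is the knob the recommendation about keeping selective.txt small refers to.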

;simple digits spam marks (1)
;    Allows HTML comments in messages of the form <!--2345-->
;    (i.e. consisting only of digits) to be treated as special
;    generalized technical tokens. Since such comments are
;    encountered mainly in spam messages, this special token can
;    substantially help during the analysis of some messages.

;no spaces spam marks (1)
;    Is analogous to the previous parameter; however, it treats as
;    special tokens not only numerical comments, but any comment
;    which does not contain whitespace.
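
To make the two comment-related options more concrete, the following Python sketch shows one way such comments could be detected. The token names and the function are invented for illustration; only the two rules (digits-only comments, and comments without whitespace) come from the descriptions above.

```python
import re

# Matches an HTML comment and captures its body.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def comment_marks(html, simple_digits=True, no_spaces=True):
    """Hypothetical helper: return one special token per suspicious comment.

    The token names below are made up for this sketch.
    """
    marks = []
    for body in COMMENT_RE.findall(html):
        if simple_digits and re.fullmatch(r"[0-9]+", body):
            marks.append("*DIGITS_COMMENT*")
        elif no_spaces and body and not any(ch.isspace() for ch in body):
            marks.append("*NOSPACE_COMMENT*")
    return marks
```

An ordinary comment such as `<!-- page footer -->` contains whitespace and produces no token in this sketch.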

;limit size to hash (19)
;    Allows you to assign a maximum length to the words which will be
;    stored in the base unchanged. If any word exceeds the assigned
;    length (for example, a pgp-signature), then it will be
;    automatically encoded into a hash and saved in the base in that
;    form instead of its original form.
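
A small Python sketch of the length limit, assuming that the hash is what ends up stored for over-long words. The hash function BayesIt uses is not documented here, so SHA-256, the "#" prefix, and the truncation to 16 characters are stand-ins chosen only for the example.

```python
import hashlib

def normalize_token(token, limit=19):
    """Hypothetical helper: keep short tokens, replace long ones by a hash.

    SHA-256, the '#' prefix and the 16-character truncation are
    assumptions for illustration, not BayesIt's actual scheme.
    """
    if len(token) <= limit:
        return token
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return "#" + digest[:16]
```

The same function with `limit=96` would model the larger limit applied to header tokens.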

;limit size to hash header (96)
;    Assigns a similar length to the tokens from the headers of
;    messages. This parameter is greater than for tokens from the
;    body in order to improve the possibilities of the export of
;    trigger lines for the "selective download" filter.

;temporary dictionary ("c:\\.temp")
;    Assigns the directory where the filter will store its temporary
;    working files in such a case when it is not possible to find a
;    system temporary folder intended for this purpose. This
;    directory will only be used in such critical case, but it is not
;    used for the normal work of the filter.

;use expiration (0)
;age to expirate (100)
;    These are not used in the current version of the filter;
;    however, they must be present in the file.

;learn from zero (1)
;    Defines whether the filter will take into account all messages,
;    including those not specifically marked as not-spam, so that it
;    can be trained "from zero", i.e. without any existing analysis
;    base. Otherwise, to train for the first time, at least one
;    message must be marked as spam and at least one as not-spam.

;max size of log file (131072)
;    Defines the maximum size to which the filter's log file can
;    grow. When this size is exceeded, the file will be automatically
;    renamed to the same filename prefixed with the "~" (tilde)
;    symbol, and writing will proceed in a new file. Please note that
;    the size of the log file CAN exceed the given number, albeit, as
;    a rule, temporarily (the file size is checked after writing the
;    last record into the file).
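
The "check after writing" behaviour explains why the log can briefly exceed the limit. A minimal Python sketch of that order of operations follows; the function name is hypothetical and the real plugin may differ in details such as how an older "~" file is handled.

```python
import os

def append_log(path, record, max_size=131072):
    """Hypothetical helper: append a record, then rotate if over the limit."""
    # Write first ...
    with open(path, "a", encoding="utf-8", newline="\n") as log:
        log.write(record + "\n")
    # ... and only then check the size, so the file may exceed max_size
    # by up to one record before it is renamed.
    if os.path.getsize(path) > max_size:
        folder, name = os.path.split(path)
        # Overwriting any previous "~" file is an assumption of this sketch.
        os.replace(path, os.path.join(folder, "~" + name))
```

After a rotation, the next call simply creates a fresh log file under the original name.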

;recalculating strategy (3)
;    Determines the quantity of messages that can accumulate before
;    an automatic retraining of the filter's base is triggered. In an
;    ideal scenario, training would occur after each message is
;    processed; however, with sufficiently large bases this leads to
;    the unjustified expenditure of the computer's resources.
;    Moreover, with a large base, frequent retraining is futile. This
;    parameter allows you to set the behavior of the filter in these
;    situations. Values are interpreted as follows: if this number is
;    more than or equal to 1, then it is rounded off to the nearest
;    whole number, and is used as the absolute quantity of messages
;    which must be received/marked for training to start. The default
;    value is 3 - i.e. messages are trained in bundles of at least 3
;    messages. It is also possible to assign values from 0 to 1 - for
;    example, 0.001. Such values allow you to assign a threshold
;    quantity of messages for retraining, depending on the current
;    size of the base. The formula is simple: (quantity of messages
;    in the spam base + quantity of messages in the not-spam base) *
;    recalculating strategy. In other words, if we have 20,000
;    not-spam messages and 10,000 spam messages, then the result is
;    (20000 + 10000) * 0.001 = 30 messages. This makes it possible to
;    reduce the rate of retraining even more.
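
The two ways of reading this value can be written down as a short Python sketch. The function name is made up, and rounding the fractional result to a whole number of messages is an assumption; the description above only gives the formula.

```python
def retrain_threshold(strategy, spam_count, ham_count):
    """Hypothetical helper: messages to accumulate before retraining.

    Only positive values are handled; what the filter does with 0 or
    negative values is not covered by the description.
    """
    if strategy >= 1:
        # Absolute count, rounded to the nearest whole number.
        return int(strategy + 0.5)
    # Fraction of the total base size (rounding here is an assumption).
    return int((spam_count + ham_count) * strategy + 0.5)

print(retrain_threshold(3, 10000, 20000))      # default setting
print(retrain_threshold(0.001, 10000, 20000))  # the worked example above
```

With the default of 3 the base size is ignored, while 0.001 scales the batch size with the base, giving 30 messages for the 30,000-message example.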

;regarding threshold (1.5)
;    Determines by how many times the not-spam part of the base "is
;    heavier" than the spam part. This artificial asymmetry of the
;    base makes it possible to practically completely avoid false
;    positives when messages are being marked as "spam". When this
;    value is set to 1.0, the bases will be completely symmetrical;
;    however, this is not recommended in view of possible false
;    positives. It is best to leave this parameter as is (1.5) or
;    even to increase it to 2.0. Please note: if this parameter is
;    equal to zero, then ALL entering messages will be regarded as
;    spam, since the not-spam part of the base, being multiplied by
;    zero, will result in zero. If you notice that the filter is
;    marking all your incoming messages as spam, then this parameter
;    is the first thing you should check, if the base is relatively
;    balanced.
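
BayesIt's actual scoring formula is not given here, so the following Python sketch is only an assumption about how such a weight could work. It follows the general shape of the per-token estimate in Paul Graham's "A Plan for Spam" (which doubles the good counts) and uses the threshold as the multiplier for the not-spam side. All names are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham,
                           regarding_threshold=1.5):
    """Hypothetical per-token estimate with a weighted not-spam side.

    Assumes n_spam and n_ham are both greater than zero.
    """
    spam_freq = min(1.0, spam_hits / n_spam)
    # The not-spam frequency is made "heavier" by the threshold.
    ham_freq = min(1.0, regarding_threshold * ham_hits / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.4  # default for unknown tokens, per "use degeneration"
    return spam_freq / (spam_freq + ham_freq)

for weight in (0.0, 1.0, 1.5, 2.0):
    print(weight, token_spam_probability(10, 10, 100, 100, weight))
```

In this model a token seen equally often on both sides scores exactly 0.5 at a weight of 1.0, drops below 0.5 as the weight grows, and jumps to 1.0 at a weight of zero, which matches the warning about all messages being regarded as spam.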

;use autotrain (1)
;    Defines whether the filter will automatically mark incoming
;    messages as spam/not-spam on the basis of their analysis. By
;    default, the filter is trained automatically; you only need to
;    correct its errors by manually marking misclassified messages as
;    spam/not-spam. However, there can be situations when it may be
;    necessary to disable automatic training. For example, if you
;    decide to install an additional antispam plugin, BayesIt will
;    not be able to train itself properly, since in this case the
;    overall probability of a message will be influenced not only by
;    its own "decision", but also by the probability from the second
;    plugin. In such cases, this parameter should be set to "0" and
;    training of the filter should be done only manually (this means that
;    you will have to mark ALL incoming messages as spam/not-spam
;    accordingly).

;use degeneration (1)
;    Determines a working policy for the filter in the event that a
;    token was not found in the base when analysing the message. If
;    degeneration is switched off, the filter will use the default
;    probability (0.4). If it is switched on, then the filter will
;    also check the base for all possible variations of the tested
;    token (such as changes in capitalization, presence in the
;    body/headers of messages, the token with several exclamation
;    marks added to it, etc.) and it will select the probability of
;    the variation which gives the result that is farthest from
;    neutral. By default, degeneration is on; however, in some
;    situations its use forces the filter to search for the best
;    variant among approximately 8 times more tokens than are
;    actually contained in the message. Therefore, on slow machines,
;    turning this parameter off can increase performance.

;number of exclamations (5)
;    Defines a maximum quantity of exclamation marks (!), which will
;    be consecutively added to a token while generating its variants
;    during degeneration. If degeneration is switched off, then this
;    parameter is ignored.
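
Here is a minimal Python sketch of how degeneration and the exclamation limit might fit together. It is a guess at the idea, not the plugin's implementation: the variant list covers only capitalization and exclamation marks (the body/header variation is left out), "neutral" is assumed to mean 0.5, and all names are invented.

```python
def degenerate_variants(token, max_exclamations=5):
    """Hypothetical helper: case and exclamation-mark variants of a token."""
    stem = token.rstrip("!")
    forms = {token, stem, stem.lower(), stem.capitalize()}
    return {form + "!" * n for form in forms
            for n in range(max_exclamations + 1)}

def lookup_probability(token, known_probs, use_degeneration=True,
                       max_exclamations=5, default=0.4):
    """Look a token up; fall back to variants, then to the default."""
    if token in known_probs:
        return known_probs[token]
    if not use_degeneration:
        return default
    found = [known_probs[v]
             for v in degenerate_variants(token, max_exclamations)
             if v in known_probs]
    if not found:
        return default
    # Pick the variant whose probability is farthest from neutral (0.5).
    return max(found, key=lambda p: abs(p - 0.5))
```

Even this reduced variant set produces up to a couple of dozen lookups per unknown token, which gives a feel for why the option can cost performance on slow machines.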

        dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4


________________________________________________
Current version is 2.12.00 | 'Using TBUDL' information:
http://www.silverstones.com/thebat/TBUDLInfo.html
