Yassen Damyanov wrote:
> On Friday 10 December 2004 18:21, Kris Deugau wrote:
> > I've
> > had trouble in the past with Bayes learning very low-scoring spam
> > as ham - so I lowered the autolearn-as-ham threshold to -0.1.

> I came to a conclusion that some real spams got scored very low and
> poisoned the Bayes db, so it started to make mistakes and thus
> poisoned itself even more, then made worse mistakes, ... etc.

Yep.  For a while about a year ago I was seeing spams coming in that
were only hitting a rule called HTML_ONLY (IIRC).  This rule scored
0.1.  My autolearn threshold at the time was the default 0.2, and so
these spams were getting learned as ham.
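
To make the arithmetic concrete, here is a toy Python sketch of that
decision.  The function name and the 12.0 spam-side threshold are made
up for illustration, and SA's real autolearn logic has more conditions
than this; only the 0.1 score and the 0.2 / -0.1 thresholds come from
the story above.

```python
def autolearn_decision(score, ham_threshold=0.2, spam_threshold=12.0):
    """Toy model of an autolearn decision (not SpamAssassin's actual code).

    Returns "ham", "spam", or None when the message is left unlearned.
    spam_threshold=12.0 is an assumed value used only for illustration.
    """
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None


# A spam that only hit a 0.1-point rule:
print(autolearn_decision(0.1))                      # learned as ham
print(autolearn_decision(0.1, ham_threshold=-0.1))  # left alone
```

With the ham threshold at -0.1, a message has to score below zero
before it gets autolearned as ham, so a lone 0.1-point hit no longer
slips through.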

I dropped the autolearn threshold, and I've been fortunate to have ~2%
of my users regularly forward FNs to me (properly, no less!) for manual
learning so I've been able to feed quite a few such messages back in as
spam.  (Learning a message as spam after it has already been learned as
ham should leave the Bayes db looking as if that message had never been
learned as ham.  That's the theory, at least, and so far it's worked for
me.)
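
A rough way to picture that relearning behaviour, as a toy token-count
model in Python.  This is NOT how SA stores its Bayes data; the class
and method names are hypothetical, and it only illustrates the "forget
the old label, then learn the new one" idea described above.

```python
from collections import Counter


class ToyBayesDB:
    """Hypothetical, simplified token store; not SpamAssassin's format."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.seen = {}  # message id -> (label, tokens) as last learned

    def learn(self, msg_id, tokens, label):
        tokens = set(tokens)
        previous = self.seen.get(msg_id)
        if previous is not None:
            old_label, old_tokens = previous
            if old_label == label:
                return  # already learned this way, nothing to do
            # Undo the earlier learning before applying the new label.
            self.counts[old_label].subtract(old_tokens)
        self.counts[label].update(tokens)
        self.seen[msg_id] = (label, tokens)


db = ToyBayesDB()
db.learn("msg-1", ["cheap", "pills"], "ham")   # bad autolearn
db.learn("msg-1", ["cheap", "pills"], "spam")  # manual correction
print(db.counts["ham"]["cheap"], db.counts["spam"]["cheap"])
```

After the correction the ham count for each token is back to zero and
the spam count is one, as if the message had only ever been learned as
spam.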

> I switched off the auto_learning (not needed IMHO when we have
> regular manual learning sessions) and then deleted the old database
> and reran the manual learning script.

Mmmh.  I've left autolearn enabled in order to leave me with less work. 
You *do* have to make sure you get feedback, but if you have a regular
process in place for learning a block of legit mail and a block of spam
(confirmed by hand-sorting) then it may not be quite as important.

I think the biggest problem many people have with Bayes is the early
maintenance - if you make a few tweaks (like lowering the
autolearn-as-ham threshold) and watch it closely for two to three weeks,
you should end up with a pretty good start.

I've *never* had to wipe the global Bayes files on the two systems I've
been running SA on at work;  they've been running Bayes since shortly
after 2.54 was released.  (I waited a little to upgrade from 2.44 on
these machines; 2.50 thru 2.53 ended up being a little unstable and
flaky IIRC.)

I've never had to wipe the Bayes files on my personal email on my own
server.  *BUT*....  In all three cases I've watched my email (or
customer feedback regarding their email) and I've looked ahead a little
to try to foresee problems and fix them before they become BIG problems.

> BTW, how to interpret things like "tests=BAYES_56" or
> "tests=BAYES_00" in the X-Spam-Status header?

That's just the list of tests that triggered or matched, out of the
complete list of tests SA runs.  Depending on your setup, the
appropriate score for that test is added to the total.

The BAYES_* tests are a probability check - based on the tokens in the
message compared with the tokens in the DB, the message is considered to
have a probability n% of being spam.  There are sets of Bayes scores for
different probability ranges;  see the SA docs and default rule files to
see exactly where the boundaries are.
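
If you want to pull those out of a header programmatically, here is a
small Python sketch.  The sample header line and both function names
are made up for illustration, and reading the numeric suffix as a rough
percentage is my assumption; the rule files remain the authority on
the exact ranges.

```python
import re


def triggered_tests(status):
    """Return the list of test names from an X-Spam-Status value."""
    # Folded headers may break the list after a comma; rejoin it first.
    unfolded = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S*)", unfolded)
    if not match or match.group(1) in ("", "none"):
        return []
    return match.group(1).split(",")


def bayes_bucket(tests):
    """Return the numeric suffix of the first BAYES_nn test, or None."""
    for name in tests:
        m = re.fullmatch(r"BAYES_(\d+)", name)
        if m:
            return int(m.group(1))
    return None


# Hypothetical header value, for illustration only.
sample = "No, hits=1.3 required=5.0 tests=BAYES_56,HTML_MESSAGE autolearn=no"
tests = triggered_tests(sample)
print(tests, bayes_bucket(tests))
```

A BAYES_00 hit comes back as 0 (very likely ham according to Bayes) and
BAYES_56 as 56 (leaning slightly towards spam).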

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!
