Yassen Damyanov wrote:
> On Friday 10 December 2004 18:21, Kris Deugau wrote:
> > I've had trouble in the past with Bayes learning very low-scoring
> > spam as ham - so I lowered the autolearn-as-ham threshold to -0.1.
> I came to a conclusion that some real spams got scored very low and
> poisoned the Bayes db, so it started to make mistakes and thus
> poisoned itself even more, then made worse mistakes, ... etc.

Yep.  For a while about a year ago I was seeing spams coming in that
were only hitting a rule called HTML_ONLY (IIRC).  This rule scored
0.1.  My autolearn threshold at the time was the default 0.2, and so
these spams were getting learned as ham.  I dropped the autolearn
threshold, and I've been fortunate to have ~2% of my users regularly
forward FNs to me (properly, no less!) for manual learning so I've been
able to feed quite a few such messages back in as spam.  (Learning a
message as spam when it's previously been learned as ham will
effectively result in the Bayes db appearing as if that message had
never been learned as ham - in theory, and so far it's worked for me.)

> I switched off the auto_learning (not needed IMHO when we have
> regular manual learning sessions) and then deleted the old database
> and rerun the manual learning script.

Mmmh.  I've left autolearn enabled in order to leave me with less work.
You *do* have to make sure you get feedback, but if you have a regular
process in place for learning a block of legit mail and a block of spam
(confirmed by hand-sorting) then it may not be quite as important.

I think the biggest problem many people have with Bayes is the early
maintenance - if you make a few tweaks (like lowering the
autolearn-as-ham threshold) and watch it closely for two to three
weeks, you should end up with a pretty good start.

I've *never* had to wipe the global Bayes files on the two systems I've
been running SA on at work; they've been running Bayes since shortly
after 2.54 was released.  (I waited a little to upgrade from 2.44 on
these machines, 2.50 thru 2.53 ended up being a little unstable and
flaky IIRC.)  I've never had to wipe the Bayes files on my personal
email on my own server.  *BUT*....
In all three cases I've watched my email (or customer feedback
regarding their email) and I've looked ahead a little to try to
foresee problems and fix them before they become BIG problems.

> BTW, how to interpret things like "tests=BAYES_56" or
> "tests=BAYES_00" in the X-Spam-Status header?

That's just the list of tests that triggered or matched, out of the
complete list of tests SA runs.  Depending on your setup, the
appropriate score for each test is added to the total.

The BAYES_* tests are a probability check - based on the tokens in the
message compared with the tokens in the DB, the message is considered
to have a probability n% of being spam.  There are sets of Bayes scores
for different probability ranges; see the SA docs and default rule
files to see exactly where the boundaries are.

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!
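
P.S. For anyone who wants to pull the tests= list out of an X-Spam-Status header programmatically, here is a rough Python sketch. The sample header value is made up for illustration (the field layout varies between SA versions and local configs), and the function names are my own, not part of SA. It only lists which tests matched; it does not try to guess the probability range behind each BAYES_* name, since those boundaries live in the rule files.

```python
import re

def parse_spam_status(value):
    """Split an X-Spam-Status header value into verdict, fields, and tests.

    Assumes the common "Yes|No, key=value key=value ..." layout; local
    configurations may differ.
    """
    # Headers are often folded across lines; collapse all whitespace runs.
    flat = " ".join(value.split())
    verdict, _, rest = flat.partition(",")
    # Folding can leave a space after a comma inside the tests list.
    rest = re.sub(r",\s+", ",", rest)
    fields = dict(re.findall(r"(\w+)=(\S*)", rest))
    tests = [t for t in fields.get("tests", "").split(",") if t]
    return verdict.strip(), fields, tests

def bayes_tests(tests):
    """Return only the Bayes probability tests from a list of test names."""
    return [t for t in tests if t.startswith("BAYES_")]

# Hypothetical header value, folded the way a mail server might write it.
sample = ("No, hits=0.1 required=5.0 tests=BAYES_56,\n"
          "\tHTML_ONLY autolearn=no version=2.64")

verdict, fields, tests = parse_spam_status(sample)
print(verdict, fields["required"], tests, bayes_tests(tests))
```

Running this over a folder of known false negatives is one way to spot spam that matched only a single low-scoring rule, which is the pattern that caused the autolearn-as-ham trouble described above.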