Re: The trouble with Bayes

Jim Maul 6 May 2005 14:16:49 -0000

Paul Boven wrote:

Hi Jim,
Jim Maul wrote:
Paul Boven wrote:
Bayes is a very powerful system, especially for recognising site-specific ham. But at this moment, approx. 30% of the spam that slips through my filter has 'autolearn=ham' set. And another 60% of the spam slipping through has a negative Bayes score to help it along. For the moment, I've disabled the autolearning in my Bayes system.
If your system is autolearning 30% of the spam as ham, it is seriously screwed up.
No, fortunately that's not the case. Of all the spam that slips through (which is still just below 1%), about a third not only manages to slip through, but even gets autolearned the wrong way.


Ok so not seriously screwed up, only mildly screwed up ;)

It only autolearns when it's pretty damn sure of its classification of the message in question. A bad bayes database will only continue to get worse if left alone. The trick is starting out good with the learning and it's cake from there. On some systems it's even less of an issue. I've maybe manually sa-learn'ed 20-30 messages ever in a little over a year using SA. Everything else has been autolearned. It's rare that I see bayes scores other than _00 and _99. I'd say my bayes db is pretty damn accurate at this point, and it's done most of it on its own. Now keep in mind that I've altered the scores of some rules (bayes mostly) and I've also adjusted the autolearn thresholds for my system. I've upped the spam and lowered the ham numbers so nothing will be autolearned unless SA is REALLY sure it knows what it's doing. I'd tend to think it's easier to tweak the system a bit than to change the way bayes/autolearning works... but hey, that's just me.
Thanks for your response. What thresholds have you set for autolearning, and how exactly do you do your retraining? How many users does your SpamAssassin setup have?


bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0

Note that the -0.1 for the ham threshold will cause almost no messages to be autolearned unless you are running a lot of negative scoring rules. I have some, but not a lot.
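
To see why -0.1 shuts off nearly all ham autolearning, here is a rough Python sketch of the threshold comparison. It is a simplification, not SpamAssassin's actual code: the function name `autolearn_decision` is made up for illustration, the exact handling of scores that land precisely on a threshold is an assumption, and the real autolearner applies further checks that are ignored here (as far as the documentation describes, it scores the message without the Bayes rules and wants a minimum number of header and body points before learning as spam).

```python
def autolearn_decision(score, ham_threshold=-0.1, spam_threshold=10.0):
    """Return 'ham', 'spam' or 'no' for a message score.

    Simplified illustration only; boundary handling is an assumption.
    Defaults are the thresholds quoted in this post, not SA's defaults.
    """
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"


# A clean message that hits no rules scores 0.0, which is not below -0.1,
# so it is left alone unless a negative-scoring rule pulls it down.
for score in (-2.5, 0.0, 4.9, 7.0, 12.3):
    print(score, autolearn_decision(score))
```

Anything between the two thresholds, including spam tagged at 5 but scoring under 10, comes out as "no" and is left for manual training.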

How exactly do I do the retraining? I have my mail account set to leave the messages on the server. Whenever I have a message which needs training, I just ssh into the server and run sa-learn on the message in my mailbox. (I haven't had a missed spam in months, so the only things that need training are spam that was tagged but not autolearned, because my autolearn threshold is set to 10 and my tagging happens at 5.)
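
The manual step described above could be wrapped roughly like this. It is a sketch only: `sa_learn_command`, `retrain` and the example path are hypothetical, and it assumes `sa-learn` is on the PATH and is run as the user who owns the Bayes database. `--spam`, `--ham` and `--mbox` are standard sa-learn options.

```python
import subprocess


def sa_learn_command(kind, path, mbox=False):
    """Build the sa-learn argument list for one message file or one mbox."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    cmd = ["sa-learn", "--" + kind]
    if mbox:
        # Treat the file as an mbox and learn every message in it.
        cmd.append("--mbox")
    cmd.append(path)
    return cmd


def retrain(kind, path, mbox=False):
    """Run sa-learn; raises CalledProcessError if it exits non-zero."""
    subprocess.run(sa_learn_command(kind, path, mbox), check=True)


# Example with a hypothetical path:
#   retrain("spam", "/path/to/tagged-but-not-learned.mbox", mbox=True)
```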

We have about 100 users and about 2k messages/day

Over here, the auto-learning thresholds are still at their default values (though I've disabled auto-learning for now), re-training is done by sending the offending message back to the filter in a Message/RFC822 attachment, and there are about 90 users using the system. My Bayes database is in fairly good shape, but some kinds of spam have managed to get themselves a negative score.

I realize that my setup is smaller than most, so it's easier for me to keep an eye on the system to see what is autolearned. Autolearning errors need to be corrected immediately or things start to snowball in a bad way.

So basically, I think the best practice (at least for my situation) is to leave autolearning on, but adjust the thresholds so things don't get learned in either direction unless absolutely sure (really low/really high scores). Then, everything that gets an autolearn=no can be manually trained in the correct direction. As always, bayes can work wonders, but it needs a little hand holding... don't give up on it just yet ;)
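
One way to find the autolearn=no messages for manual review is to scan a mailbox for that marker in the X-Spam-Status header. The following is a minimal sketch assuming an mbox-format mailbox and the usual `autolearn=...` field in that header; the function names are invented for illustration, and deciding which direction to train each message is still a human job.

```python
import mailbox
import re

AUTOLEARN_RE = re.compile(r"\bautolearn=(\w+)")


def autolearn_status(status_header):
    """Extract the autolearn value from an X-Spam-Status header, or None."""
    if not status_header:
        return None
    match = AUTOLEARN_RE.search(str(status_header))
    return match.group(1) if match else None


def needs_manual_training(mbox_path):
    """Yield (subject, X-Spam-Status) for messages marked autolearn=no."""
    # create=False: do not silently create an empty mailbox on a typo'd path.
    for msg in mailbox.mbox(mbox_path, create=False):
        status = str(msg.get("X-Spam-Status", ""))
        if autolearn_status(status) == "no":
            yield str(msg.get("Subject", "(no subject)")), status
```

Each message it yields can then be checked by eye and fed to sa-learn as spam or ham.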

-Jim
