At 11:59 AM 1/30/2004, PieterB wrote:
Shouldn't a message that is identified as spam by the bayesian
filter of spamassassin (BAYES_90 or BAYES_99 in my case) never be
used as a message that is learned as ham?  (I would expect it
not to be used for learning because it wouldn't improve the
bayesfilter, and training it as ham makes the bayesian filter
perform worse in future). Am I missing something?

You're missing a fair bit about how Bayes works at a fundamental level: you really DO want to train on spam that already hits BAYES_99.

You need to remember that Bayes doesn't learn "a message". It breaks the message up into little pieces (tokens) and learns those.

Training on spam that already matches BAYES_99 is a perfectly reasonable and in fact GOOD thing to do, and it can improve the filter. Just because the overall probability is high doesn't mean there's nothing left to learn. There are likely still a few tokens that were never learned before, and those tokens could be key in identifying future spam.
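
To make the token point concrete, here is a minimal sketch in Python. It is a toy model, not SpamAssassin's actual tokenizer or database: the names tokenize, learn_spam and unseen_tokens and the sample messages are made up for illustration. It shows that a mutated spam can share enough known tokens to score high while still carrying tokens the database has never seen.

```python
import re
from collections import Counter

# Hypothetical toy database: token -> number of trained spam messages containing it.
spam_counts = Counter()

def tokenize(text):
    # Assumed, simplified tokenizer: lowercase runs of letters, digits and a few symbols.
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

def learn_spam(text):
    # Learning works per token, not per message.
    spam_counts.update(tokenize(text))

def unseen_tokens(text):
    # Tokens in this message that the database knows nothing about yet.
    return {t for t in tokenize(text) if t not in spam_counts}

learn_spam("cheap pills online, buy now")

# A mutated variant: "cheap" and "online" are already known, the rest is new.
mutated = "cheap p1lls online, order today"
new_tokens = unseen_tokens(mutated)

# Training the variant teaches the new tokens, so the next mutation has less room to hide.
learn_spam(mutated)
```

Here new_tokens holds the three tokens the first training pass never saw; after the second learn_spam call nothing in the variant is unknown any more.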

This is particularly true because spam mutates over time. As little nuances are introduced, it's important to train them so that the scores stay high as the spam continues to mutate.

The only thing that's bad is letting Bayes feed back on itself, i.e. using BAYES_99 as a reason to autolearn is a _bad_ thing. If you do that, one mistake in your Bayes DB will self-amplify.
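
A rough Python simulation of that self-amplification, under loud assumptions: the smoothed per-token estimate, the naive product combining and the 0.9 threshold are all simplifications chosen for illustration and are not how SpamAssassin computes or combines token probabilities. The function names and the sample tokens are hypothetical.

```python
from collections import Counter

# Hypothetical toy database: per-token spam and ham counts.
spam, ham = Counter(), Counter()

def token_prob(tok):
    # Smoothed spam probability for one token; an unknown token gets 0.5.
    return (spam[tok] + 1) / (spam[tok] + ham[tok] + 2)

def score(tokens):
    # Naive product combining of the per-token probabilities (an assumption).
    p_spam = p_ham = 1.0
    for t in tokens:
        p = token_prob(t)
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

def autolearn_on_own_score(tokens, threshold=0.9):
    # The bad idea: the filter's own verdict decides what it learns as spam.
    if score(tokens) > threshold:
        spam.update(tokens)
        return True
    return False

# One mistake in the database: "meeting" was wrongly trained as spam.
spam["meeting"] = 20

before = score(["agenda", "budget"])   # neither token is known yet

# Ordinary ham mentioning "meeting" keeps arriving and keeps getting self-learned.
for _ in range(10):
    autolearn_on_own_score(["meeting", "agenda"])

after = score(["agenda", "budget"])    # "agenda" is now tainted as well
```

In this toy run the score for a message that never mentions "meeting" climbs from neutral to above the autolearn threshold, purely because the filter trusted its own earlier mistake.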

Spamassassin-talk mailing list