Matt Kettler writes:
> Michael Monnerie wrote:
> > http://lwn.net/SubscriberLink/173910/e7bf95a7cb044637/
> >
> > They are wondering why bayes_99 is not given 5 points by default, as it
> > seems to have no FP.
>
> Statistically speaking, 1% of BAYES_99 hits should be nonspam. In reality,
> it does a lot better than that.
>
> However, in the SA 3.1.0 set3 mass checks it still managed to match
> about 21 messages in the nonspam test set:
>
> OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>   176869   123778    53091  0.700   0.00   0.00  (all messages)
>   60.712  86.7351   0.0396  1.000   0.90   3.50  BAYES_99
>
> SA's scores aren't based on human assumptions about how the rules
> behave. They are based on real-world testing and a perceptron
> score-fitting system that accounts not only for the hit-rate of the
> rule, but also for the combinations of rules that it tends to match
> with. Often the reality is a lot more complex than you think.
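As a quick sanity check on the "about 21 messages" figure, the numbers in the quoted table can be multiplied out. This is only an arithmetic sketch; the variable names are mine, and the inputs are copied straight from the table above.

```python
# Figures copied from the quoted SA 3.1.0 set3 mass-check table.
total_spam = 123778         # spam messages in the corpus
total_ham = 53091           # nonspam (ham) messages in the corpus
bayes99_spam_pct = 86.7351  # SPAM% column for BAYES_99
bayes99_ham_pct = 0.0396    # HAM% column for BAYES_99

# Convert the percentages back into message counts.
spam_hits = total_spam * bayes99_spam_pct / 100
ham_hits = total_ham * bayes99_ham_pct / 100

# Share of all BAYES_99 hits that landed on nonspam mail.
ham_fraction_of_hits = ham_hits / (spam_hits + ham_hits)

print(f"ham hits: {ham_hits:.1f}")
print(f"ham share of BAYES_99 hits: {ham_fraction_of_hits:.4%}")
```

The ham count rounds to 21, and the ham share of all BAYES_99 hits comes out well under the nominal 1%, which matches Matt's "a lot better than that".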
It's important to note that, without good training, BAYES_99 may indeed
fire regularly on nonspam mail -- that's the danger with user-trained
rules. In the *default* scenario, therefore, a score of 3.5 is
reasonably optimal.

However, if good training is supplied, it's a good plan to increase the
BAYES_99 score to 5.0, or even more. (I think we might mention that
somewhere in the documentation -- I hope. ;)

Also, it's worth noting that "BAYES_99" doesn't really refer to a 1%
probability. SpamAssassin uses the Fisher Inverse Chi-Square Procedure
described at http://garyrob.blogs.com/whychi90.pdf , and as a result
these are no longer true probability values -- so don't expect to see
probabilistic distributions.

Great articles btw. The grumpy editor has outdone himself ;)

[I've posted this as a comment on the story already btw.]

--j.
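To illustrate why the combined value is an indicator rather than a true probability, here is a rough sketch of Fisher-style inverse chi-square combining, written the way it is commonly presented for Bayesian spam filters. It is not SpamAssassin's actual code: the function names `chi2q` and `combine` and the token probabilities are made up for the example, and each per-token value is assumed to lie strictly between 0 and 1.

```python
import math

def chi2q(x2, v):
    """Upper-tail probability of a chi-square value x2 with v degrees of
    freedom, using the closed-form series valid for even v."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities (each strictly in (0, 1))
    into a single spam indicator in [0, 1]."""
    n = len(probs)
    # Evidence for spam: Fisher's statistic on the (1 - p) values.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: Fisher's statistic on the p values.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + s - h) / 2.0

# Hypothetical token probabilities, for illustration only.
print(combine([0.99] * 10))              # strongly spammy tokens
print(combine([0.5] * 10))               # uninformative tokens
print(combine([0.99] * 5 + [0.01] * 5))  # conflicting evidence
```

With these inputs, strongly one-sided evidence is pushed toward the extremes, while conflicting evidence lands near the middle, so the output behaves as a confidence indicator and not as a calibrated probability.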