Re: SpamAssassin tested by lwn.net

Justin Mason Thu, 02 Mar 2006 07:23:32 -0800

Matt Kettler writes:
> Michael Monnerie wrote:
> > http://lwn.net/SubscriberLink/173910/e7bf95a7cb044637/
> >
> > They are wondering why bayes_99 is not given 5 points by default, as it 
> > seems to have no FP.
> 
> Statistically speaking, 1% of BAYES_99 hits should be nonspam. In reality,
> it does a lot better than that.
> 
> However, in the SA 3.1.0 set3 mass checks it still managed to match
> about 21 messages in the nonspam test set:
> 
> OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
>   176869   123778    53091    0.700   0.00    0.00  (all messages)
>   60.712  86.7351   0.0396    1.000   0.90    3.50  BAYES_99
> 
> SA's scores aren't based on human assumptions about how the rules
> behave. They are based on real-world testing and a perceptron
> score-fitting system that accounts not only for the hit-rate of the
> rule, but also for the combinations of rules that it tends to match
> with. Often the reality is a lot more complex than you think.
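
As a quick sanity check on the quoted table, the "about 21 messages" figure
can be recomputed from the HAM% column, and the same numbers give the share
of BAYES_99 hits that were nonspam, to compare against the nominal 1%. This
is only a sketch: the constants are copied from the table above, and the
variable and function names are made up for illustration.

```python
# Figures copied from the quoted mass-check table (SA 3.1.0, set3).
TOTAL_SPAM = 123778          # spam messages in the corpus
TOTAL_HAM = 53091            # nonspam messages in the corpus
BAYES_99_SPAM_PCT = 86.7351  # percent of spam that hit BAYES_99
BAYES_99_HAM_PCT = 0.0396    # percent of nonspam that hit BAYES_99


def hits(total, pct):
    """Convert a percentage of a corpus into a message count."""
    return total * pct / 100.0


ham_hits = hits(TOTAL_HAM, BAYES_99_HAM_PCT)
spam_hits = hits(TOTAL_SPAM, BAYES_99_SPAM_PCT)

# Fraction of all BAYES_99 hits that landed on nonspam mail.
ham_share_of_hits = ham_hits / (ham_hits + spam_hits)

print(f"nonspam hits: {ham_hits:.1f}")
print(f"share of BAYES_99 hits that were nonspam: {ham_share_of_hits:.4%}")
```

The nonspam count rounds to 21, and the nonspam share of hits comes out well
under 1%, which matches the "a lot better than that" remark.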


It's important to note that, without good training, BAYES_99 may indeed
fire regularly on nonspam mail -- that's the danger with user-trained
rules.  In the *default* scenario, therefore, a score of 3.5 is reasonably
optimal.   However, if good training is supplied, it's a good plan to
increase the BAYES_99 score to 5.0, or even more.  (I think we might
mention that somewhere in the documentation -- I hope. ;)

Also, it's worth noting that "BAYES_99" doesn't really refer to a 1%
probability.   SpamAssassin uses the Fisher Inverse Chi-Square Procedure
described at http://garyrob.blogs.com/whychi90.pdf , and as a result these
are no longer true probability values -- so don't expect to see
probabilistic distributions.
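
To make that concrete, here is a minimal sketch of chi-square combining in
the style Gary Robinson describes. It is an illustration only, not
SpamAssassin's actual code: the function names are hypothetical, and the
per-token values are assumed to be already computed and strictly between 0
and 1. The output is an indicator between 0 and 1 that gets pushed hard
toward the extremes when the evidence agrees, which is why it should not be
read as a calibrated probability.

```python
import math


def chi2q(x2, v):
    """Upper-tail probability of a chi-square value x2 with v degrees of
    freedom. Only valid for even v, which is all the combiner needs."""
    if v <= 0 or v % 2 != 0:
        raise ValueError("v must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)


def combine(token_probs):
    """Combine per-token spam indicators (each strictly in (0, 1)) into a
    single indicator in [0, 1]. Hypothetical helper for illustration."""
    n = len(token_probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_spam = sum(math.log(1.0 - p) for p in token_probs)
    ln_ham = sum(math.log(p) for p in token_probs)
    spam_ind = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # near 1 for spammy mail
    ham_ind = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)    # near 1 for hammy mail
    return (1.0 + spam_ind - ham_ind) / 2.0


print(combine([0.99] * 10))     # strongly spammy tokens
print(combine([0.99, 0.01]))    # conflicting evidence
```

Ten tokens at 0.99 produce a combined value far closer to 1 than 0.99, while
conflicting tokens cancel out to roughly 0.5, so the "99" in BAYES_99 is
really a threshold on this indicator rather than a 1% error rate.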

Great articles btw.  The grumpy editor has outdone himself ;)
[I've posted this as a comment on the story already btw.]

--j.
