jdow wrote:
> And it is scored LESS than BAYES_95 by default. That's a clear signal
> that the theory behind the scoring system is a little skewed and needs
> some rethinking.

No. It does not mean there's a problem with the scoring system. It
means you're trying to apply a simple linear model to something which is
inherently neither linear nor simple. This is a VERY common misconception.

Please bear with me for a minute as I explain some things.

This is more or less the same misconception as expecting rules with
higher S/O ratios to always score higher than those with lower S/O ratios.
Generally this is true, but other factors can cause the opposite to be
true.

The score of a rule in SA is not a function of the performance of that
one rule, nor should it be. The score of an SA rule is a function of the
combinations of other rules it matches alongside. This creates a
"real world fit" of a complex set of rules against real-world behavior.

This complex interaction between rules results in most of the "problems"
people see. People inherently expect simple linearity. However, consider
that SA scoring is a system of equations in several hundred variables,
attempting to approximate an optimal fit to a sampling of human
behavior. Why, based on that, would you ever expect the scores of two
of those hundreds of variables to be linear as a function of spam hit rate?

It is perfectly reasonable to assume that most of the mail matching
BAYES_99 also matches a large number of the stock spam rules that SA
comes with. These highly-obvious mails are the model after which most SA
rules are made in the first place. Thus, these mails need less score
boost, as they already have a lot of score from other rules in the ruleset.

However, mails matching BAYES_95 are more likely to be "trickier", and
are likely to match fewer other rules. These messages are more likely to
require an extra boost from BAYES_95's score than those which match
BAYES_99.
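
To make that concrete, here is a toy sketch in Python. Everything in it
is made up for illustration: the rule names RULE_A/B/C, their scores, the
four-message corpus and the helper needed_boost() are all hypothetical,
and this is not how SA's scores are actually generated. It only shows how
the rule with the lower hit confidence can end up needing the larger
score once the other rules are taken into account.

```python
# Toy illustration only: all scores and rule hits below are invented.
THRESHOLD = 5.0  # score at which a message is tagged as spam

# Hypothetical scores for the non-Bayes rules.
OTHER_SCORES = {"RULE_A": 1.5, "RULE_B": 1.2, "RULE_C": 1.0}

# Toy spam corpus: (Bayes rule hit, other rules hit by the same message).
SPAM = [
    ("BAYES_99", ["RULE_A", "RULE_B", "RULE_C"]),  # obvious spam, many hits
    ("BAYES_99", ["RULE_A", "RULE_B"]),
    ("BAYES_95", ["RULE_A"]),                      # trickier spam, few hits
    ("BAYES_95", ["RULE_C"]),
]


def needed_boost(corpus, bayes_rule):
    """Smallest score for bayes_rule that lifts every spam message
    matching it up to THRESHOLD, given the other rules' scores."""
    gaps = [
        THRESHOLD - sum(OTHER_SCORES[name] for name in others)
        for rule, others in corpus
        if rule == bayes_rule
    ]
    # Never return a negative score; 0.0 if no message matches the rule.
    return max(0.0, max(gaps, default=0.0))


if __name__ == "__main__":
    for rule in ("BAYES_99", "BAYES_95"):
        print(rule, "needs about", round(needed_boost(SPAM, rule), 2))
```

In this made-up corpus the BAYES_99 mails already collect 2.7 or more
points from other rules, so BAYES_99 needs less of a boost than BAYES_95,
whose mails collect 1.5 points at most. A real fit also has to keep
nonspam below the threshold, which this sketch ignores.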







