jdow wrote:
> And it is scored LESS than BAYES_95 by default. That's a clear signal
> that the theory behind the scoring system is a little skewed and needs
> some rethinking.
No. It does not mean there's a problem with the scoring system. It means you're trying to apply a simple linear model to something which is inherently neither linear nor simple.

This is a VERY common misconception. Please bear with me for a minute as I explain some things. It is more or less the same misconception as expecting rules with higher S/O's to always score higher than those with lower S/O's. Generally this is true, but there's more to consider, and that can cause the opposite to be true.

The score of a rule in SA is not a function of the performance of that one rule, nor should it be. The score of an SA rule is a function of the combinations of other rules it hits alongside. This creates a "real world fit" of a complex set of rules against real-world behavior.

This complex interaction between rules is behind most of the "problems" people see. People inherently expect simple linearity. However, consider that SA scoring is an equation in several hundred variables, attempting to approximate an optimal fit to a sampling of human behavior. Why, based on that, would you ever expect the scores of two of those hundreds of variables to be linear as a function of spam hit rate?

It is perfectly reasonable to assume that most of the mail matching BAYES_99 also matches a large number of the stock spam rules that SA comes with. These highly obvious mails are the model after which most SA rules are made in the first place. Thus, these mails need less of a score boost, as they already pick up a lot of score from other rules in the ruleset.

However, mails matching BAYES_95 are more likely to be "trickier", and are likely to match fewer other rules. These messages are more likely to need the extra boost from BAYES_95's score than those which match BAYES_99.
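To make the argument above concrete, here is a toy sketch in Python. It is not how SA actually generates its scores; it only asks, for a made-up corpus, how much score each BAYES rule would have to contribute to lift every spam it hits up to the threshold, given what the other rules already contribute. The rule names RULE_A, RULE_B and RULE_C, their scores, and the four messages are all invented for illustration, and the sketch ignores ham and false positives entirely.

```python
THRESHOLD = 5.0  # SA's default required_score

# Hypothetical fixed scores for the non-Bayes rules (names and values invented).
OTHER_SCORES = {"RULE_A": 2.0, "RULE_B": 1.5, "RULE_C": 1.0}

# Hypothetical spam corpus: each message is the set of rules it hit.
# BAYES_99 mails are "obvious" and hit many rules; BAYES_95 mails hit few.
SPAM = [
    {"BAYES_99", "RULE_A", "RULE_B", "RULE_C"},
    {"BAYES_99", "RULE_A", "RULE_B"},
    {"BAYES_95", "RULE_C"},
    {"BAYES_95", "RULE_A"},
]


def needed_score(rule, corpus):
    """Smallest score for `rule` that lifts every spam it hits to THRESHOLD."""
    shortfalls = []
    for hits in corpus:
        if rule in hits:
            # Score this message already gets from every other rule it hit.
            other = sum(OTHER_SCORES.get(r, 0.0) for r in hits if r != rule)
            shortfalls.append(THRESHOLD - other)
    # The worst-off message decides; never go below zero.
    return max(0.0, max(shortfalls, default=0.0))


if __name__ == "__main__":
    for rule in ("BAYES_99", "BAYES_95"):
        print(rule, needed_score(rule, SPAM))
```

With these invented numbers, BAYES_99 only needs 1.5 while BAYES_95 needs 4.0, even though BAYES_99 is the "stronger" indicator taken on its own. A real fit balances this against false positives over a much larger corpus, so treat it as an illustration of the interaction and nothing more.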