Adam Katz wrote:
> On Mon, 6 Nov 2006, John D. Hardin wrote:
>
>> The default scores are generated by analyzing their performance
>> against hand-categorized corpora of actual emails. If a rule hits
>> spam often and ham rarely, it will be given a higher score than one
>> that hits spam often and ham occasionally.
>
> That sounds very Bayesian ... with Bayesian rules already doing that
> sort of logic, I would hope there is more human thinking put into
> score setting.
Actually, in this case, a little human thinking will mislead you. You're seeing only a tiny slice of the overall picture.

Fundamentally, which rules hit and don't hit for spam is not some kind of linear equation. It's a function of human behaviors, of weird quirks written into a spam-generation tool by its author. None of this is subject to simple rules like "10% is worse than 20%". When you start to realize this, you'll start to understand the scoring process... just a little.

Now consider that the rules are not scored individually. They're scored as a collective set: a single equation in hundreds of variables, all of which are tweaked simultaneously to achieve a "best fit" to real-world data. This makes the score of one rule a function not just of its own behavior, but of the other rules as well. A rule might perform very well on its own, yet match all the same spam as another rule. If that other rule matches just slightly fewer nonspams, there's a dramatic shift in score to favor the better of the two. Thousands of smaller-scale instances of such balancing occur throughout the scoring process.

If you put a LOT of human thinking into it, you'll come to understand what's going on, but you really have to think about the big picture here.
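To make the "collective set" point concrete, here is a minimal sketch of a joint fit over two hypothetical rules. This is NOT SpamAssassin's actual score generator (which uses its own perceptron/GA tooling against real corpora); it's an illustrative logistic-regression fit on made-up data, showing how a rule that hits the same spam as another rule but also hits some ham ends up with a lower score, purely because the weights are fitted together:

```python
import math

# Tiny synthetic corpus (hypothetical data, for illustration only).
# Each row is ((rule_A_hit, rule_B_hit), label) with label 1 = spam.
# Rules A and B fire on exactly the same spam, but B also fires on some ham.
data = [
    ((1, 1), 1), ((1, 1), 1), ((1, 1), 1), ((1, 1), 1),  # spam: both rules hit
    ((0, 0), 0), ((0, 0), 0), ((0, 0), 0),               # ham: neither rule hits
    ((0, 1), 0), ((0, 1), 0),                            # ham: only rule B hits
]

w = [0.0, 0.0]  # one "score" per rule, fitted jointly
b = 0.0         # bias term
lr = 0.5        # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain stochastic gradient descent on the log-loss: every weight is
# adjusted in view of every message, so the rules compete for credit.
for _ in range(2000):
    for x, y in data:
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        err = sigmoid(z) - y
        b -= lr * err
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]

# Rule A never hits ham, so the joint fit shifts score toward it,
# even though A and B are identical on the spam side.
print(w[0] > w[1])
```

Run it and the comparison prints `True`: rule A ends up with the larger weight, not because it matched more spam (it didn't), but because rule B's ham hits dragged B's score down in the joint optimization. That's the small-scale version of the balancing described above.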