Adam Katz wrote:
> On Mon, 6 Nov 2006, John D. Hardin wrote:
>   
>> The default scores are generated by analyzing their performance
>> against hand-categorized corpora of actual emails. If a rule hits spam
>> often and ham rarely, it will be given a higher score than one that
>> hits spam often and ham occasionally.
>>     
>
> That sounds very Bayesian ... with Bayesian rules already doing that sort
> of logic, I would hope there is more human thinking put into score
> setting. 

Actually, in this case, a little human thinking will mislead you. You're
seeing only a tiny part of the overall picture.

Fundamentally, which rules do and don't hit spam is not some kind of
linear equation. Hits are a function of human behaviors and of weird
quirks of the code written into a spam generation tool by its author.
None of this is subject to simple rules like "10% is worse than 20%".

When you start to realize this, you'll start to understand the scoring
process... just a little.

Now consider that the rules are not scored individually. They're scored
as a collective set: a single equation in hundreds of variables, all of
which are simultaneously tweaked to achieve a "best fit" to real-world data.

This means the score of one rule is a function not just of its own
behavior, but also of the behavior of every other rule.
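A toy sketch of that idea (not SpamAssassin's actual optimizer, just an
illustration of how one objective couples every score together; the
function name, threshold, and data layout here are all made up):

```python
import numpy as np

def corpus_error(scores, hits, is_spam, threshold=5.0):
    """Misclassification rate over a whole corpus.

    hits[i, j] = 1 if rule j matched message i; is_spam[i] is 1 for spam,
    0 for ham. Every rule's score contributes to every message it hits.
    """
    totals = hits @ scores                   # each message's summed score
    flagged = totals >= threshold            # flagged as spam or not
    return np.mean(flagged != is_spam.astype(bool))

# Two messages, two rules: both rules hit the spam, neither hits the ham.
hits = np.array([[1.0, 1.0],
                 [0.0, 0.0]])
is_spam = np.array([1, 0])

print(corpus_error(np.array([3.0, 3.0]), hits, is_spam))  # 0.0: both right
print(corpus_error(np.array([1.0, 1.0]), hits, is_spam))  # 0.5: spam missed
```

Tweaking any one score changes `totals` for every message that rule hits,
so the "best" value for one score depends on all the others.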

A rule might perform very well, but it might also match all the same
spam as another rule. If that other rule matches just slightly fewer
nonspams, the scores shift dramatically to favor the better of the two.
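You can watch that shift happen with a joint fit. This is a hypothetical
demonstration, not SpamAssassin's real fitting code: two made-up rules hit
exactly the same spam, but one also hits more hams, and a simple
logistic-style gradient descent over both scores at once favors the
cleaner rule:

```python
import numpy as np

# RULE_A and RULE_B both hit all 200 spams...
spam_hits = np.tile([1.0, 1.0], (200, 1))

# ...but RULE_A hits only 5 of 200 hams, while RULE_B hits 25.
ham_hits = np.zeros((200, 2))
ham_hits[:5, 0] = 1.0
ham_hits[:25, 1] = 1.0

X = np.vstack([spam_hits, ham_hits])               # hit matrix
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = spam, 0 = ham

# Fit BOTH scores simultaneously against the whole corpus.
scores = np.zeros(2)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(X @ scores - 2.0)))  # predicted P(spam)
    scores -= 0.5 * (X.T @ (p - y)) / len(y)       # both scores move together

print(scores)  # the cleaner RULE_A ends up with the larger score
```

Viewed in isolation, both rules look like strong spam signs; fit jointly,
the weight piles onto the one with the better ham record.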

And thousands of smaller-scale instances of this kind of balancing occur
simultaneously in the scoring process.

If you put a LOT of human thinking into it, you'll come to understand
what's going on, but you really have to think about the big picture here.


