Bowie Bailey wrote:
> Matt Kettler wrote:
>> It is perfectly reasonable to assume that most of the mail matching
>> BAYES_99 also matches a large number of the stock spam rules that SA
>> comes with. These highly-obvious mails are the model after which
>> most SA rules are made in the first place. Thus, these mails need
>> less score boost, as they already have a lot of score from other
>> rules in the ruleset. 
>>
>> However, mails matching BAYES_95 are more likely to be "trickier",
>> and are likely to match fewer other rules. These messages are more
>> likely to require an extra boost from BAYES_95's score than those
>> which match BAYES_99.
> 
> I can't argue with this description, but I don't agree with the
> conclusion on the scores.
> 
> The Bayes rules are not individual unrelated rules.  Bayes is a series
> of rules indicating a range of probability that a message is spam or
> ham.  You can argue over the exact scoring, but I can't see any reason
> to score BAYES_99 lower than BAYES_95.  Since a BAYES_99 message is
> even more likely to be spam than a BAYES_95 message, it should have at
> least a slightly higher score. 

No, it should not. I've given a conclusive reason why it may not always be
higher. That reason has a solid statistical basis, and it is supported by
real-world testing and real-world data.

You've given your opinion to the contrary, but no facts to support it other than
declaring the rules to be related, and therefore that the score should correlate
with the Bayes-calculated probability of spam.

I don't disagree with you that BAYES_99 scoring lower than BAYES_95 is
counter-intuitive, but I do not believe intuition alone is a reason to defy
reality.

If there are other rules with better performance (i.e., fewer FPs) that
consistently coincide with the hits of BAYES_99, those rules should soak up the
lion's share of the score. However, if there are a lot of spam messages with no
other rules hit, BAYES_99 should get a strong boost from those.

The perceptron results show that the former is largely true. BAYES_99 is mostly
redundant. To back it up, I'm going to verify it with my own maillog data.

Looking at my own current real-world maillogs, BAYES_99 matched 6,643 messages
last week. Of those, only 24 had total scores under 9.0. (With BAYES_99 scoring
3.5, it would take a message with a total score of less than 8.5 to drop below
the threshold of 5.0 if BAYES_99 were omitted entirely, so 9.0 is a slightly
generous cutoff.)

So less than 0.37% of BAYES_99's hits actually mattered on my system last week.

BAYES_95, on the other hand, hit 468 messages, 20 of which scored less than 9.0.
That's about 4.3% of messages with BAYES_95 hits, a considerably larger
percentage. Bring the cutoff down to 8.0 to compensate for the score difference
and you still get 17 messages, which is still a much larger 3.6% of its hits.

On my system, a BAYES_95 hit is significant in pushing mail over the spam
threshold roughly 10 times more often than a BAYES_99 hit is.
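
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the percentages from the counts quoted above. The helper name `pct` is made up for illustration; the only inputs are the hit counts already given in this message.

```shell
# pct HITS TOTAL: print HITS as a percentage of TOTAL, to two decimals.
# (Hypothetical helper, not part of SpamAssassin or MailScanner.)
pct() {
    awk -v h="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * h / t }'
}

pct 24 6643   # BAYES_99 hits with a total score under 9.0
pct 20 468    # BAYES_95 hits with a total score under 9.0
pct 17 468    # BAYES_95 hits with a total score under 8.0

# Ratio of the two rates, using the 8.0 cutoff for BAYES_95.
awk 'BEGIN { printf "%.1f\n", (17 / 468) / (24 / 6643) }'
```

This prints 0.36, 4.27, 3.63 and 10.1, which is where the "roughly 10 times" figure comes from.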

What are your results?

These are the greps I used, based on MailScanner log formats. Should work for
spamd users, perhaps with slight modifications.

zgrep BAYES_99 maillog.1.gz |wc -l
zgrep BAYES_99 maillog.1.gz | grep -v "score=[1-9][0-9]\." | grep -v "score=9\." | wc -l
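
If you want both counts in one pass, or a different cutoff, something like the following sketch may be easier to adapt. It assumes only what the greps above assume: each relevant log line contains the rule name and a "score=N.N" field. The function name `count_hits` is made up for illustration, and lines with no parsable score are counted as hits but not as low-scoring, which differs slightly from the grep -v approach.

```shell
# count_hits RULE CUTOFF: read log lines on stdin and print
# "<lines mentioning RULE> <of those, lines with score below CUTOFF>".
# Hypothetical helper; assumes a "score=N.N" field on each matching line.
count_hits() {
    awk -v rule="$1" -v cutoff="$2" '
        index($0, rule) {
            total++
            # Extract the number after "score=" and compare it numerically,
            # so three-digit and negative scores are handled too.
            if (match($0, /score=-?[0-9]+(\.[0-9]+)?/)) {
                score = substr($0, RSTART + 6, RLENGTH - 6) + 0
                if (score < cutoff) low++
            }
        }
        END { printf "%d %d\n", total, low }
    '
}
```

Usage would be along the lines of: zcat maillog.1.gz | count_hits BAYES_99 9.0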

