Matt Kettler wrote:
> Bowie Bailey wrote:
> >
> > The Bayes rules are not individual unrelated rules. Bayes is a
> > series of rules indicating a range of probability that a message is
> > spam or ham. You can argue over the exact scoring, but I can't see
> > any reason to score BAYES_99 lower than BAYES_95. Since a BAYES_99
> > message is even more likely to be spam than a BAYES_95 message, it
> > should have at least a slightly higher score.
>
> No, it should not. I've given a conclusive reason why it may not
> always be higher. My reasoning has a solid statistical basis, and it
> is supported by real-world testing and real-world data.
>
> You've given your opinion to the contrary, but no facts to support it
> other than declaring the rules to be related, and therefore that the
> score should correlate with the Bayes-calculated probability of spam.
>
> While I don't disagree with you that BAYES_99 scoring lower than
> BAYES_95 is counter-intuitive, I do not believe intuition alone is a
> reason to defy reality.
>
> If there are other rules with better performance (i.e., fewer FPs) that
> consistently coincide with the hits of BAYES_99, those rules should
> soak up the lion's share of the score. However, if there are a lot of
> spam messages where no other rules hit, BAYES_99 should get a strong
> boost from those.
>
> The perceptron results show that the former is largely true: BAYES_99
> is mostly redundant. To back this up, I'm going to verify it with my
> own maillog data.
>
> Looking at my own current real-world maillogs, BAYES_99 matched 6,643
> messages last week. Of those, only 24 had total scores under 9.0.
> (With BAYES_99 scoring 3.5, it would take a message with a total score
> of less than 8.5 to drop below the threshold of 5.0 if BAYES_99 were
> omitted entirely.)
>
> So less than 0.37% of BAYES_99's hits actually mattered on my system
> last week.
>
> BAYES_95, on the other hand, hit 468 messages, 20 of which scored less
> than 9.0. That's 4.2% of messages with BAYES_95 hits, a considerably
> larger percentage. Bring the cutoff down to 8.0 to compensate for the
> score difference and you still get 17 messages, which is still a much
> larger 3.6% of its hits.
>
> On my system, BAYES_95 is significant in pushing mail over the spam
> threshold 10 times more often than BAYES_99 is.
>
> What are your results?
>
> These are the greps I used, based on MailScanner log formats. They
> should work for spamd users, perhaps with slight modifications.
>
> zgrep BAYES_99 maillog.1.gz | wc -l
> zgrep BAYES_99 maillog.1.gz | grep -v "score=[1-9][0-9]\." | grep -v "score=9\." | wc -l
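The two greps above can be combined into a single script that reports the "hits that actually mattered" percentage directly. A minimal sketch, using plain `grep` on a small inline sample (the file name `maillog.sample` and its log lines are made up for illustration; on a real system you would run `zgrep` against `maillog.1.gz` as shown above):

```shell
#!/bin/sh
# Hypothetical sample lines in MailScanner's "score=" log format.
printf '%s\n' \
  'spam detected, BAYES_99, score=12.3' \
  'spam detected, BAYES_99, score=9.1'  \
  'spam detected, BAYES_99, score=8.4'  \
  'spam detected, BAYES_99, score=4.9'  > maillog.sample

# Total BAYES_99 hits.
total=$(grep -c 'BAYES_99' maillog.sample)

# Hits under 9.0: drop scores 10.0-99.x, then 9.x.
# (tr strips the leading whitespace BSD wc emits.)
low=$(grep 'BAYES_99' maillog.sample \
      | grep -v 'score=[1-9][0-9]\.' \
      | grep -v 'score=9\.' \
      | wc -l | tr -d ' ')

# Percentage of hits where the rule's points could be decisive.
awk -v t="$total" -v l="$low" \
    'BEGIN { printf "%d of %d hits (%.1f%%) under 9.0\n", l, t, 100 * l / t }'
```

On the sample data this reports 2 of 4 hits under 9.0; on real logs the percentage is what decides whether BAYES_99 is earning its score.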
I think we are arguing from slightly different viewpoints. You are saying that higher scores are not needed since the lower score is made up for by other rules.

I have 13,935 hits for BAYES_99. 412 of them are lower than 9.0. This seems to be caused by either AWL hits lowering the score or very few other rules hitting. BAYES_95 hit 469 times, with 18 hits lower than 9.0. This means that, for me, BAYES_95 is significant slightly more often, percentage-wise, than BAYES_99. But considering volume, I would say that BAYES_99 is the more useful rule.

However, that's not what I was arguing about to begin with. Because of the way the Bayes algorithm works, I should be able to have more confidence in a BAYES_99 hit than a BAYES_95 hit. Therefore, it should have a higher score. Otherwise, you get the very strange occurrence that if you train Bayes too well and the spams go from BAYES_95 to BAYES_99, the SA score actually goes down. The better you train your Bayes database, the more confidence it should have in picking out the spams. As the scoring moves from BAYES_50 up to BAYES_99, the SA score should increase to reflect the higher confidence level of the Bayes engine.

-- 
Bowie
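For reference, the "slightly more often, percentage-wise" comparison above is just this arithmetic (nothing here beyond the counts already quoted):

```shell
#!/bin/sh
# Percentage of hits scoring under 9.0, from the counts in the message above.
p99=$(awk 'BEGIN { printf "%.2f", 100 * 412 / 13935 }')   # BAYES_99: 412 of 13,935
p95=$(awk 'BEGIN { printf "%.2f", 100 * 18  / 469   }')   # BAYES_95: 18 of 469
echo "BAYES_99: ${p99}% under 9.0; BAYES_95: ${p95}% under 9.0"
```

So BAYES_95 edges out BAYES_99 on percentage (about 3.84% vs. 2.96%), while BAYES_99 dwarfs it on raw volume.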