On Apr 13, 2006, at 9:56 AM, mouss wrote:


I am also seeing a lot of legit mail triggering some SA rules (*_exess, no_real_name, x_library, ...). When I see this, I check the rule, and if I can't find a justification, I disable it.


I wouldn't do that.

Just because legitimate mail triggers some rule doesn't mean that the rule is flawed. Using your example, triggering "no_real_name" does not mean that the message is spam, it means that the message has _some_ similarity to at least some spam messages (the higher the score, the stronger the similarity). And, that's absolutely true: statistically, when looking at the corpus which was used to create the rules database, a higher percentage of "no_real_name" messages were spam.
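To illustrate why a single rule hit is not a verdict, here is a minimal Python sketch of additive scoring. The score values, the rule name "hypothetical_strong_rule", and the 5.0 threshold are assumptions made up for illustration; they are not the values any particular SA release ships with.

```python
# Illustrative scores only (assumed); real values come from the SA rule set.
RULE_SCORES = {
    "no_real_name": 0.6,
    "x_library": 1.2,
    "hypothetical_strong_rule": 4.0,  # made-up rule name for this example
}
THRESHOLD = 5.0  # assumed score at which a message is flagged as spam


def total_score(hits):
    """Sum the scores of every rule the message triggered."""
    return sum(RULE_SCORES[name] for name in hits)


def is_flagged(hits):
    """A message is flagged only when the combined score reaches the threshold."""
    return total_score(hits) >= THRESHOLD
```

With these assumed numbers, a legit message that triggers only "no_real_name" stays far below the threshold, which is why the hit by itself is harmless.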

Now, if legit messages were not just triggering those rules, but also triggering enough rules to be flagged as spam ... then I would lower the value of those rules, but not disable those rules. But I would only do that if I could see that there was a large percentage of should-be-ham messages being flagged as spam by that rule AND that rule wasn't being useful in flagging spam messages. The reason is: if the message is being flagged, but it shouldn't have been, then perhaps my "corpus" of messages differs significantly enough from the SA internal corpus that my score values need to be different. But that doesn't mean that the rules are so disjoint from tracking spam that they should be entirely disabled. They just don't have the same weighting that my corpus needs.

If, instead, most messages passing through my mail servers that triggered that rule really did seem to be spam, then I wouldn't alter the score at all. I would just pass the should-have-been-ham message into my Bayesian learner and hope that a low Bayes score for messages like that would offset the rules that had flagged it as spam.
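The decision process described above could be sketched roughly as follows in Python. The function name, its parameters, and the two cutoff fractions are hypothetical choices for illustration; what counts as "a large percentage" is a judgment call for each site.

```python
def advise(ham_hits, spam_hits, false_positives, fp_share=0.2, spam_share=0.5):
    """Suggest what to do with one rule, based on local mail counts.

    ham_hits:        legit messages that triggered the rule
    spam_hits:       spam messages that triggered the rule
    false_positives: legit messages that triggered the rule AND were flagged as spam
    fp_share, spam_share: assumed cutoffs, not values taken from SA
    """
    if false_positives > ham_hits:
        raise ValueError("false_positives cannot exceed ham_hits")
    total = ham_hits + spam_hits
    if total == 0 or false_positives == 0:
        # Legit mail merely triggering the rule is not a problem.
        return "keep"
    if spam_hits / total >= spam_share:
        # The rule is still useful locally: leave the score alone.
        return "keep score, train bayes on the false positives"
    if false_positives / ham_hits >= fp_share:
        # Local mail differs from the corpus: reweight, do not disable.
        return "lower score"
    return "keep"
```

Note that none of the branches returns "disable": in this sketch the strongest action is lowering the score.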
