On Apr 13, 2006, at 9:56 AM, mouss wrote:
> I am also seeing many legit mail triggering some SA rules (*_exess,
> no_real_name, x_library, ...). when I see this, I check the rule, and
> if I can't find a justification, I disable it.
I wouldn't do that.
Just because legitimate mail triggers some rule doesn't mean that the
rule is flawed. Using your example, triggering "no_real_name" does not
mean that the message is spam, it means that the message has _some_
similarity to at least some spam messages (the higher the score, the
stronger the similarity). And that's absolutely true: statistically,
when looking at the corpus that was used to create the rules database,
a higher percentage of "no_real_name" messages were spam.
Now, if legit messages were not just triggering those rules, but also
triggering enough rules to be flagged as spam ... then I would lower
the scores of those rules, but not disable them. But I would
only do that if I could see that a large percentage of
should-be-ham messages were being flagged as spam because of that rule
AND that rule wasn't being useful in flagging spam messages. The reason is: if the
message is being flagged, but it shouldn't have been, then perhaps my
"corpus" of messages differs significantly enough from the SA internal
corpus that my score values need to be different. But that doesn't
mean that the rules are so disjoint from tracking spam that they should
be entirely disabled. They just don't have the same weighting that my
corpus needs.
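A minimal sketch of that decision, in Python, may help. Everything in it is
an assumption made for illustration: the RuleStats fields, the suggest_action
helper, the two thresholds, and the tallies are invented, not anything that
SpamAssassin provides. It only encodes the two conditions described above and
never suggests disabling a rule.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Hypothetical local tallies for a single rule (made-up structure)."""
    ham_hits: int          # legit messages that triggered the rule
    ham_hits_flagged: int  # of those, how many ended up flagged as spam
    spam_hits: int         # spam messages that triggered the rule
    spam_total: int        # all spam messages seen locally


def suggest_action(stats, fp_limit=0.10, usefulness_floor=0.05):
    """Return 'lower' or 'keep'; the thresholds are arbitrary assumptions."""
    # Share of legit hits that were wrongly flagged as spam.
    fp_share = stats.ham_hits_flagged / stats.ham_hits if stats.ham_hits else 0.0
    # Share of local spam that the rule actually catches.
    spam_share = stats.spam_hits / stats.spam_total if stats.spam_total else 0.0
    # Lower the score only when BOTH conditions hold.
    if fp_share > fp_limit and spam_share < usefulness_floor:
        return "lower"
    return "keep"


# Many false positives and little use against spam: lower the score.
print(suggest_action(RuleStats(200, 60, 30, 1000)))
# Many false positives, but the rule still catches a lot of spam: keep it.
print(suggest_action(RuleStats(200, 60, 400, 1000)))
```

The point of the two-part test is that a rule can misfire on ham and still be
pulling its weight on spam, in which case the score is left alone.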
If, instead, most messages passing through my mail servers that
triggered that rule really did seem to be spam, then I wouldn't alter
the score at all. I would just pass the should-have-been-ham message
into my Bayesian learner and hope that a low Bayes score for messages
like that would offset the rules that had flagged it as spam.
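To illustrate how that offset works, here is a rough Python sketch of the
score arithmetic. The per-rule scores below are invented for the example and
are not SpamAssassin's shipped values, SOME_OTHER_RULE is a hypothetical rule
name, and the is_flagged helper is my own. The only things taken from SA are
the idea that rule scores add up, that a low Bayes result (BAYES_00) carries a
negative score, and the default threshold of 5.0.

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default required_score

# Illustrative scores only; real values depend on the SA version and config.
SCORES = {
    "NO_REAL_NAME": 1.0,
    "X_LIBRARY": 2.0,
    "SOME_OTHER_RULE": 2.5,  # hypothetical rule name
    "BAYES_00": -2.5,        # low Bayes spam probability, negative score
}


def is_flagged(hits, scores=SCORES, required=REQUIRED_SCORE):
    """Sum the scores of the rules that hit and compare to the threshold."""
    total = sum(scores[name] for name in hits)
    return total >= required, total


# Before training: the three rules alone push the message over the threshold.
print(is_flagged(["NO_REAL_NAME", "X_LIBRARY", "SOME_OTHER_RULE"]))

# After the learner has seen similar ham, BAYES_00 also hits and the
# negative score pulls the total back under the threshold.
print(is_flagged(["NO_REAL_NAME", "X_LIBRARY", "SOME_OTHER_RULE", "BAYES_00"]))
```

None of the rule scores were touched here; the message is rescued purely by
the extra negative contribution from the Bayes result.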