mouss wrote:

...

The approach is flawed. A single word shouldn't be enough to tag mail as spam.

Furthermore, even checking for word boundaries may not help much against the OEM spammers. Several of them do quite a bit of obfuscation work to try to bypass the kind of simple filtering that the OP is asking for. One that I'm seeing right now is "Office2OO8": no boundary between the word and the number, and the numeric zeros replaced with the letter "O".
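To make the obfuscation point concrete, here is a minimal sketch in Python (not SpamAssassin rule syntax) of a pattern that tolerates the zero/letter-O swap and a missing space. The LOOKALIKES table and the obfuscation_pattern helper are my own illustration; only the o/0 swap comes from what I'm actually seeing.

```python
import re

# Hypothetical look-alike table: each character maps to the set of characters
# a spammer might substitute for it. Only the o/0 swap is from real traffic.
LOOKALIKES = {
    "o": "[o0]",
    "0": "[0o]",
}

def obfuscation_pattern(phrase: str) -> "re.Pattern[str]":
    """Build a case-insensitive regex that matches `phrase` even when
    look-alike characters are swapped in or the spaces are dropped."""
    parts = []
    for ch in phrase.lower():
        if ch.isspace():
            parts.append(r"\s*")  # the space may be missing entirely
        else:
            parts.append(LOOKALIKES.get(ch, re.escape(ch)))
    return re.compile("".join(parts), re.IGNORECASE)

OFFICE_2008 = obfuscation_pattern("office 2008")

print(bool(OFFICE_2008.search("Buy Office2OO8 now")))    # obfuscated form
print(bool(OFFICE_2008.search("The office is closed")))  # plain word only
```

A pattern like this is still only one signal. It narrows what gets hit, but it doesn't change the argument below about scoring.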

Remember that the approach of SpamAssassin is not to write single rules that force a big-bang reject whenever a particular rule is hit (although I have a few of those, because my rules are well-crafted, and I know my servers' traffic flows).

The general approach is to use cumulative scores -- the higher the score, the more likely the message is to be spam. However, a relatively high score doesn't necessarily mean that the message is spam, and a low score doesn't mean that the message isn't spam. Spam-fighting at this level is as much art as science, and the spammers are a moving target who go to great lengths to make their stuff indistinguishable from legitimate mail.

Thus, in deciding how aggressive to be in fighting spam, a lot comes down to your (and your users') tolerance for accidental rejection of something that somebody wants. This is part of the approach of SpamAssassin, in that the score returned is a best-guess opinion, based on which rules were hit and their cumulative score.

Depending on how you have SA implemented, it's good to have a "middle range", where messages that are likely to be spam are still delivered to the user (with the opinion of the probability of spam reported in the SA score), but where handling/disposal is left to user-level decisions (either manually or by client-level filtering).

There is, of course, a point where something is so likely to be spam (e.g., a SA score above a certain threshold) that it is worth rejecting.
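As a rough sketch of the three ranges described above (deliver, deliver-but-tagged, reject), again in Python rather than anything SA-specific. Both thresholds are assumptions picked for illustration; the right values depend on your own traffic and your users' tolerance.

```python
from enum import Enum
from typing import Mapping

class Disposition(Enum):
    DELIVER = "deliver"  # low score: pass through untouched
    TAG = "tag"          # middle range: deliver, report the score to the user
    REJECT = "reject"    # so likely to be spam that rejecting is worth it

# Assumed thresholds, for illustration only.
TAG_THRESHOLD = 5.0
REJECT_THRESHOLD = 10.0

def total_score(hits: Mapping[str, float]) -> float:
    """Cumulative score: the sum of the scores of every rule that was hit."""
    return sum(hits.values())

def disposition(score: float,
                tag_at: float = TAG_THRESHOLD,
                reject_at: float = REJECT_THRESHOLD) -> Disposition:
    """Map a cumulative score onto one of the three ranges."""
    if score >= reject_at:
        return Disposition.REJECT
    if score >= tag_at:
        return Disposition.TAG
    return Disposition.DELIVER

# Hypothetical rule names and scores.
print(disposition(total_score({"RULE_A": 3.0, "RULE_B": 2.5})))
```

The point of the middle range is that no single rule decides anything; only the total does, and even then the user gets the last word.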

That isn't to say that you can't use custom-built SA rules to force rejection, but in so doing, you have to write your rules carefully, and know your traffic flows -- both the spam and the ham. Plus, test, test, test.

When I'm evaluating the possibility of a new rule, one of the things I typically do is implement the rule, assign it a token score (say, 0.1 points), and then watch mail flows for a while to see how the rule is behaving. Only after I'm confident that the rule is hitting what I want (and nothing else) do I increase the score.
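The "watch it for a while" step amounts to counting what a trial rule hits. A minimal sketch, assuming you have some hand-labelled sample of subjects to check against; the four messages below are made up purely to show the mechanics.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Trial rule carrying only a token score while it is being observed.
TRIAL_RULE = re.compile(r"\boffice\b", re.IGNORECASE)
TOKEN_SCORE = 0.1

def tally_hits(messages: Iterable[Tuple[str, str]]) -> Counter:
    """Count trial-rule hits per label.

    `messages` yields (subject, label) pairs, label being "ham" or "spam".
    """
    counts: Counter = Counter()
    for subject, label in messages:
        if TRIAL_RULE.search(subject):
            counts[label] += 1
    return counts

# Made-up sample, for illustration only.
sample = [
    ("Office party on Friday", "ham"),
    ("Cheap OEM Office software", "spam"),
    ("Quarterly report", "ham"),
    ("Out of office", "ham"),
]

hits = tally_hits(sample)
print(hits["ham"], hits["spam"])
```

If the ham count is anything other than negligible, the rule stays at its token score.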

The other tool that helps a lot when you want to be aggressive about certain kinds of spam is meta rules. Use two or three rules (or even more, especially with boolean logic) that score low, even with token scores as noted above, and use big scores only in the meta rule, which generates a hit only when several other rules are hit.

Thus, for the OP, there's nothing necessarily wrong with looking for the word "office" in a subject line, but I wouldn't score it at anything more than 0.1 points. As others in this thread have noted, it's a common word that's likely to show up frequently in non-spam. Thus, you need additional rules that you can use with that one in a meta rule, which essentially says, "If 'office' in the subject line AND <rule x> AND <rule y> then assign big score". However, if 'office' is there, but <rule x> doesn't get a hit, then you don't have enough confidence that the message is probably spam, and you don't assign the big score. In that case, if it turns out that the message really is spam, you have to go back to pattern analysis to find another pattern that matches.
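The meta-rule logic can be sketched like this, in Python rather than SA rule syntax. SUBJ_OEM and SUBJ_PRICE are hypothetical stand-ins for <rule x> and <rule y>, and the 4.0 meta score is an arbitrary "big" value; none of these are rules I'm claiming to run.

```python
import re
from typing import Set, Tuple

# Sub-rules: each carries only a token score on its own.
RULES = {
    "SUBJ_OFFICE": (re.compile(r"\boffice\b", re.IGNORECASE), 0.1),
    "SUBJ_OEM": (re.compile(r"\boem\b", re.IGNORECASE), 0.1),  # stand-in for <rule x>
    "SUBJ_PRICE": (re.compile(r"\$\s?\d+"), 0.1),              # stand-in for <rule y>
}

# The meta rule fires only when ALL of its sub-rules hit (boolean AND).
META_NAME = "META_OEM_OFFICE"
META_REQUIRES = {"SUBJ_OFFICE", "SUBJ_OEM", "SUBJ_PRICE"}
META_SCORE = 4.0  # assumed "big" score

def score_subject(subject: str) -> Tuple[float, Set[str]]:
    """Return (cumulative score, names of rules hit) for one subject line."""
    hits = {name for name, (pattern, _) in RULES.items()
            if pattern.search(subject)}
    score = sum(RULES[name][1] for name in hits)
    if META_REQUIRES <= hits:  # every required sub-rule was hit
        score += META_SCORE
        hits.add(META_NAME)
    return score, hits

print(score_subject("Office closed Monday"))
print(score_subject("OEM Office for $49"))
```

With this shape, "office" alone contributes 0.1 and nothing more; the big score shows up only when the whole combination matches.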

Smith
