mouss wrote:
...
The approach is flawed. A single word shouldn't be enough to tag mail as
spam.
Furthermore, even checking for word boundaries may not help much against
the OEM spammers. Several of them obfuscate quite heavily to get past
the kind of simple filtering that the OP is asking for. One example
that I'm seeing right now is "Office2OO8": there is no word boundary
between "Office" and the year, and the numeric zeros are replaced with
the letter "O".
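To make the obfuscation point concrete, here is a rough Python sketch of
a pattern that tolerates both tricks (no separator, and letter "O"
standing in for zero). The pattern and the function name are my own
illustration, not an actual SpamAssassin rule, and a pattern like this
is still only one signal among several.

```python
import re

# Hypothetical pattern: "office", then optional non-alphanumeric separators,
# then a four-character year starting with 2 in which the letter O may
# stand in for zero. There is deliberately no \b, since the spam runs the
# words together.
OBFUSCATED_OFFICE = re.compile(r"office[\W_]*2[0o][0-9o]{2}", re.IGNORECASE)


def looks_like_obfuscated_office(subject: str) -> bool:
    """Return True if the subject contains an 'Office 20xx' style product name."""
    return OBFUSCATED_OFFICE.search(subject) is not None
```

A plain subject such as "Out of office until Monday" does not match,
because nothing resembling a year follows the word.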
Remember that the approach of SpamAssassin is not to write single rules
that force a big-bang reject whenever a particular rule is hit (although
I have a few of those, because my rules are well-crafted, and I know
my servers' traffic flows).
The general approach is with cumulative scores -- the higher the score,
the more likely the message is to be spam. However, a relatively high
score doesn't necessarily mean that the message is spam, and a low score
doesn't mean that the message isn't spam. Spam-fighting at this level
is as much art as science, and the spammers are a moving target who go
to great lengths to make their stuff indistinguishable from legitimate mail.
Thus, in evaluating how aggressive you are in fighting spam, a lot comes
down to your (and your users') tolerance for problems with accidental
rejection of something that somebody wants. This is part of the
approach of SpamAssassin, in that the score returned is a best-guess
opinion, based on which rules were hit and what they add up to.
Depending on how you have SA implemented, it's good to have a "middle
range": messages that are likely to be spam are still delivered to the
user (with the estimated likelihood of spam reported in the SA score),
and handling/disposal is left to user-level decisions (either manual
or by client-level filtering).
There is, of course, a point where something is so likely to be spam
(e.g., an SA score above a certain threshold) that it is worth rejecting.
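As a toy model of the cumulative score and the three ranges (deliver,
deliver with a spam tag, reject), something like the following Python
sketch captures the idea. The rule names, scores, and both thresholds
are invented for illustration; where you put the cut-offs depends
entirely on your own traffic and your users' tolerance.

```python
from typing import Iterable

# Hypothetical rule scores and thresholds, for illustration only.
RULE_SCORES = {
    "SUBJ_OFFICE": 0.1,
    "OBFUSCATED_YEAR": 2.5,
    "OEM_PRICE_LIST": 3.0,
    "KNOWN_BAD_RELAY": 6.0,
}
TAG_THRESHOLD = 5.0      # at or above this: deliver, but tagged as likely spam
REJECT_THRESHOLD = 10.0  # at or above this: reject outright


def total_score(hits: Iterable[str]) -> float:
    """Add up the scores of every rule that hit."""
    return sum(RULE_SCORES[name] for name in hits)


def disposition(score: float) -> str:
    """Map a cumulative score onto one of the three ranges."""
    if score >= REJECT_THRESHOLD:
        return "reject"
    if score >= TAG_THRESHOLD:
        return "deliver-tagged"
    return "deliver"
```

In this model a lone hit on the "office" rule stays far below either
threshold, which is the whole point of not letting one word decide.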
None of this is to say that you can't use custom-built SA rules to
force rejection, but if you do, you have to write your rules carefully
and know your traffic flows -- both the spam and the ham. Plus, test,
test, test.
When I'm evaluating the possibility of a new rule, one of the things I
typically do is implement the rule, assign it a token score (say, 0.1
points), and then watch mail flows for a while to see how the rule is
behaving. Only after I'm confident that the rule is hitting what I want
(and nothing else) do I increase the score.
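If you keep a hand-sorted set of spam and ham around, the "watch how it
behaves" step can be approximated offline. The Python sketch below just
counts how often a candidate pattern hits each pile; the function name
and the four sample subjects are made up for illustration and are no
substitute for watching real mail flows.

```python
import re
from typing import Iterable, Tuple


def hit_counts(pattern: str, samples: Iterable[Tuple[str, bool]]) -> Tuple[int, int]:
    """Return (spam_hits, ham_hits) for a candidate subject pattern.

    Each sample is a (subject, is_spam) pair from a hand-labeled corpus.
    """
    rule = re.compile(pattern, re.IGNORECASE)
    spam_hits = ham_hits = 0
    for subject, is_spam in samples:
        if rule.search(subject):
            if is_spam:
                spam_hits += 1
            else:
                ham_hits += 1
    return spam_hits, ham_hits


# Invented sample subjects, purely to show the call.
SAMPLES = [
    ("Cheap Office2OO8 OEM license", True),
    ("Office 2008 download, 90% off", True),
    ("Out of office until Monday", False),
    ("New office address", False),
]

broad = hit_counts(r"office", SAMPLES)
narrow = hit_counts(r"office[\W_]*2[0o][0-9o]{2}", SAMPLES)
```

Any ham hits at all are the signal to leave the rule at its token score
and keep refining the pattern.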
The other tool that helps a lot when you want to be aggressive about
certain kinds of spam is meta rules. Use two or three rules (or even
more, combined with boolean logic) that each score low, even just the
token scores noted above, and put the big score only on the meta rule,
which hits only when several of the other rules hit.
Thus, for the OP, there's nothing necessarily wrong with looking for the
word "office" in a subject line, but I wouldn't score it at more than
0.1 points. As others in this thread have noted, it's a common word
that's likely to show up frequently in non-spam. Thus, you need
additional rules that you can use with that one in a meta rule, which
essentially says, "If 'office' in the subject line AND <rule x> AND
<rule y> then assign big score". However, if 'office' is there, but
<rule x> doesn't get a hit, then you don't have enough confidence that
the message is probably spam, and you don't assign the big score. In
that case, if it turns out that the message really is spam, you have to
go back to pattern analysis to find another pattern that matches.
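The meta-rule logic described above can be sketched in a few lines of
Python. This is only a model of the idea, not SpamAssassin's own rule
syntax or engine; the sub-rule names, the patterns standing in for
<rule x> and <rule y>, and the scores are all hypothetical.

```python
import re

# Hypothetical sub-rules; each one is worth only a token score by itself.
SUB_RULES = {
    "SUBJ_OFFICE": re.compile(r"office", re.IGNORECASE),
    "SUBJ_OEM": re.compile(r"\boem\b", re.IGNORECASE),
    "SUBJ_DISCOUNT": re.compile(r"\d{2}\s*%\s*off", re.IGNORECASE),
}
TOKEN_SCORE = 0.1  # score for each individual sub-rule hit
META_SCORE = 4.0   # big score, added only when every sub-rule hits


def score_subject(subject: str) -> float:
    """Score a subject line: token scores per hit, plus the meta score on a full match."""
    hits = {name for name, rule in SUB_RULES.items() if rule.search(subject)}
    score = TOKEN_SCORE * len(hits)
    if hits == set(SUB_RULES):  # 'office' AND <rule x> AND <rule y>
        score += META_SCORE
    return round(score, 2)
```

With these numbers, a subject that only contains "office" scores 0.1,
while one that trips all three sub-rules picks up the meta score as well.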
Smith