Taking the trigger words, and generating a whole large set of patterns that match,
based on rules such as:
'a' => /a|A|@/
'x' => /x|X|></
Well, there's no reason to include both a and A, since it's more efficient to declare the rule case-insensitive instead, but that aside, these kinds of obfuscation matches work well. I've been working with this stuff quite a bit in antidrug.cf.
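As a rough illustration of generating such patterns from a trigger word, here is a minimal Python sketch. The substitution table and the function name obfuscation_pattern are made up for this example and are not taken from antidrug.cf; the case-insensitive flag stands in for listing both a and A.

```python
import re

# Hypothetical substitution table, for illustration only.
SUBSTITUTIONS = {
    "a": ["a", "@", "4"],
    "i": ["i", "1", "!", "|", "l"],
    "o": ["o", "0"],
    "v": ["v", "\\/"],   # backslash + slash drawn as a V
    "x": ["x", "><"],
}

def obfuscation_pattern(word):
    """Build one case-insensitive regex matching common look-alike spellings of word."""
    parts = []
    for ch in word.lower():
        alternatives = SUBSTITUTIONS.get(ch, [ch])
        # re.escape keeps characters like | and \ literal inside the alternation.
        parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
    return re.compile("".join(parts), re.IGNORECASE)

pattern = obfuscation_pattern("example")
print(bool(pattern.search("E><@MPLE")))
print(bool(pattern.search("exemplar")))
```

One regex per trigger word keeps the rule count down compared with enumerating every spelling as its own pattern.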
There's also a wide variety of "gapping" techniques used by spammers.
If, for example, you look at my antidrug ruleset, you'll see my current regex for the v-word is:
body __DRUGS_MALEDYSFUNCTION1 /(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[i1|l][_\W]{0,[EMAIL PROTECTED],3}g[_\W]{0,3}r[_\W]{0,[EMAIL PROTECTED],3}x?[_\W]{0,3}(?:\b|\s)/i
This catches most of the common substitutions for A, O, V, and I. It allows a number of gapping characters (non-word characters and _) between letters, as few as 0 and as many as 3. It also allows an optional x that some spammers tack onto the end of the word.
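The gapping idea can be sketched in Python on a harmless word. This is only an illustration under assumptions: the helper name gapped_pattern and the per-letter classes are hypothetical, and only the [_\W]{0,3} gap and the (?:\b|\s) anchors mirror the rule above.

```python
import re

# Up to three non-word or underscore characters allowed between letters.
GAP = r"[_\W]{0,3}"

def gapped_pattern(letter_classes):
    """Join per-letter regex fragments with an optional gap between each pair."""
    body = GAP.join(letter_classes)
    return re.compile(r"(?:\b|\s)" + GAP + body + GAP + r"(?:\b|\s)", re.IGNORECASE)

# Hypothetical classes for the word "hello".
pattern = gapped_pattern(["h", "[e3]", "[l1|]", "[l1|]", "[o0]"])

for text in ["say h.e_l l0 now", "h-e-l-l-o", "help me", "othello"]:
    print(text, "->", bool(pattern.search(text)))
```

The boundary anchors are what keep the pattern from firing inside a longer word, which matters once the gap class makes the match this loose.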
However, a little caution is needed before applying this technique comprehensively everywhere. The large number of match combinations can lend itself to false positives (FPs). I've been slowly propagating these features throughout the antidrug ruleset.
You could also introduce some Hamming Distance effects to the match, so that, say, hamming_match( /hello/ , 1 ) would match "helldo", "helo", and "hillo". And then there's the possibility of doing phonetic matching, like many spellcheckers.
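A small Python sketch of the distance idea follows. Note that strict Hamming distance only compares strings of equal length, so it would cover "hillo" but not "helldo" (an insertion) or "helo" (a deletion); matching all three examples needs an edit (Levenshtein) distance, which is what this sketch swaps in. The names edit_distance and fuzzy_match are hypothetical, and this compares whole words rather than regexes as hamming_match( /hello/ , 1 ) would.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_match(word, candidate, max_distance=1):
    """True if candidate is within max_distance edits of word, ignoring case."""
    return edit_distance(word.lower(), candidate.lower()) <= max_distance

for candidate in ["helldo", "helo", "hillo", "world"]:
    print(candidate, fuzzy_match("hello", candidate))
```

Short words sit within one edit of many ordinary words, so a distance threshold like this would need the same caution about false positives as the gapping rules.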
There's been a fair amount of talk about doing things like this in the past. I've never seen anyone propose hamming distances or phonetic methods specifically, but many similar ideas have come by.
One thing that needs to be kept in mind is that SA is already quite effective against mis-spelling tactics. Mis-spellings are VERY easy targets for any Bayes system; you just have to be up to speed on your training.
Some similar proposals from the past:
Pure spell checking was shot down after testing (not the same as what you proposed, but it is related):
http://bugzilla.spamassassin.org/show_bug.cgi?id=2868
Soundex matching has been proposed, but it is too unreliable and too slow, and it offers little advantage that Bayes doesn't already provide:
http://bugzilla.spamassassin.org/show_bug.cgi?id=1407
