On Fri, 2004-02-06 at 08:58, Loren Wilton wrote:
> Probably should also replace the obvious numeric and special characters like
> zer0, thr33, f|ve, $even, etc. while you are at it.
<snip>
> On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
> >
> > I use the following (we get foreign email, but since we only understand
> > English, we expect all subject headings to be in English):
> >
> > header RM_sl_ForeignChar Subject =~ /\w[����]\w/
> > ...
<snip>
> all "����" to "aeou" and *then* apply SA rules to them.

If you're interested in doing these transformations, you might want to
have a look-see at CMOScript. I've been attacking this sort of problem
from the other side: not "translating" characters in advance, but
matching the untranslated word against a regexp with the translations
inside it. The idea is similar though, and I have a list of
translations you might want to use as a starting point
(e.g. 'b' => ['b', '8', '\\xDF']). The list is by no means
authoritative or complete, but it should be a good place to start.
Grab obfu.pl from http://sandgnat.com/cmos/.

Also, I haven't done much to update CMOScript lately, but my plan has
been to move towards a pre-translation approach once the SA 2.70/3.0
plugin interface is released. Pre-transforming should help reduce
processing time (CMOScript regexps are HUGE) and should allow for more
reuse.

There are disadvantages to the pre-translate method, however. One
example is the character "|", which could be either an obfu "I" or an
obfu "L". How would you choose to translate that character? The same
goes for "*", "I", "l". Another possible disadvantage is that it's not
as easy to translate obfu character sequences such as "m" => "rn" or
"N" => "|\|". I haven't yet come up with a good way to do
pre-transformation and still match these obfu types in a clean manner.

OK, this was probably way off-topic and more discussion than you were
looking for. Oh well.

-- 
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/
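
P.S. To make the "regexp with the translations inside it" idea a bit
more concrete, here is a rough sketch. It is in Python rather than the
Perl of obfu.pl, and it is not the CMOScript code. The table
TRANSLATIONS and the function obfu_pattern are names made up for this
example; only the 'b' entry comes from the list mentioned above, and
the other entries are guesses for illustration.

```python
import re

# Hypothetical translation table: each plain letter maps to the forms it
# might take in obfuscated text. Only the 'b' entry is taken from the
# post; every other entry is an illustrative assumption.
TRANSLATIONS = {
    "b": ["b", "8", "\xdf"],
    "e": ["e", "3"],
    "i": ["i", "1", "|", "!"],
    "l": ["l", "1", "|"],
    "m": ["m", "rn"],
    "n": ["n", "|\\|"],
    "o": ["o", "0"],
    "s": ["s", "$", "5"],
}


def obfu_pattern(word):
    """Compile a regex matching `word` and its obfuscated spellings."""
    parts = []
    for ch in word.lower():
        forms = TRANSLATIONS.get(ch, [ch])
        # Put longer forms first so sequences like "rn" are tried
        # before single characters.
        alts = sorted((re.escape(f) for f in forms), key=len, reverse=True)
        parts.append("(?:" + "|".join(alts) + ")")
    # \b is unreliable next to symbols such as "$" or "|", so explicit
    # lookarounds on letters and digits are used as word boundaries.
    body = "".join(parts)
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])",
                      re.IGNORECASE)


if __name__ == "__main__":
    pattern = obfu_pattern("mobile")
    print(bool(pattern.search("get a rn0b1|e phone")))
    print(pattern.pattern)
```

Since "|" appears under both "i" and "l", the ambiguity never has to be
resolved, and multi-character forms such as "rn" are just one more
alternative. The price is one alternation group per letter, which is
why the patterns get so large.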

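A pre-translation pass of the kind discussed above might be sketched
like this, again in Python and again only as an illustration. The
tables SINGLE and SEQUENCES and the function pretranslate are
hypothetical names, and the mappings are assumptions, not the contents
of obfu.pl.

```python
# Hypothetical one-to-one table (illustrative only). Each obfuscated
# character gets exactly one plain letter, so "|" has to be pinned to
# either "i" or "l"; "i" is an arbitrary choice here.
SINGLE = {"0": "o", "3": "e", "$": "s", "|": "i"}

# Hypothetical multi-character sequences, replaced before single
# characters so that "|\|" is not split into separate "|" characters.
SEQUENCES = {"|\\|": "n"}


def pretranslate(text):
    """Return `text` lowercased with obfuscated characters replaced."""
    text = text.lower()
    for seq in sorted(SEQUENCES, key=len, reverse=True):
        text = text.replace(seq, SEQUENCES[seq])
    return "".join(SINGLE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(pretranslate("zer0 thr33 f|ve $even"))
    print(pretranslate("b|ue pi||s"))
```

The second call shows the problem with "|": with the table above,
"b|ue pi||s" comes out as "biue piiis", not "blue pills". A sequence
such as "rn" => "m" is left out on purpose, since replacing it blindly
would also rewrite ordinary words that contain "rn".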