On Fri, 2004-02-06 at 08:58, Loren Wilton wrote:
> Probably should also replace the obvious numeric and special characters like 
> zer0, thr33, f|ve, $even, etc. while you are at it.
<snip>
> On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
> > 
> > I use the following (we get foreign email, but since we only understand
> > English, we expect all subject headings to be in English):
> > 
> > header    RM_sl_ForeignChar      Subject =~ /\w[����]\w/
> > ...
<snip>
> all "����" to "aeou" and *then* apply SA rules to them.

If you're interested in doing these transformations, you might want to
have a look-see at CMOScript.  I've been attacking this sort of problem
from the other side; not "translating" characters in advance, but
matching the untranslated word against a Regexp with the translations
inside it.  The idea is similar though, and I have a list of
translations you might want to use as a starting point (e.g. 'b' => ['b',
'8', '\\xDF']).  The list is by no means authoritative or complete, but
it should be a good place to start.  Grab obfu.pl from
http://sandgnat.com/cmos/.
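
To make the "translations inside the Regexp" idea concrete, here is a
minimal Python sketch.  It is not the code from obfu.pl: only the 'b'
entry comes from the list mentioned above, and the other table entries,
the TRANSLATIONS name, and the obfu_pattern() helper are assumptions made
up for illustration.

```python
import re

# Hypothetical translation table. Only the 'b' entry is taken from the
# message; the rest are illustrative guesses, not the obfu.pl list.
TRANSLATIONS = {
    "b": ["b", "8", "\xdf"],
    "i": ["i", "1", "|", "!"],
    "l": ["l", "1", "|"],
    "o": ["o", "0"],
}

def obfu_pattern(word):
    """Build a regex that matches `word` or an obfuscated spelling of it."""
    parts = []
    for ch in word.lower():
        # Letters without a table entry only match themselves.
        alternatives = TRANSLATIONS.get(ch, [ch])
        parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
    return re.compile("".join(parts), re.IGNORECASE)

pattern = obfu_pattern("bill")
```

With this table, pattern.search("8|ll") and pattern.search("B1LL") both
match, while pattern.search("ball") does not.  Every letter expands into
an alternation, which is why the generated regexps get large so quickly.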

Also, I haven't done much to update CMOScript lately, but my plan has
been to move towards a pre-translator methodology once the SA 2.70/3.0
plugins interface is released.  Pre-transforming should help reduce
processing time (CMOScript regexps are HUGE) and should allow for more
re-use.

There are disadvantages to the pre-translate method, however.  One
example is the character "|", which could be either an obfu "I" or an
obfu "L".  How would you choose to translate that character?  The same
goes for "*", "I", "l".  Another possible disadvantage is that it's not
as easy to translate obfu character sequences such as: "m" => "rn" or
"N" => "|\|".   I haven't yet come up with a good way to do
pre-transformation and still match these obfu types in a clean manner.
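
The trade-offs above can be seen in a small Python sketch of a
pre-translator.  This is only an illustration under assumptions: the
PRETRANSLATE table and the pretranslate() function are hypothetical, and
the choice to map "|" to "i" is arbitrary, which is exactly the problem.

```python
# Hypothetical pre-translation table. Each obfuscated token maps to exactly
# one plain letter, so "|" must be pinned to either "i" or "l" up front.
PRETRANSLATE = {
    "|\\|": "n",   # multi-character sequence for "N"
    "rn": "m",     # multi-character sequence for "m"
    "0": "o",
    "3": "e",
    "$": "s",
    "|": "i",      # arbitrary choice; "l" would be equally defensible
}

def pretranslate(text):
    """Replace obfuscated tokens left to right, trying longer tokens first."""
    tokens = sorted(PRETRANSLATE, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for tok in tokens:
            if text.startswith(tok, i):
                out.append(PRETRANSLATE[tok])
                i += len(tok)
                break
        else:
            # No token matches here; keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Here pretranslate("f|ve") gives "five" and pretranslate("|\\|ow") gives
"now", but pretranslate("p|ease") gives "piease" because "|" was pinned
to "i", and the legitimate word "modern" comes out as "modem" because of
the "rn" rule.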

OK, this was probably way off-topic and more discussion than you were
looking for.  Oh well.

-- 
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/
