RE: possibly a dumb comment, apologies if I'm being a n00b

Hackworth, Keith A 10 Mar 2004 18:12:09 -0000

I've worked on a rule like this and it's working great for me.  The only
problem is that I had to modify the EvalTests.pm file that came with SA,
so I'll have to reapply the change whenever I upgrade.  The rule only
applies to the subject.  I haven't officially tested it, but it's running
in my production environment because it has worked so well.  I'd
appreciate any comments on this, besides me not officially testing it
first 8+), and would REALLY appreciate it if someone could test it.


This rule is very hard to explain, so please bear with me...  once you
get it working (and understand it), it works REALLY well!

The update to EvalTests.pm allows me to catch all [EMAIL PROTECTED] l|k3 th1s
in the subject.  I would run it on the body, but I'm afraid it'll eat up
too many system resources.  It knows some easily translated characters
that are often used to "hide" words (!->I, 3->e, l->I, |->I or l, @->a,
etc.).

My change requires a control file (a list of words that are commonly
altered like "[EMAIL PROTECTED]@x") that has 2 columns, <tab> delimited.  It
includes the usual "spammy" stuff.  There's one caveat to the control
file that's important to remember when adding new words: when the word
includes an I or an L, it has to have a | (pipe) in its place in the 1st
column, so the word "like" would be "||ke" in my file.  This is because
l (lower L) is often substituted for I, and | can be used for
either I or L - I like to catch these too.
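Since the first column has to be written by hand with pipes in place of every I and L, a small helper can generate it from the plain word. This is only a sketch: badwords_key is a hypothetical name, not part of SpamAssassin or of the patch in this message.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a plain word into the form expected in the
# 1st column of the control file (lowercase, i and l replaced by a pipe).
sub badwords_key {
    my ($word) = @_;
    my $key = lc $word;
    $key =~ tr/il/|/;    # the final replacement char is reused for both i and l
    return $key;
}

print badwords_key($_), "\n" for qw(like email Valium debt);
```

For the four sample words this prints ||ke, ema||, va||um and debt, one per line.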

The control file:
Examples:
v|agra
free    free
||m|ted
d|p|oma
ema||   email
d|et|ng dieting
debt    debt

  The first column is required - it holds what the word looks like
after translation (L & I -> | (pipe)).  The 2nd column is optional: it
gives the one spelling (case insensitive) that is let through, so d3bt
is blocked but debt is not.  If the 2nd column is missing, the word is
blocked no matter how it was spelled (like [EMAIL PROTECTED]@).  I've attached
my "badwords" file (put it in /etc/mail/spamassassin and remove the .txt
extension from the name).
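To make the meaning of the two columns concrete, here is a rough sketch that parses a few sample control lines and reports which spellings would be let through. The sample lines and the is_blocked helper are assumptions made for illustration; the real file is read by the code further down.

```perl
use strict;
use warnings;

# Sample control lines in the tab-delimited format described above.
my @lines = ("v|agra", "debt\tdebt", "ema||\temail");

my %bads;
for my $line (@lines) {
    my ($key, $allowed) = split /\t/, lc $line;
    $bads{$key} = $allowed;    # undef means no spelling is allowed
}

# Hypothetical helper: is this (original word, translated word) pair blocked?
sub is_blocked {
    my ($orig, $translated) = @_;
    return 0 unless exists $bads{$translated};    # not a listed word
    my $allowed = $bads{$translated};
    return (defined $allowed && lc($orig) eq $allowed) ? 0 : 1;
}

for my $pair (["debt", "debt"], ["d3bt", "debt"],
              ["viagra", "v|agra"], ["hello", "he||o"]) {
    printf "%-8s %s\n", $pair->[0], is_blocked(@$pair) ? "blocked" : "allowed";
}
```

Here debt and hello come out as allowed, while d3bt and viagra are blocked: the first because it differs from the permitted spelling, the second because its line has no 2nd column.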

First, I had to make a change to my local.cf file:
header EASY_TRANS       eval:check_for_easy_trans()
describe EASY_TRANS     Character translations made a known bad word
score EASY_TRANS        20.0


Here's the code change:

Find the EvalTests.pm file in the perl libraries and make the following
changes (my file is
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/EvalTests.pm):
 
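# Find this block (around line 22) and add %BADS as the last entry.
# Don't copy any comments inside the qw{} list - everything in there
# is treated as a variable name.
use vars qw{
  $IP_ADDRESS $IPV4_ADDRESS
  $CCTLDS_WITH_LOTS_OF_OPEN_RELAYS
  $ROUND_THE_WORLD_RELAYERS
  $WORD_OBFUSCATION_CHARS 
  $CHARSETS_LIKELY_TO_FP_AS_CAPS
  %BADS
};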

# add from here down...

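# Load the control file once, when the module is compiled.
if (open(my $badwords, '<', '/etc/mail/spamassassin/badwords')) {
      while (my $wordline = <$badwords>) {
            chomp($wordline);
            next if $wordline =~ /^\s*$/;      # skip blank lines
            my ($word, $proper) = split(/\t/, lc($wordline));
            # No 2nd column: no spelling is allowed, so store a plain true value.
            $proper = 1 if (!defined $proper);
            $BADS{$word} = $proper;
      }
      close($badwords);
} else {
      warn "EASY_TRANS: cannot open badwords file: $!\n";
}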
 
sub check_for_easy_trans {
      my ($self) = @_;
      my $subject = lc $self->get ('Subject');
      chomp($subject);
      foreach my $word (split(/\s+/, $subject)) {
            my $origword = $word;
            # Undo the common character substitutions.
            $word =~ s/5/s/g;
            $word =~ s/3/e/g;
            $word =~ s/0/o/g;
            $word =~ s/9/g/g;
            $word =~ s/\@/a/g;
            $word =~ s/\(\)/o/g;
            $word =~ s/\+/t/g;
            $word =~ s/\$/s/g;
            $word =~ s/6/g/g;
            $word =~ s/[il\!1]/\|/g;      # i, l, ! and 1 all become a pipe
            my $ok_spelling = $BADS{$word};
            next unless defined $ok_spelling;      # not a known bad word
            # Hit unless the word was spelled exactly as the 2nd column allows.
            return 1 if ($origword ne $ok_spelling);
      }
      return 0;
}
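
For anyone who wants to try the logic without touching EvalTests.pm, here is a rough standalone sketch. It applies the same substitutions to each word of a subject string and looks the result up in a small inline table. The table contents and the names translate and subject_hits are assumptions for this sketch; it does not use any SpamAssassin API.

```perl
use strict;
use warnings;

# Stand-in for the table loaded from the badwords file (assumed sample).
my %BADS = ('v|agra' => 1, 'free' => 'free', 'ema||' => 'email');

# Single-character substitutions from check_for_easy_trans.
my %MAP = ('5' => 's', '3' => 'e', '0' => 'o', '9' => 'g',
           '@' => 'a', '+' => 't', '$' => 's', '6' => 'g');

# Apply the substitutions to one word.
sub translate {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/\(\)/o/g;                     # () looks like o
    $word =~ s/([5309\@+\$6])/$MAP{$1}/g;
    $word =~ s/[il!1]/|/g;                   # i, l, ! and 1 become a pipe
    return $word;
}

# Returns 1 if any word translates to a listed word and was not spelled
# exactly as the table allows, 0 otherwise.
sub subject_hits {
    my ($subject) = @_;
    for my $orig (split /\s+/, lc $subject) {
        my $allowed = $BADS{ translate($orig) };
        next unless defined $allowed;
        return 1 if $orig ne $allowed;
    }
    return 0;
}

for my $s ('Get fr33 em@il', 'free email offer', 'cheap V1agra') {
    printf "%-20s => %d\n", $s, subject_hits($s);
}
```

With this sample table the first and third subjects hit and the second does not. One thing a test like this makes visible: punctuation stuck to a word (say a trailing "!") is translated along with the rest of the word, so such words will not match the table.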

 
Thanks,
Keith Hackworth




-----Original Message-----
From: R Michael Harman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 10, 2004 12:29 PM
To: [EMAIL PROTECTED]
Subject: possibly a dumb comment, apologies if I'm being a n00b

So, I've been looking over the types of messages that SA is missing, on
my mail stream...  It seems like most of them still contain trigger words
that would cause high scores, they're just slightly masked.  Has anyone
talked about applying some kind of fuzzy-matching techniques?  Taking
the trigger words, and generating a whole large set of patterns that
match, based on rules such as:

'a' => /(a|A|@)/
'x' => /(x|X|><)/
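
As a sketch of how such per-letter rules could be expanded into a pattern for a whole trigger word: the %ALTS table and the fuzzy_pattern name below are made up for illustration, and a real table would presumably be derived from a spam corpus.

```perl
use strict;
use warnings;

# Assumed substitution table: each letter maps to the forms it may take.
my %ALTS = (
    a => ['a', '@', '4'],
    e => ['e', '3'],
    i => ['i', '1', '!', '|', 'l'],
    x => ['x', '><'],
);

# Build a case-insensitive regex that matches a whole token if it is the
# trigger word or one of its masked forms.
sub fuzzy_pattern {
    my ($word) = @_;
    my $re = join '', map {
        my $alts = $ALTS{$_} || [$_];
        '(?:' . join('|', map { quotemeta($_) } @$alts) . ')';
    } split //, lc $word;
    return qr/\A$re\z/i;
}

my $pat = fuzzy_pattern('xanax');
for my $token ('Xanax', '><@n@x', 'x4nax', 'xanadu') {
    printf "%-8s %s\n", $token, $token =~ $pat ? "match" : "no match";
}
```

The pattern is anchored to the whole token rather than using \b, because \b does not behave as a word edge next to characters such as @ or >.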

You might even be able to use a large corpus of spam to automatically
derive these rules.  (A corpus of parsed-out and "translated" tokens
would work better, obviously.)

You could also introduce some Hamming Distance effects to the match, so
that, say hamming_match( /hello/ , 1 ) would match "helldo", "helo", and
"hillo".  And then there's the possibility of doing phonetic matching,
like many spellcheckers.
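
A note on the distance idea: strict Hamming distance is only defined for strings of equal length, so matching "helldo" and "helo" against "hello" implies an edit (Levenshtein) distance that also counts insertions and deletions. A minimal sketch of that measure follows; the edit_distance name is just a label chosen here, not an existing Perl or SpamAssassin function.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance: the minimum number of single-character insertions,
# deletions and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));    # distances from the empty prefix of $s
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,             # deletion
                           $cur[$j - 1] + 1,          # insertion
                           $prev[$j - 1] + $cost);    # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

for my $candidate (qw(helldo helo hillo help)) {
    printf "%-7s %d\n", $candidate, edit_distance('hello', $candidate);
}
```

The three examples from the message each come out at distance 1 from "hello", while "help" is at distance 2, so a threshold of 1 would accept the first three and reject the last.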

Using any/all of this stuff would be pretty processor intensive --
probably much more practical for ISPs than for users -- but it seems
like it'd kill off almost all of the new crop of SA-evading spam.  Maybe
somebody could lure Larry Wall into building this kind of fuzzy-match
technology directly into the next major version of Perl? *g*

Just thought I'd throw that out there.  Aside from that, I'll probably
lurk for a while; if I end up feeling out of my depth (which is possible
-- my actual day job is as a linguist, and most of my coding skills,
such as they are, are aimed at that) I'll unsub.

Thanks,
Auros

------------------------------------------------------------------------
R Michael Harman / Auros Symtheos
[EMAIL PROTECTED] ............ http://www.auros.org/

Linguist and Eclectic Engineer, Lexicus, Motorola
[EMAIL PROTECTED] ......... http://www.lexicus.mot.com/

Senior Reviews Editor, Strange Horizons Speculative Fiction Weekly
[EMAIL PROTECTED] ... http://www.strangehorizons.com/

Attached "badwords" file:

v|agra
free    free
offer
||m|ted
d|p|oma
c|asses
xanax
pound   pound
ty|eno|
ty|eno|e
youth   youth
sex
amb|en
ema||   email
younger
fee|    feel
d|et    diet
d|et|ng dieting
orgasm
banned
sexua|
c||max
debt    debt
nude
naked
va||um
|ook    look
spr|ng  spring
age|ng
cheat   cheat
we|ght  weight
now     now
|oose   loose
w|||    will
soon    soon
|mproved        improved
porn
