On 10/20/11 8:24 PM, Adam Katz wrote:
> On 10/19/2011 04:43 AM, Mynabbler wrote:
>> You are kidding, right? 50% of this crap comes from FREEMAIL
>> addresses, and even more specific: 44% of this crap is delivered by
>> aol.com.  The aol deliveries have about 85% unique from@aol
>> addresses, so they pretty much 'own' aol.
> 
> We're writing spam filters, not idiot filters.  The fact that there is
> so much overlap is often useful, but the overlap is not complete.  There
> is also a decent amount of overlap between the
> mostly-computer-illiterate and freemail users.  I think this drives your
> current line of thinking.
> 
> There are a lot of people that do very spammy things.  It is a testament
> to SA and other filters that such non-spam doesn't so commonly flag as spam.
> 

Sorry to come to the party late on this, was traveling a bit.

It seems to me that if you have lines like:

Subject: T R +A N/N!l :ES,  P \0 R  N
Subject: S C/H ,O 0=LG)l :R$L$S ) P -0 RN

Then the solution is to use agrep.  Make deletions of punctuation very low 
cost, as well as the usual transformations like:

0 => O
1 => l
$ => S
...

(Of course, you then end up with a possible clash between deleting $ and 
replacing it with 'S', but agrep is good about checking both.)  Then you just 
grep through a dictionary of the "usual offenders":

lesbian
cash
meds
porn
...
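
The weighted matching described above can be sketched in a few lines.  This 
is not agrep or String::Approx, just a plain dynamic-programming approximate 
substring search in Python.  The cost values, the threshold, and the names 
(best_match_cost, flagged_words, CHEAP, FULL) are all assumptions made for 
illustration.  Taking the minimum over every edit at each step is what handles 
the $ clash: both "delete it" and "read it as S" are tried.

```python
# Hypothetical weights: none of these values come from agrep or String::Approx.
CHEAP = 1    # assumed cost of deleting punctuation or swapping a look-alike
FULL = 10    # assumed cost of any other edit
LOOKALIKES = {("0", "o"), ("1", "l"), ("$", "s")}  # (seen in text, meant in word)
USUAL_OFFENDERS = ["lesbian", "cash", "meds", "porn"]
THRESHOLD = 8  # hypothetical cut-off, just below the cost of one full edit


def sub_cost(text_ch, word_ch):
    """Cost of reading text_ch where the dictionary word has word_ch."""
    if text_ch == word_ch:
        return 0
    return CHEAP if (text_ch, word_ch) in LOOKALIKES else FULL


def skip_cost(text_ch):
    """Cost of deleting one character from the subject line."""
    return FULL if text_ch.isalnum() else CHEAP


def best_match_cost(word, text):
    """Lowest weighted edit cost of finding word somewhere inside text."""
    word, text = word.lower(), text.lower()
    # prev[j]: cheapest way to align word[:j] with a stretch of text
    # ending at the previous character.
    prev = [j * FULL for j in range(len(word) + 1)]
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may start at any position in the text
        for j, wch in enumerate(word, start=1):
            cur.append(min(
                prev[j - 1] + sub_cost(ch, wch),  # match or substitute
                prev[j] + skip_cost(ch),          # drop ch from the text
                cur[j - 1] + FULL,                # word letter missing from text
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


def flagged_words(subject, words=USUAL_OFFENDERS, threshold=THRESHOLD):
    """Dictionary words that match the subject for less than threshold."""
    return [w for w in words if best_match_cost(w, subject) < threshold]
```

With these assumed weights, the two sample subjects above match "porn" at 
costs 6 and 4 respectively, since every edit needed is a cheap one, while a 
subject like "Quarterly report attached" costs 10 because a real letter has to 
change.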

I'm not familiar with perl-String-Approx, but reading up on it, it uses 
Levenshtein distance just as agrep does, so it should be well suited to 
approximate matches.

http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

-Philip
