On 10/20/11 8:24 PM, Adam Katz wrote:
> On 10/19/2011 04:43 AM, Mynabbler wrote:
>> You are kidding, right? 50% of this crap comes from FREEMAIL
>> addresses, and even more specific: 44% of this crap is delivered by
>> aol.com. The aol deliveries have about 85% unique from@aol
>> addresses, so they pretty much 'own' aol.
>
> We're writing spam filters, not idiot filters. The fact that there is
> so much overlap is often useful, but the overlap is not complete. There
> is also a decent amount of overlap between the
> mostly-computer-illiterate and freemail users. I think this drives your
> current line of thinking.
>
> There are a lot of people that do very spammy things. It is a testament
> to SA and other filters that such non-spam doesn't so commonly flag as spam.
>

Sorry to come to the party late on this; I was traveling a bit.

It seems to me that if you have lines like:

  Subject: T R +A N/N!l :ES, P \0 R N
  Subject: S C/H ,O 0=LG)l :R$L$S ) P -0 RN

then the solution is to use agrep. Make deletions of punctuation very
low cost, and make the usual transformations, like:

  0 => O
  1 => l
  $ => S
  ...

low-cost as well. (Of course, then you end up with the possibility of a
clash between deleting $ and replacing it with 'S', but agrep is good
about checking both.) Then you just grep through a dictionary of the
"usual offenders":

  lesbian
  cash
  meds
  porn
  ...

I'm not familiar with perl-String-Approx, but reading up on it, it uses
Levenshtein distances just like agrep does, so it would be ideal for
doing approximate matches.

http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

-Philip
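
For illustration only, here is a rough Python sketch of the weighted-distance idea above. It is not agrep or String::Approx, and it does not use their options; it is a small hand-rolled approximate substring match. The cost values (0.1 and 1.0), the look-alike table, the threshold, and all function names are assumptions chosen for the example.

```python
import string

# Assumed look-alike table: obfuscated character -> letter it imitates.
LOOKALIKES = {"0": "o", "1": "l", "$": "s"}

CHEAP = 0.1  # assumed cost for "very low cost" edits
FULL = 1.0   # cost of an ordinary edit

# Characters that may be dropped from the subject cheaply.
SKIPPABLE = set(string.punctuation) | set(string.whitespace)


def skip_cost(ch):
    """Cost of dropping one character of the subject text."""
    return CHEAP if ch in SKIPPABLE else FULL


def sub_cost(word_ch, text_ch):
    """Cost of reading text_ch as word_ch."""
    if word_ch == text_ch:
        return 0.0
    if LOOKALIKES.get(text_ch) == word_ch:
        return CHEAP
    return FULL


def best_match_cost(word, text):
    """Lowest weighted edit cost of `word` against any substring of `text`."""
    word, text = word.lower(), text.lower()
    # An empty word prefix matches at every position for free, which is
    # what lets the match start anywhere in the text.
    prev = [0.0] * (len(text) + 1)
    for i, wc in enumerate(word, start=1):
        cur = [i * FULL]  # word letters with no text left to match
        for j, tc in enumerate(text, start=1):
            cur.append(min(
                prev[j - 1] + sub_cost(wc, tc),  # match or substitute
                prev[j] + FULL,                  # word letter missing from text
                cur[j - 1] + skip_cost(tc),      # drop a text character
            ))
        prev = cur
    return min(prev)  # the match may end anywhere in the text


def flagged_words(subject, words, threshold=1.0):
    """Return the dictionary words found in `subject` within `threshold`."""
    return [w for w in words if best_match_cost(w, subject) <= threshold]


if __name__ == "__main__":
    offenders = ["lesbian", "cash", "meds", "porn"]
    print(flagged_words(r"T R +A N/N!l :ES, P \0 R N", offenders))
```

Because "$" is both skippable and a look-alike for "s", the minimum over the three moves considers deleting it and replacing it, and keeps whichever is cheaper for a given word. The threshold would need tuning against real mail before being trusted.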