Matt Kettler wrote:

>Gustafson, Tim wrote:
>
>>Hello
>>
>>One thing I've noticed about almost ALL spam that gets through at this
>>point is that they have a LOT of misspelled (and obfuscated) words.
>>
>>Could SpamAssassin benefit from a filter that would actually check the
>>spelling of the text parts of the message, and if misspelled words
>>exceeds, for example, 50%, then we can add a few points to the SPAM
>>score?
>>
>Been suggested many times. MANY times.
>
>Some drawbacks to consider:
>
>1) FPs on highly technical mail due to words not known to the spell
>checker.
>
>2) FPs on email sent by folks of the text-message generation. (OMG did u
>c he 8 it all!)
>
>3) FPs on email sent by lazy/stupid folks that can't spell.
>(Translation: management material)
>
>4) relatively quick and easy for spammers to adapt to.
>
>5) Relatively high CPU usage, given the above caveats in accurate.
>
>While the "50% misspelled" category restriction handles most cases of 3,
>it won't deal with 1 or 2. It also makes 4 very easy, all they need to
>do is insert a book-quote block.
>
>Very little spam has over 50% of its words misspelled right now except
>drug-obfu spam. Those guys will adapt to this in a flash, as they're
>VERY aggressive about optimizing for SA.
>
>(waves at the pharmacorp spammer who is likely reading this.)
>
I was thinking about that, and about using agrep (which has a Perl
module and is actually really handy) to check words against a list of
common spam keywords that get misspelled (we'll call it a "brownlist",
as it's filled with uninteresting crap)... I.e.

brownlist cialis vicodin viagra ...
brownlist porn penis ...

and then words could be looked up in the brownlist for approximate
matches...

You can do some interesting things in agrep, for instance making
repetitions of certain letters be an extremely low-distance change, so
that:

cialis => ciallis

is a very small vector distance (nearly identical). Ditto for
transpositions and substitutions:

cialis => cia1is

(note the L being replaced with a ONE).

Once this was working, there wouldn't be any benefit to misspelling
keywords, as the misspelled forms would be just as likely to trip the
filters as the correct spellings.

-Philip

>>I'm not sure how to begin coding this, but I think it should be
>>pretty easy (using pSpell or aSpell or something) and I think it would
>>be a very useful tool.
>>
>I don't think it would be.. IIRC someone actually tried this out in a
>test/devel kinda way about 2 or 3 years ago.
>
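
For anyone who wants to play with the brownlist idea, here is a rough sketch of the weighted-distance lookup in Python. It does not use agrep or its Perl module; it is a hand-rolled edit distance (the "optimal string alignment" variant) in which repeated letters, look-alike substitutions, and adjacent transpositions are cheap edits. The brownlist contents, the look-alike pairs, the cost values, the 0.5 threshold, and the function names are all assumptions picked for illustration, not anything SpamAssassin ships.

```python
# Hypothetical brownlist, look-alike pairs, and costs; all are illustrative guesses.
BROWNLIST = ["cialis", "vicodin", "viagra", "porn", "penis"]
LOOKALIKES = {("l", "1"), ("i", "1"), ("o", "0"), ("a", "@"), ("e", "3"), ("s", "5")}

CHEAP = 0.1  # cost of a "disguise" edit (repeat, look-alike, swap)
FULL = 1.0   # cost of any other edit


def sub_cost(a, b):
    """Substitution cost: free if equal, cheap for look-alikes, else full."""
    if a == b:
        return 0.0
    if (a, b) in LOOKALIKES or (b, a) in LOOKALIKES:
        return CHEAP
    return FULL


def repeat_cost(s, pos):
    """Cost of adding/dropping s[pos]: cheap if it repeats the previous char."""
    return CHEAP if pos > 0 and s[pos] == s[pos - 1] else FULL


def weighted_distance(keyword, word):
    """Edit distance between keyword and word with cheap disguise edits."""
    n, m = len(keyword), len(word)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + repeat_cost(keyword, i - 1)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + repeat_cost(word, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                d[i - 1][j] + repeat_cost(keyword, i - 1),  # char missing from word
                d[i][j - 1] + repeat_cost(word, j - 1),     # extra char in word
                d[i - 1][j - 1] + sub_cost(keyword[i - 1], word[j - 1]),
            )
            # Adjacent transposition, e.g. "viagra" -> "vaigra".
            if (i > 1 and j > 1
                    and keyword[i - 1] == word[j - 2]
                    and keyword[i - 2] == word[j - 1]):
                best = min(best, d[i - 2][j - 2] + CHEAP)
            d[i][j] = best
    return d[n][m]


def brownlist_match(word, threshold=0.5):
    """Return the closest brownlist keyword within threshold, else None."""
    word = word.lower()
    best = min(BROWNLIST, key=lambda k: weighted_distance(k, word))
    return best if weighted_distance(best, word) <= threshold else None
```

With these assumed costs, brownlist_match("ciallis") and brownlist_match("cia1is") both come back as "cialis", while an ordinary word such as "hello" returns None because it needs at least one full-cost edit against every keyword. A real rule would still have to deal with the CPU cost of running this over every word in a message, which is drawback 5 in Matt's list.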