So, I've been looking over the kinds of messages that SA is missing on my mail stream. Most of them still seem to contain trigger words that would cause high scores; the words are just slightly masked. Has anyone talked about applying some kind of fuzzy-matching technique? The idea would be to take the trigger words and generate a large set of patterns that match them, based on rules such as:

    'a' => /(a|A|@)/
    'x' => /(x|X|><)/

You might even be able to use a large corpus of spam to derive these rules automatically. (A corpus of parsed-out and "translated" tokens would work better, obviously.)

You could also introduce some Hamming-distance effects to the match, so that, say, hamming_match( /hello/ , 1 ) would match "helldo", "helo", and "hillo". (Strictly speaking, the first two involve an insertion and a deletion, so what I really mean is edit distance rather than pure Hamming distance.)

And then there's the possibility of doing phonetic matching, like many spellcheckers.

Using any or all of this would be pretty processor intensive (probably much more practical for ISPs than for individual users), but it seems like it would kill off almost all of the new crop of SA-evading spam. Maybe somebody could lure Larry Wall into building this kind of fuzzy-match technology directly into the next major version of Perl? *g*

Just thought I'd throw that out there. Aside from that, I'll probably lurk for a while; if I end up feeling out of my depth (which is possible, since my actual day job is as a linguist, and most of my coding skills, such as they are, are aimed at that), I'll unsub.

Thanks,
Auros

------------------------------------------------------------------------
R Michael Harman / Auros Symtheos
[EMAIL PROTECTED] ............ http://www.auros.org/
Linguist and Eclectic Engineer, Lexicus, Motorola
[EMAIL PROTECTED] ......... http://www.lexicus.mot.com/
Senior Reviews Editor, Strange Horizons Speculative Fiction Weekly
[EMAIL PROTECTED] ... http://www.strangehorizons.com/
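
A minimal Python sketch of the substitution-rule idea from the post, not anything SA actually does. The LOOKALIKES table and the fuzzy_pattern name are hypothetical; only the 'a' and 'x' entries come from the rules above, and the 'e' entry is an illustrative guess. Each letter of a trigger word is expanded into an alternation of its look-alikes, and the result is compiled case-insensitively.

```python
import re

# Hypothetical substitution table. The 'a' and 'x' rows mirror the rules
# in the post; the 'e' row is an assumption added for illustration.
LOOKALIKES = {
    "a": ["a", "@"],
    "e": ["e", "3"],
    "x": ["x", "><"],
}


def fuzzy_pattern(word):
    """Compile a case-insensitive regex for `word` and its masked variants."""
    parts = []
    for ch in word.lower():
        variants = LOOKALIKES.get(ch, [ch])
        # re.escape keeps look-alike characters literal inside the pattern.
        alternation = "|".join(re.escape(v) for v in variants)
        parts.append("(?:" + alternation + ")")
    return re.compile("".join(parts), re.IGNORECASE)


pattern = fuzzy_pattern("relax")
masked_hit = pattern.search("time to REL@>< today")
plain_miss = pattern.search("a relay race")
```

Here masked_hit is a match object and plain_miss is None. A real rule set would need a much larger table, which is where deriving the rules from a spam corpus would come in.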

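
The distance-based matching in the hamming_match example can be sketched the same way. Because "helldo" and "helo" differ in length from "hello", this sketch swaps in Levenshtein edit distance (insertions, deletions, and substitutions) rather than strict Hamming distance. The names edit_distance and fuzzy_match are hypothetical, and the code compares plain tokens instead of taking a regex as the post's example does.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(trigger, token, max_dist=1):
    """True if token is within max_dist edits of trigger, ignoring case."""
    return edit_distance(trigger.lower(), token.lower()) <= max_dist


hits = [t for t in ["helldo", "helo", "hillo", "help"]
        if fuzzy_match("hello", t)]
```

With max_dist=1, hits contains the three variants from the post and leaves out "help", which is two edits away. Running this against every token for every trigger word is the processor-intensive part; a tighter max_dist for short words would probably be needed to keep false positives down.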