With the current crop of pump & dump spams I decided to break down and actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would help. It does a miserable job from a readability standpoint at extracting text from an image, but SpamBayes seems to love what it does generate. This morning I thought, "what the hell", and checked in all the current new tricks I've been working on/with:
* IP address lookup and more extensive tokenization. This is from Matt Cowles. I added persistence beyond the current run. Unfortunately, the dbm persistence is untested (though should probably work okay) while the zodb persistence still has problems (writes the file the first time, but doesn't update it on successive runs). Maybe someone can look at those issues. This seems to work very well for those spams where the only useful clue is a URL, but with a domain name that changes each time. They seem to pretty much all point to the same IP address as far as I can tell. Enabled using the x-lookup_ip and lookup_ip_cache options. Requires installation of PyDNS.

* Note image size. This was my first stab at trying to get some information out of an image. Seems to work pretty well. Enabled using the x-image_size option.

* Note short runs of too-short words. Text spammers (as opposed to image spammers) seem to like to use this technique:

      X j A m N j A d X h M k E z R d I p D u I m A c C o I d A t L j I v S j

  to hide their tokens from spam filters. Enabled using the x-short_runs option. Based on my current database I'm skeptical this will add much over what else we already have.

* Try OCR on images. The latest technique we've all encountered seems to be the pump and dump stock scams where the entire come-on is embedded in one or more GIF images. I wrote a small ImageStripper module which handles these. It grabs the image parts, converts them to netpbm format, concatenates them left-to-right, then submits the result to ocrad. This is just a proof-of-concept. It requires ocrad and netpbm to be available. As such I suspect it will only run currently on Unix-like systems. Enabled using the x-crack_images and max_image_size options.

I added these extensions using multiple checkins, so if we decide to back one or more of them out it shouldn't be a major PITA.
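
For anyone who wants to play with the short-runs idea without digging through the checkin, here is a rough sketch of what such a tokenizer pass might look like. This is not the code that was checked in: the function name short_run_tokens, the "short-run:N" token format, and the max_word_len and min_run thresholds are all made up for illustration.

```python
def short_run_tokens(text, max_word_len=2, min_run=3):
    """Return one token per distinct length of a run of too-short words.

    Hypothetical sketch: a "too-short" word has at most max_word_len
    characters, and only runs of at least min_run such words are noted.
    """
    run = 0            # length of the run of short words currently in progress
    lengths = set()    # distinct run lengths seen so far
    for word in text.split():
        if len(word) <= max_word_len:
            run += 1
        else:
            # A normal-length word ends the current run.
            if run >= min_run:
                lengths.add(run)
            run = 0
    # The text may end in the middle of a run.
    if run >= min_run:
        lengths.add(run)
    return ["short-run:%d" % n for n in sorted(lengths)]
```

Ordinary prose rarely strings three or more one- or two-letter words together, so it should mostly produce no tokens, while filler like the example above collapses into a single token recording the run length.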
Skip