I just checked in a couple of significant changes to the OCR code. First, I added support for converting input images using PIL, which means the netpbm tools are no longer required. PIL is faster and more robust than netpbm, and it is platform-independent. Perhaps someone in Windows-land can take the time to see if it's possible to build ocrad on Windows. We could then (in theory, at least) distribute an ocrad installer alongside the SpamBayes Windows installer and perform crude, but apparently effective, OCR analysis of image-based spam.

The second change to the OCR code was the addition of a simple pickled cache file (controlled by the "crack_image_cache" option). The conversion to netpbm format is still required; however, the ocrad step is skipped if the md5 hexdigest of the generated image is present in the cache. In that case the cached text and tokens are returned.
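For anyone who wants the idea of the cache without reading the checkin, here is a minimal sketch of it. It is not the checked-in code: the names load_cache, save_cache, cached_ocr and run_ocr are made up for illustration, run_ocr stands in for the ocrad step, and the whitespace split is only a placeholder for the real tokenizing.

```python
import hashlib
import os
import pickle


def load_cache(path):
    """Return the pickled cache dict, or an empty dict if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def save_cache(cache, path):
    """Write the cache dict back out as a pickle."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)


def cached_ocr(pnm_bytes, cache, run_ocr):
    """Return (text, tokens) for a converted image, running OCR only on a miss.

    pnm_bytes is the image after conversion to netpbm format.
    run_ocr is a hypothetical callable standing in for the ocrad step;
    it takes the image bytes and returns the extracted text.
    """
    key = hashlib.md5(pnm_bytes).hexdigest()
    if key in cache:
        # Cache hit: skip the OCR step entirely.
        return cache[key]
    text = run_ocr(pnm_bytes)
    tokens = text.split()  # placeholder, not the real tokenizer
    cache[key] = (text, tokens)
    return cache[key]
```

Keying on the digest of the converted image, rather than the original attachment, is why the conversion still has to run on every message even when the OCR step is skipped.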
I have no Windows capability, so someone else will have to take the steps necessary to make this all play on Windows. There are a few other things that need testing, but I'm out of time. First, I arbitrarily set an upper limit of 100 kbytes on input images (per image, before converting to netpbm). I think that allows all images that would hold spam content, but I'm not sure I have many images in my training database besides spam. I don't know if that's a useful cutoff or if there should even be a cutoff. Second, I observed that ocrad routinely seemed to get the letter case wrong (e.g. coming up with "EGLy" instead of "EGLY"), so I blindly downshift its output. I have nothing other than that simple observation to suggest that should be done. Third, if other people have training databases, running N-fold cross-validation tests of these new gimmicks would be beneficial. It would be nice if others could verify my results before a new release is made. Finally, if you're a Python programmer (or aspire to be one), picking through the new code would be a good check. Too bad the summer's nearly over. We could use a Summer of Code intern...

Skip
