"Seth Goodman" <[EMAIL PROTECTED]> writes: <snip good stuff about how much more amazing the visual cortex is than any OCR algorithm can be> Yeah, sure, I know all that.
> Make OCR as "spam-specific" as you like, but it will require
> tweaking each time spammers change to an unusual font, background
> noise or text distortion.

Not necessarily. There is voice recognition software that's resilient
against minor variations in accent, noise, and distortions. In
principle, the same could apply to OCR spam recognition, given the
right models, so it wouldn't be "each time."

> I don't want to seem morose about this, but I don't believe it's a
> battle we can ultimately win. It can still assist Spambayes
> classifying messages with image spam, but it's not a silver bullet.

Yeah. The problem I'm having right now, I think, is that in those
messages where the image spam isn't successfully OCR'd, the garbage
words around the image get trained and degrade the overall performance
of my system. Of course, that's just a guess, but it sure seems like
these days a lot more plain spam messages that ought to be recognized
as such are sneaking through than used to.

> This is really a problem to be solved at the MTA with stricter
> connection rules.

What did you have in mind?

> Nonetheless, I suspect that Spambayes could improve
> by creating more synthetic tokens that describe the image better and
> taking advantage of serendipitous differences between tokens for image
> spam and those in each user's ham. I'm not sure what those attributes
> are, but it probably beats trying to keep up with a quickly evolving
> captcha. Outlook doesn't help the situation, as it destroys much of the
> MIME armor that might provide useful spam clues.

Fortunately, I'm not an Outlook slave.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
