David Abrahams wrote on Thursday, February 01, 2007 10:56 AM -0600:

> "Seth Goodman" <[EMAIL PROTECTED]> writes:
>
> <snip good stuff about how much more amazing the visual cortex is
> than any OCR algorithm can be>
>
> Yeah, sure, I know all that.
>
> > Make OCR as "spam-specific" as you like, but it will require
> > tweaking each time spammers change to an unusual font, background
> > noise or text distortion.
>
> Not necessarily.  There is voice recognition software that's resilient
> against minor variations in accent, noise, and distortions.  In
> principle, the same could apply to OCR spam recognition, given the
> right models, so it wouldn't be "each time."

As a practical example, people have been using AOI (automated optical
inspection) in hardware manufacturing for years.  Despite the obvious
value of having such a technology work well, very few people who have
actually used it will tell you that it's worth the trouble.  This is
not due to lack of effort or lack of talent applied to the problem.
Small differences in the color temperature of the lighting, in surface
reflectance and in part orientation make this a babysitting nightmare.
OTOH, people look at the monitor and reliably tell good parts from bad
ones in a fraction of a second.  Each new visual "clue" causes the
software folks to go away for a week or two to tweak the application.
In theory, these should not be big problems, but they still are.  The
"right models" have eluded the best minds in the AOI business for a
much more constrained problem, so I'm not very confident we can stay
ahead of a group that actively obfuscates messages into images.

> > I don't want to seem morose about this, but I don't believe it's a
> > battle we can ultimately win.  It can still assist Spambayes
> > classifying messages with image spam, but it's not a silver bullet.
>
> Yeah.  The problem I'm having right now, I think, is that in those
> messages where the image spam isn't successfully OCR'd, the garbage
> words around the image get trained and degrade the overall performance
> of my system.  Of course, that's just a guess, but it sure seems like
> these days a lot more plain spam messages that ought to be recognized
> as such are sneaking through than used to.

At least on my system, Spambayes works very well on non-image spam,
and it is at least partly effective on image spam.  The word salad
they use to drown out significant clues generally fails, but if they
throw enough words at it, they sometimes dilute the spam clues
sufficiently.
The fact that they throw hundreds of "noise" words at the filters for
every spam clue they want to hide, and Bayesian filters still catch
half or three-quarters of it, shows how powerful the Bayesian approach
really is.  Skip's OCR approach is just meant to bring us above the
noise floor again on this class of spam.  You only need a few good
clues to push the classification over the threshold, so you can miss
most of them and still succeed.

> > This is really a problem to be solved at the MTA with stricter
> > connection rules.
>
> What did you have in mind?

There are a lot of clues that you can use in an MTA when deciding
which connections to accept.  By combining a number of these
behavioral clues, you can reject most of the garbage at the envelope
stage of the SMTP transaction, when it costs you the least.  For every
spam that Spambayes finds in your inbox, there are hundreds, sometimes
thousands, of incoming messages that your MTA refuses to accept.  A
small improvement at this stage makes a big difference in what
Spambayes has to classify.  Since most spam today comes from trojaned
Windows machines, anything that can differentiate those hosts from
legitimate mail systems, especially at the envelope stage, is a clue
you want to pay attention to.
Here are a few examples:

- zombie hosts tend to be weak on SMTP etiquette, so one clue is that
  they often fail to wait when asked; making the SMTP client wait 30
  seconds for your "connect banner" often tricks impatient zombies
  into spewing before they have been greeted, and you can then hang
  up;

- legitimate mail systems tend to have static IPs with properly
  configured reverse DNS that matches their forward DNS; zombies tend
  to have either no reverse DNS, or PTR records that do not match
  their A records, and their forward DNS is often dynamic;

- legitimate mail systems generally identify themselves at the
  beginning of the SMTP conversation with a legitimate host name;
  zombies often try to use one of your host names, hoping to make you
  think you are talking to a local host on your own network, or a host
  name like "fred" that does not resolve to an IP address.

There are a large number of other possible clues along these lines
(behavioral heuristics), most of them not individually definitive.
Reasonable people disagree on which clues are the most important and
which you should ignore, so this knowledge is tricky to apply.  If you
can come up with enough different types of behavior to observe, you
might get more out of applying Bayesian classification to them than
out of trying to figure out the significant correlations on your own.
I don't know if you've played with rule-based spam filters that use
word lists and regular expressions, but it's an interesting exercise,
and it's surprising how often our intuition is wrong.

--
Seth Goodman
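
A rough Python sketch of the three example clues above (early talker,
reverse/forward DNS mismatch, bogus HELO name) might look like the
following.  It is only an illustration under stated assumptions, not
how any particular MTA does it: the function names talked_early,
fcrdns_ok and helo_suspicious are made up for this example, the DNS
checks are IPv4-only, and the 30-second default is simply the figure
mentioned above.

```python
import select
import socket


def talked_early(conn, delay=30.0):
    """Return True if the socket becomes readable before we greet it.

    Call right after accept() and before writing the 220 banner.
    Readable means the client sent data (or hung up) without waiting.
    """
    readable, _, _ = select.select([conn], [], [], delay)
    return bool(readable)


def fcrdns_ok(ip, reverse=socket.gethostbyaddr,
              forward=socket.gethostbyname_ex):
    """Return True if the PTR name for ip resolves back to ip (IPv4)."""
    try:
        name = reverse(ip)[0]          # PTR lookup
        addresses = forward(name)[2]   # A lookup on the PTR name
    except OSError:                    # herror/gaierror: lookup failed
        return False
    return ip in addresses


def helo_suspicious(helo, local_names, forward=socket.gethostbyname_ex):
    """Flag HELO names that claim to be us, are bare, or do not resolve.

    local_names is a set of our own host names, in lower case.
    """
    name = helo.strip().lower().rstrip(".")
    if name.startswith("["):
        return False                   # address literal: not judged here
    if name in local_names or "." not in name:
        return True                    # one of our names, or like "fred"
    try:
        forward(name)
    except OSError:
        return True                    # does not resolve
    return False
```

The resolver functions are passed in as parameters so the checks can
be exercised with canned answers instead of live DNS.  None of these
results is definitive on its own; each is just one clue to weigh.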

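
As for applying Bayesian classification to behavioral clues, a
minimal sketch could look like the one below.  It is plain naive
Bayes with add-one smoothing over sets of clue names, which is an
assumption made for brevity and not the combining scheme Spambayes
itself uses.  The class name ClueBayes and clue names such as
"early_talker" are hypothetical, and like a token-based filter it
only looks at clues that are present, not at ones that are absent.

```python
import math
from collections import Counter


class ClueBayes:
    """Naive Bayes over sets of observed clues, with add-one smoothing."""

    def __init__(self):
        self.totals = {"spam": 0, "ham": 0}
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, clues, label):
        """Record one connection's clues under label 'spam' or 'ham'."""
        self.totals[label] += 1
        self.counts[label].update(set(clues))

    def spam_probability(self, clues):
        """Combine the clues seen on a connection into one probability."""
        # Work in log-odds so that many weak clues simply add up.
        log_odds = math.log((self.totals["spam"] + 1)
                            / (self.totals["ham"] + 1))
        for clue in set(clues):
            p_spam = (self.counts["spam"][clue] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][clue] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))
```

You would call train() with the clues observed on connections whose
mail later turned out to be spam or ham, then call
spam_probability() on the clues from a new connection.  The point is
that the training data, rather than intuition, decides how much each
clue is worth.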