David Abrahams wrote on Thursday, February 01, 2007 10:56 AM -0600:

> "Seth Goodman" <[EMAIL PROTECTED]> writes:
>
> <snip good stuff about how much more amazing the visual cortex is
> than any OCR algorithm can be>
> Yeah, sure, I know all that.
>
> > Make OCR as "spam-specific" as you like, but it will require
> > tweaking each time spammers change to an unusual font, background
> > noise or text distortion.
>
> Not necessarily.  There is voice recognition software that's resilient
> against minor variations in accent, noise, and distortions.  In
> principle, the same could apply to OCR spam recognition, given the
> right models, so it wouldn't be "each time."

As a practical example, people have been using AOI (automated optical
inspection) in hardware manufacturing for years.  Despite the obvious
value of having such a technology work well, very few people who have
actually used it will tell you that it's worth the trouble.  This is not
due to lack of effort or lack of talent applied to the problem.  The
difficulties involved with small differences of color temperature of
lighting, surface reflectance and orientation changes make this a
babysitting nightmare.  OTOH, people look at the monitor and reliably
tell good parts from bad ones in a fraction of a second.  Each new
visual "clue" causes the software folks to go away for a week or two to
tweak the application.  In theory, these should not be big problems, but
they still are.

The "right models" have eluded the best minds in the AOI business for a
much more constrained problem, so I'm not very confident we can stay
ahead of a group that actively obfuscates messages into images.


>
> > I don't want to seem morose about this, but I don't believe it's a
> > battle we can ultimately win.  It can still assist Spambayes
> > classifying messages with image spam, but it's not a silver bullet.
>
> Yeah.  The problem I'm having right now, I think, is that in those
> messages where the image spam isn't successfully OCR'd, the garbage
> words around the image get trained and degrade the overall performance
> of my system.  Of course, that's just a guess, but it sure seems like
> these days a lot more plain spam messages that ought to be recognized
> as such are sneaking through than used to.

At least on my system, Spambayes works very well on non-image spam, and
it is at least partly effective on image spam.  The word salad they use
to drown out significant clues generally fails, but if they throw enough
words at it, they sometimes dilute the spam clues sufficiently.  That
they throw hundreds of "noise" words at the filters for every spam clue
they want to hide, and Bayesian filters still catch half to
three-quarters of it, shows how powerful the Bayesian approach really is.
Skip's OCR approach is just meant to bring us above the noise floor again on
this class of spam.  You only need a few good clues to push the
classification over the threshold, so you can miss most of them and
still succeed.
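
To make the dilution effect concrete, here is a minimal sketch of
chi-squared combining in the general style Spambayes uses.  The per-word
probabilities, the 0.1 strength cutoff and the function names are
assumptions chosen for illustration, not values from a real training
database.

```python
import math

def chi2q(x2, df):
    """Probability that a chi-squared variable with df (even) degrees of
    freedom exceeds x2.  Adequate for small inputs; a very large x2
    would underflow in math.exp."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def spam_score(word_probs, min_strength=0.1):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    # Ignore bland words whose probability is close to 0.5.
    clues = [p for p in word_probs if abs(p - 0.5) >= min_strength]
    if not clues:
        return 0.5
    n = len(clues)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clues), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clues), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

spam_clues = [0.99] * 5                       # hypothetical strong spam words
clean = spam_score(spam_clues)                # no word salad
light = spam_score(spam_clues + [0.2] * 10)   # a little hammy noise
heavy = spam_score(spam_clues + [0.2] * 50)   # a lot of hammy noise
```

With these made-up numbers the score falls as more hammy noise is added,
which is the dilution described above.  Each clue recovered from an
image adds another strong term on the spam side.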


> > This is really a problem to be solved at the MTA with stricter
> > connection rules.
>
> What did you have in mind?

There are a lot of clues that you can use in an MTA when deciding which
connections you accept.  By combining a number of these behavioral
clues, you can reject most of the garbage at the envelope stage of the
SMTP transaction when it costs you the least.  For every spam that
Spambayes finds in your inbox, there are hundreds, sometimes thousands,
of incoming messages that your MTA refuses to accept.  A small
improvement at this stage makes a big difference in what Spambayes has
to classify.  Since most spam today comes from trojaned Windows
machines, anything that can differentiate those hosts from legitimate
mail systems, especially at the envelope stage, is a clue you want to
pay attention to.  Here are a few examples:

- zombie hosts tend to be weak on SMTP etiquette, so one clue is that
they often fail to wait when asked; having your server hold off for 30
seconds before sending the "connect banner" often tricks impatient
zombies into spewing early, and you can then hang up;

- legitimate mail systems tend to have static IPs with properly
configured reverse DNS that matches their forward DNS; zombies tend to
have either no reverse DNS, or PTR records that do not match their A
records, and their forward DNS is often dynamic;

- legitimate mail systems generally identify themselves at the beginning
of the SMTP conversation with a legitimate host name; zombies often try
to use one of your host names, hoping to make you think you are talking
to a local host on your own network, or a host name like "fred" that
does not resolve to an IP address.
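
The three checks above could be sketched roughly as follows.  This is an
illustration only, not how any particular MTA implements them: the
function names and the injectable "reverse"/"forward" resolver
parameters are hypothetical (added so the logic can be exercised without
real DNS), and the DNS helpers handle IPv4 only.

```python
import select
import socket

def is_early_talker(conn, delay_seconds=30.0):
    """True if the client sends anything (or hangs up) while we are
    still holding back our greeting banner."""
    readable, _, _ = select.select([conn], [], [], delay_seconds)
    return bool(readable)

def reverse_dns_matches(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """True if the PTR name for ip resolves back to the same ip."""
    try:
        name = reverse(ip)[0]
        addresses = forward(name)[2]
    except (OSError, UnicodeError):   # OSError covers herror and gaierror
        return False
    return ip in addresses

def helo_is_suspicious(helo_name, our_names,
                       forward=socket.gethostbyname_ex):
    """True if the HELO name is one of ours or does not resolve."""
    name = helo_name.strip().rstrip(".").lower()
    if name in {n.lower() for n in our_names}:
        return True                   # client claims to be one of our hosts
    try:
        forward(name)
    except (OSError, UnicodeError):
        return True                   # e.g. "fred" with no address record
    return False
```

In a real server, is_early_talker would be called on the accepted
connection before the 220 greeting is written.  None of these results is
definitive on its own; each is just one clue to weigh with the others.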

There are a large number of other possible clues along these lines
(behavioral heuristics), most of them not individually definitive.
Reasonable people disagree on which clues are the most important and
which you should ignore, so this knowledge is tricky to apply.  If you
can come up with enough different types of behavior to observe, you
might apply Bayesian classification to some advantage over trying to
figure out the significant correlations on your own.  I don't know if
you've played with rule-based spam filters that use word lists and
regular expressions, but it's an interesting exercise, and it's
surprising how often our intuition is wrong.
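
As a rough sketch of applying Bayesian classification to behavioral
clues instead of words, the following treats each observed clue as a
token in a small naive Bayes classifier.  The class name, clue names and
training data are all made up for illustration; this is not Spambayes
code, and a real deployment would train on logged connections.

```python
import math
from collections import Counter

class ClueClassifier:
    """Naive Bayes over the set of clues observed for a connection."""

    def __init__(self):
        self.totals = {"spam": 0, "ham": 0}
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, clues, label):
        self.totals[label] += 1
        self.counts[label].update(set(clues))

    def _clue_prob(self, clue, label):
        # Laplace smoothing so unseen clues never give a zero probability.
        return (self.counts[label][clue] + 1) / (self.totals[label] + 2)

    def spam_probability(self, clues):
        # Start from the smoothed prior odds, then add each clue's evidence.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for clue in set(clues):
            log_odds += math.log(self._clue_prob(clue, "spam")
                                 / self._clue_prob(clue, "ham"))
        return 1.0 / (1.0 + math.exp(-log_odds))

classifier = ClueClassifier()
for _ in range(8):                               # hypothetical zombie connections
    classifier.train(["no_rdns", "early_talker"], "spam")
for _ in range(8):                               # hypothetical well-behaved hosts
    classifier.train(["valid_helo"], "ham")
classifier.train(["no_rdns", "valid_helo"], "ham")   # a sloppy but legitimate host

zombie_like = classifier.spam_probability(["no_rdns", "early_talker"])
polite = classifier.spam_probability(["valid_helo"])
```

The point of the design is that the training data decides how much each
clue is worth, rather than a hand-tuned rule weight.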

--
Seth Goodman

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
