[EMAIL PROTECTED]
> I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
> excellent.  However, over the last couple of months, it has become
> compromised by a particular type of spam that I believe, over time, will
> render Spambayes much less effective unless something is done.
>
> I expect you've seen these Spams - at the moment, they are always the
> stock-market related ones

I've seen a few drug spams using the same techniques, starting in July
-- but they seemed to dry up quickly.

> but I'm sure once others catch on, they will start
> to use the same technique.  The start of the email is a picture that looks
> like ordinary text but isn't.  All the spam info is in the picture.  The
> picture is followed by a whole load of randomly selected words.

You're probably not getting any reaction here because exactly the same
thing is currently being discussed on the SpamBayes "user" mailing
list, in this thread:

    Spam in Images
    http://mail.python.org/pipermail/spambayes/2006-August/date.html

> There are 2 bad things about this:
>
> 1.  These spams are successfully evading Spambayes in some cases.  Firstly
> the Spam usually reaches the "possible Spam" folder.  As a result, I am now
> spending significant time clearing out the possible spam folder whereas 2 or
> 3 months ago I wasn't.

Same here, except the time isn't significant.  If you don't believe
me, stop using SpamBayes for a week to rediscover what "significant"
means ;-)

>  Secondly, the odd spam is actually managing to get through as ham.  This
> is the first time this has happened ever.

Not here -- they're very good at scoring Unsure, but I haven't seen any
false negatives yet.

> 2.  Because I obviously mark these as Spam, all the randomly generated words
> in each spam email have their spam likelihood scores increased.  The result
> of this is that over time, the spam-scores for loads of perfectly
> non-spam-like words are being gradually increased.  The more this goes on,
> the more these "ham words" are being compromised.

I certainly haven't seen any ham pushed into "unsure" because of this,
and doubt it matters -- it generally doesn't hurt at all to have any
number of "ham words" show up in a few spam.  One of the
characteristics of the spam you're talking about that /makes/ it
effective is that it's very good at /not/ repeating gibberish phrases
across messages.  That's exactly why training on the gibberish is
ineffective at catching future messages of the same ilk.  But, OTOH,
the non-repetition also prevents it from "poisoning" your strong ham
tokens.  They get slightly less hammy, and that doesn't hurt because
most ham is nowhere near the unsure range.
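
To put rough numbers on "slightly less hammy", here is a minimal sketch
of a Robinson-style smoothed per-token spam probability.  The function
name, the prior strength s = 0.45 and the prior probability x = 0.5 are
assumptions made for illustration (check your own classifier options),
and the counts are made up:

```python
def token_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token (illustrative sketch).

    s and x are assumed values for the "unknown word" strength and
    probability; they are not read from any real configuration.
    """
    spamratio = spamcount / nspam   # fraction of spam containing the token
    hamratio = hamcount / nham      # fraction of ham containing the token
    if spamratio + hamratio == 0:
        return x                    # never-seen token: fall back to the prior
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount        # total evidence for this token
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * prob) / (s + n)


# Hypothetical counts: a word seen in 200 of 1000 ham and in no spam.
before = token_spamprob(0, 200, 1000, 1000)
# The same word after two gibberish spams containing it were trained.
after = token_spamprob(2, 200, 1002, 1000)
print(before, after)
```

With these made-up counts the score moves from roughly 0.001 to roughly
0.01:  a little less hammy, but still nowhere near the unsure range.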

> I suspect that this is why, to begin with, I only saw a few of these stock
> market emails, now I'm seeing loads

The only reason you see loads of any kind of spam is that it's making
a profit for the sender.  Pump-&-dump scams violate major securities
laws, and it's quite possible these scammers will quit before getting
too greedy (= getting caught).

> and over the last 2 or 3 weeks some have started to come in as ham.

While I haven't seen that, it's inconsistent with your explanation
above:  if your "ham tokens" /were/ being compromised, that makes it
/less/ likely that a message containing your ham tokens will be scored
as ham, not more likely.

A more likely explanation is simply "loads":  gibberish does have a
real chance of scoring as ham, and the more attempts are made, the
more likely one will succeed.  What they can't do is craft a message
that scores as ham for all users, or even for most.
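
The "loads" arithmetic can be sketched as below.  It assumes each
message is an independent attempt with the same small chance of scoring
as ham, which is a simplification, and the 1-in-1000 rate is a made-up
figure rather than a measurement:

```python
def chance_any_slips_through(p_single, attempts):
    """Probability that at least one of `attempts` independent messages
    scores as ham, if each one does so with probability p_single."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical rate: 1 gibberish spam in 1000 scores as ham.
for attempts in (10, 100, 1000):
    print(attempts, round(chance_any_slips_through(0.001, attempts), 3))
```

Under those assumptions a per-message rate that is invisible at ten
attempts becomes better than even odds by a thousand.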

> I fear that the long term effect of this will be to spoil spambayes bigtime.

Possibly.  People have panicked prematurely before ;-)

> I know that Spambayes has a deep-rooted principle in only using the bayesian
> algorithm and I wouldn't suggest changing that.  However, I am wondering if
> it might be possible to analyse these messages and include some parts of the
> hidden text relating to the picture that are not presently included in the
> bayesian statistics.

See the thread above.  Nobody knows a realistic way to extract the
text from these images (there is no "text" here -- just a large matrix
of individual pixels, something the human eye/brain system is very
much better at decoding than programs).  OTOH, the images themselves
probably have many statistical characteristics not shared with
"legitimate" images, and those can be computed/extracted with finite
effort.
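
As a rough illustration of the "finite effort" end of that, the sketch
below reads only the fixed GIF header and turns the dimensions and byte
count into coarse tokens.  The function name and the token spellings
are hypothetical, invented for this example; whether such tokens
actually separate spam images from legitimate ones would have to be
tested:

```python
import struct


def gif_header_tokens(data):
    """Yield hypothetical tokens describing a GIF's header (sketch only)."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return
    # Logical screen width and height: two little-endian 16-bit integers
    # stored right after the 6-byte signature.
    width, height = struct.unpack("<HH", data[6:10])
    yield "image:gif"
    # Bucket by powers of two so near-identical sizes share a token.
    yield "image:width:2**%d" % max(width.bit_length() - 1, 0)
    yield "image:height:2**%d" % max(height.bit_length() - 1, 0)
    yield "image:bytes:2**%d" % (len(data).bit_length() - 1)


# Fake 500x300 "GIF": a valid header followed by padding, 1024 bytes total.
fake = b"GIF89a" + struct.pack("<HH", 500, 300) + b"\0" * 1014
print(list(gif_header_tokens(fake)))
```

No pixels are decoded here; the point is only that cheap, fixed-cost
features exist even when the text in the picture is out of reach.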

> My thesis is this - I rarely get pictures in my email that are not just
> attachments - virtually all pictures that are embedded into the mail seem
> to be spam.

Of course that varies.  For example, it's very easy to create embedded
pictures in Outlook, and even small children know how to do it.
Worse, their grandparents are required by law to consider such email
"ham" :-)

> So if there is some token or tag in the email that represents the embedded
> picture that can be included in the bayesian analysis, this might fix the
> problem.

This is harder in Outlook because Outlook destroys the original MIME
structure of the email before SpamBayes sees it.  There are already
several such tokens generated when the original MIME structure is
available.  In Outlook, it's most likely you'll get the single
synthesized token:

    virus:src="cid:

or a simple variation on that, and that's all that remains of the
embedded GIF.  A single token helps a bit, but not enough.  Do note
that pump-&-dump scams don't even contain a URL to click on:  they
want you to buy the stock on the open market, not send them money
directly.  That also makes it a unique (and uniquely effective) kind
of spam:  the pitch is /entirely/ buried in the GIF, with no useful
text (not even a URL) of any kind to tokenize.
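
For the case where the original MIME structure *is* available (so not
Outlook), a minimal sketch of spotting embedded pictures with the
standard library's email package follows.  The function name and the
"embedded-image:" token are made up for illustration; they are not the
tokens SpamBayes generates:

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def embedded_image_tokens(msg):
    """Return hypothetical tokens for image parts that carry a Content-ID."""
    tokens = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # A Content-ID header is what an HTML body's src="cid:..." points at.
        if part.get("Content-ID"):
            tokens.append("embedded-image:" + part.get_content_subtype())
    return tokens


# Build a toy message: an HTML body plus one GIF referenced via cid:.
outer = MIMEMultipart("related")
outer.attach(MIMEText('<img src="cid:pic1">', "html"))
gif = MIMEImage(b"GIF89a", _subtype="gif")
gif.add_header("Content-ID", "<pic1>")
outer.attach(gif)

# Round-trip through bytes, as if the message had arrived over the wire.
parsed = message_from_bytes(outer.as_bytes())
print(embedded_image_tokens(parsed))
```

Even then this yields only a token or two per message, so it runs into
the same "helps a bit, but not enough" limit.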

> I hope that this suggestion is useful - I certainly fear for the future of
> Spambayes if this new spam threat is not dealt with....

Don't assume that most spammers are capable of becoming competent :-)
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
