Re: [spambayes-dev] Spambayes is starting not to work due to retaliatory action by spammers

James Masters Sat, 05 Aug 2006 14:16:27 -0700

Dear Tim,

Thank you very much for your comprehensive reply, and apologies to the group
for posting my email in the wrong place.  If I have anything more to write,
I'll put it in the forum you mention.


thanks,

James.

> -----Original Message-----
> From: Tim Peters [mailto:[EMAIL PROTECTED]
> Sent: 03 August 2006 08:26
> To: [EMAIL PROTECTED]
> Cc: [email protected]
> Subject: Re: [spambayes-dev] Spambayes is starting not to work due to
> retaliatory action by spammers
>
>
> [EMAIL PROTECTED]
> > I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
> > excellent.  However, over the last couple of months, it has become
> > compromised by a particular type of spam that I believe, over time, will
> > render Spambayes much less effective unless something is done.
> >
> > I expect you've seen these Spams - at the moment, they are always the
> > stock-market related ones
>
> I've seen a few drug spams using the same techniques, starting in July
> -- but they seemed to dry up quickly.
>
> > but I'm sure once others catch on, they will start
> > to use the same technique.  The start of the email is a picture that looks
> > like ordinary text but isn't.  All the spam info is in the text.  The
> > picture is followed by a whole load of randomly selected words.
>
> You're probably not getting any reaction here because exactly the same
> thing is currently being discussed on the SpamBayes "user" mailing
> list, in this thread:
>
>     Spam in Images
>     http://mail.python.org/pipermail/spambayes/2006-August/date.html
>
> > There are 2 bad things about this:
> >
> > 1.  These spams are successfully evading Spambayes in some cases.  Firstly
> > the Spam usually reaches the "possible Spam" folder.  As a result, I am now
> > spending significant time clearing out the possible spam folder whereas 2 or
> > 3 months ago I wasn't.
>
> Same here, except the time isn't significant.  If you don't believe
> me, stop using SpamBayes for a week to rediscover what "significant"
> means ;-)
>
> >  Secondly, the odd spam is actually managing to get through as ham.  This
> > is the first time this has happened ever.
>
> Not here -- they're very good at scoring Unsure, but haven't seen any
> false negatives yet.
>
> > 2.  Because I obviously mark these as Spam, all the randomly generated words
> > in each spam email have their spam likelihood scores increased.  The result
> > of this is that over time, the spam-scores for loads of perfectly
> > non-spam-like words are being gradually increased.  The more this goes on,
> > the more these "ham words" are being compromised.
>
> I certainly haven't seen any ham pushed into "unsure" because of this,
> and doubt it matters -- it generally doesn't hurt at all to have any
> number of "ham words" show up in a few spam.  One of the
> characteristics of the spam you're talking about that /makes/ it
> effective is that it's very good at /not/ repeating gibberish phrases
> across messages.  That's exactly why training on the gibberish is
> ineffective at catching future messages of the same ilk.  But, OTOH,
> the non-repetition also prevents it from "poisoning" your strong ham
> tokens.  They get slightly less hammy, and that doesn't hurt because
> most ham is nowhere near the unsure range.
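
The "slightly less hammy" effect can be illustrated with a minimal sketch of a Robinson-style smoothed token probability. The function name, the message counts, and the prior settings `s=0.45` and `x=0.5` are assumptions chosen for illustration; this is not presented as the exact SpamBayes code.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for a single token.

    spam_count / ham_count: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages.
    s, x: assumed prior strength and prior probability for unseen tokens.
    """
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Blend the observed ratio with the prior, weighted by the evidence n.
    return (s * x + n * p) / (s + n)


# Hypothetical strong ham word: in 300 of 1000 ham messages, in no spam.
before = token_spam_prob(0, 300, 1000, 1000)

# The same word after two gibberish spams containing it are trained as spam.
after = token_spam_prob(2, 300, 1002, 1000)

print(before, after)
```

With these made-up counts the score rises a little but stays very close to 0, because the gibberish rarely repeats the same word often enough to outweigh hundreds of ham sightings.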
>
> > I suspect that this is why, to begin with, I only saw a few of these stock market
> > emails, now I'm seeing loads
>
> The only reason you see loads of any kind of spam is that it's making
> a profit for the sender.  Pump-&-dump scams violate major securities
> laws, and it's quite possible these scammers will quit before getting
> too greedy (= getting caught).
>
> > and over the last 2 or 3 weeks some have started to come in as ham.
>
> While I haven't seen that, it's inconsistent with your explanation
> above:  if your "ham tokens" /were/ being compromised, that makes it
> /less/ likely that a message containing your ham tokens will be scored
> as ham, not more likely.
>
> A more likely explanation is simply "loads":  gibberish does have a
> real chance of scoring as ham, and the more attempts are made, the
> more likely one will succeed.  What they can't do is craft a message
> that scores as ham for all users, or even for most.
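
The "loads" argument is ordinary probability: if each gibberish message has some small chance of scoring as ham, the chance that at least one gets through grows with the number of attempts. A minimal sketch, assuming the attempts are independent and using made-up per-message probabilities (the function name is hypothetical):

```python
def prob_any_gets_through(p_single, attempts):
    """Chance that at least one of `attempts` messages scores as ham,
    assuming each independently succeeds with probability p_single."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical per-message success rate of 0.1%, at increasing volumes.
for attempts in (10, 100, 1000):
    print(attempts, prob_any_gets_through(0.001, attempts))
```

The per-message probability differs for every user's training data, which is why the same message cannot be tuned to score as ham for everyone.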
>
> > I fear that the long term effect of this will be to spoil spambayes bigtime.
>
> Possibly.  People have panicked prematurely before ;-)
>
> > I know that Spambayes has a deep-rooted principle in only using the bayesian
> > algorithm and I wouldn't suggest changing that.  However, I am wondering if
> > it might be possible to analyse these messages and include some parts of the
> > hidden text relating to the picture that are not presently included in the
> > bayesian statistics.
>
> See the thread above.  Nobody knows a realistic way to extract the
> text from these images (there is no "text" here -- just a large matrix
> of individual pixels, something the human eye/brain system is very
> much better at decoding than programs).  OTOH, the images themselves
> probably have many statistical characteristics not shared with
> "legitimate" images, and those can be computed/extracted with finite
> effort.
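
As one illustration of cheap image characteristics, the sketch below reads the width and height from a GIF header and turns them, together with the byte size, into coarse power-of-two bucket tokens. The token names and the bucketing scheme are hypothetical, and nothing here claims these particular features separate spam images from legitimate ones; it only shows that such values can be extracted without decoding any text.

```python
import struct


def gif_size_tokens(data):
    """Return coarse size tokens for raw GIF bytes, or [] if not a GIF.

    Token names are made up for illustration. Each value v is bucketed
    by v.bit_length(), so a token "2**k" means v < 2**k.
    """
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return []
    # Logical screen width and height: two little-endian 16-bit integers
    # immediately after the 6-byte signature.
    width, height = struct.unpack("<HH", data[6:10])
    return [
        "gif-width:2**%d" % width.bit_length(),
        "gif-height:2**%d" % height.bit_length(),
        "gif-bytes:2**%d" % len(data).bit_length(),
    ]


# Hypothetical 600x400 header followed by padding, not a real image.
fake_gif = b"GIF89a" + struct.pack("<HH", 600, 400) + b"\x00" * 20
print(gif_size_tokens(fake_gif))
```

Bucketing keeps the number of distinct tokens small, so a classifier trained on them sees each one often enough to learn from it.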
>
> > My thesis is this - I rarely get pictures in my email that are not just attachments -
> > virtually all pictures that are embedded into the mail seem to be spam.
>
> Of course that varies.  For example, it's very easy to create embedded
> pictures in Outlook, and even small children know how to do it.
> Worse, their grandparents are required by law to consider such email
> "ham" :-)
>
> > So if there is some token or tag in the email that represents the embedded picture
> > that can be included in the bayesian analysis, this might fix the problem.
>
> This is harder in Outlook because Outlook destroys the original MIME
> structure of the email before SpamBayes sees it.  There are already
> several such tokens generated when the original MIME structure is
> available.  In Outlook, it's most likely you'll get the single
> synthesized token:
>
>     virus:src="cid:
>
> or a simple variation on that, and that's all that remains of the
> embedded GIF.  A single token helps a bit, but not enough.  Do note
> that pump-&-dump scams don't even contain a URL to click on:  they
> want you to buy the stock on the open market, not send them money
> directly.  That also makes it a unique (and uniquely effective) kind
> of spam:  the pitch is /entirely/ buried in the GIF, with no useful
> text (not even a URL) of any kind to tokenize.
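
When the original MIME structure is available, tokens describing an embedded image can be derived from the message parts themselves. The sketch below uses Python's standard `email` package; the token names are hypothetical and are not the tokens SpamBayes actually generates, and the sample message is made up.

```python
from email import message_from_string


def image_part_tokens(raw_message):
    """Emit illustrative tokens for every image part in a MIME message."""
    msg = message_from_string(raw_message)
    tokens = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        tokens.append("image-type:" + part.get_content_type())
        # None when the part has no Content-Disposition header.
        disposition = part.get_content_disposition() or "none"
        tokens.append("image-disposition:" + disposition)
        # A Content-ID usually means the HTML body refers to it via cid:.
        if part.get("Content-ID"):
            tokens.append("image-has-cid")
    return tokens


SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="XX"

--XX
Content-Type: text/html

<img src="cid:pic1">
--XX
Content-Type: image/gif
Content-ID: <pic1>
Content-Transfer-Encoding: base64

R0lGODlhAQABAAAAACw=
--XX--
"""

print(image_part_tokens(SAMPLE))
```

None of this is possible once the mail client has discarded the MIME structure, which is the limitation described above for Outlook.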
>
> > I hope that this suggestion is useful - I certainly fear for the future of
> > Spambayes if this new spam threat is not dealt with....
>
> Don't assume that most spammers are capable of becoming competent :-)
>

_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
