Dear Tim,

Thank you very much for your comprehensive reply, and apologies to the group for posting my email in the wrong place. If I have anything more to write, I'll put it in the forum you mention.

Thanks,
James.

> -----Original Message-----
> From: Tim Peters [mailto:[EMAIL PROTECTED]
> Sent: 03 August 2006 08:26
> To: [EMAIL PROTECTED]
> Cc: [email protected]
> Subject: Re: [spambayes-dev] Spambayes is starting not to work due to
> retaliatory action by spammers
>
> [EMAIL PROTECTED]
> > I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
> > excellent. However, over the last couple of months, it has become
> > compromised by a particular type of spam that I believe, over time, will
> > render Spambayes much less effective unless something is done.
> >
> > I expect you've seen these Spams - at the moment, they are always the
> > stock-market related ones
>
> I've seen a few drug spams using the same techniques, starting in July
> -- but they seemed to dry up quickly.
>
> > but I'm sure once others catch on, they will start
> > to use the same technique. The start of the email is a picture that looks
> > like ordinary text but isn't. All the spam info is in the text. The
> > picture is followed by a whole load of randomly selected words.
>
> You're probably not getting any reaction here because exactly the same
> thing is currently being discussed on the SpamBayes "user" mailing
> list, in this thread:
>
>     Spam in Images
>     http://mail.python.org/pipermail/spambayes/2006-August/date.html
>
> > There are 2 bad things about this:
> >
> > 1. These spams are successfully evading Spambayes in some cases. Firstly
> > the Spam usually reaches the "possible Spam" folder. As a result, I am now
> > spending significant time clearing out the possible spam folder whereas 2 or
> > 3 months ago I wasn't.
>
> Same here, except the time isn't significant. If you don't believe
> me, stop using SpamBayes for a week to rediscover what "significant"
> means ;-)
>
> > Secondly, the odd spam is actually managing to get through as ham. This
> > is the first time this has happened ever.
>
> Not here -- they're very good at scoring Unsure, but haven't seen any
> false negatives yet.
>
> > 2. Because I obviously mark these as Spam, all the randomly generated words
> > in each spam email have their spam likelihood scores increased. The result
> > of this is that over time, the spam-scores for loads of perfectly
> > non-spam-like words are being gradually increased. The more this goes on,
> > the more these "ham words" are being compromised.
>
> I certainly haven't seen any ham pushed into "unsure" because of this,
> and doubt it matters -- it generally doesn't hurt at all to have any
> number of "ham words" show up in a few spam. One of the
> characteristics of the spam you're talking about that /makes/ it
> effective is that it's very good at /not/ repeating gibberish phrases
> across messages. That's exactly why training on the gibberish is
> ineffective at catching future messages of the same ilk. But, OTOH,
> the non-repetition also prevents it from "poisoning" your strong ham
> tokens. They get slightly less hammy, and that doesn't hurt because
> most ham is nowhere near the unsure range.
>
> > I suspect that this is why, to begin with, I only saw a few of these stock market
> > emails, now I'm seeing loads
>
> The only reason you see loads of any kind of spam is that it's making
> a profit for the sender. Pump-&-dump scams violate major securities
> laws, and it's quite possible these scammers will quit before getting
> too greedy (= getting caught).
>
> > and over the last 2 or 3 weeks some have started to come in as ham.
>
> While I haven't seen that, it's inconsistent with your explanation
> above: if your "ham tokens" /were/ being compromised, that makes it
> /less/ likely that a message containing your ham tokens will be scored
> as ham, not more likely.
>
> A more likely explanation is simply "loads": gibberish does have a
> real chance of scoring as ham, and the more attempts are made, the
> more likely one will succeed.
> What they can't do is craft a message
> that scores as ham for all users, or even for most.
>
> > I fear that the long term effect of this will be to spoil spambayes bigtime.
>
> Possibly. People have panicked prematurely before ;-)
>
> > I know that Spambayes has a deep-rooted principle in only using the bayesian
> > algorithm and I wouldn't suggest changing that. However, I am wondering if
> > it might be possible to analyse these messages and include some parts of the
> > hidden text relating to the picture that are not presently included in the
> > bayesian statistics.
>
> See the thread above. Nobody knows a realistic way to extract the
> text from these images (there is no "text" here -- just a large matrix
> of individual pixels, something the human eye/brain system is very
> much better at decoding than programs). OTOH, the images themselves
> probably have many statistical characteristics not shared with
> "legitimate" images, and those can be computed/extracted with finite
> effort.
>
> > My thesis is this - I rarely get pictures in my email that are not just attachments -
> > virtually all pictures that are embedded into the mail seem to be spam.
>
> Of course that varies. For example, it's very easy to create embedded
> pictures in Outlook, and even small children know how to do it.
> Worse, their grandparents are required by law to consider such email
> "ham" :-)
>
> > So if there is some token or tag in the email that represents the embedded picture
> > that can be included in the bayesian analysis, this would might fix the problem.
>
> This is harder in Outlook because Outlook destroys the original MIME
> structure of the email before SpamBayes sees it. There are already
> several such tokens generated when the original MIME structure is
> available. In Outlook, it's most likely you'll get the single
> synthesized token:
>
>     virus:src="cid:
>
> or a simple variation on that, and that's all that remains of the
> embedded GIF.
> A single token helps a bit, but not enough. Do note
> that pump-&-dump scams don't even contain a URL to click on: they
> want you to buy the stock on the open market, not send them money
> directly. That also makes it a unique (and uniquely effective) kind
> of spam: the pitch is /entirely/ buried in the GIF, with no useful
> text (not even a URL) of any kind to tokenize.
>
> > I hope that this suggestion is useful - I certainly fear for the future of
> > Spambayes if this new spam threat is not dealt with....
>
> Don't assume that most spammers are capable of becoming competent :-)
> _______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
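
For anyone following the arithmetic behind two of Tim's points (a strong ham token only gets "slightly less hammy" after turning up in one spam, and "loads" of attempts make an occasional false negative likely), here is a rough sketch. It is not SpamBayes code. The function names are made up for illustration, and the smoothing constants s=0.45 and x=0.5 are assumed values for a Robinson-style per-token estimate, so treat the numbers as indicative only.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed estimate of how spammy a token is (0 = hammy, 1 = spammy).

    Hypothetical helper: s (strength of the prior) and x (prior for an
    unseen token) are assumed values, not read from any real config.
    nspam and nham (messages trained on) must both be positive.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull the raw ratio toward x; the pull weakens as evidence n grows.
    return (s * x + n * p) / (s + n)


def prob_at_least_one_ham(p_single, attempts):
    """Chance that at least one of `attempts` independent gibberish
    messages scores as ham, given a per-message chance p_single."""
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # A word seen in 200 of 1000 ham and in no spam so far.
    before = token_spam_prob(0, 200, nspam=1000, nham=1000)
    # The same word after training on one gibberish spam that used it.
    after = token_spam_prob(1, 200, nspam=1001, nham=1000)
    print(f"before: {before:.4f}  after: {after:.4f}")

    # A 1-in-1000 chance per message, tried 1000 times.
    print(f"at least one gets through: {prob_at_least_one_ham(0.001, 1000):.3f}")
```

Under these assumptions the token's score moves only a little and stays very close to 0, while a small per-message chance of scoring as ham becomes a likely event once enough messages are sent.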
