Dear Spambayes developers,

I've used Spambayes (the Outlook add-in) for 2 or 3 years and it has been excellent. However, over the last couple of months it has become compromised by a particular type of spam that I believe, over time, will render Spambayes much less effective unless something is done.
I expect you've seen these spams. At the moment they are always the stock-market ones, but I'm sure that once other spammers catch on they will start to use the same technique. The start of the email is a picture that looks like ordinary text but isn't. All the spam info is in the picture. The picture is followed by a whole load of randomly selected words.

There are 2 bad things about this:

1. These spams are successfully evading Spambayes in some cases. Firstly, the spam usually reaches the "possible spam" folder. As a result, I am now spending significant time clearing out the possible spam folder, whereas 2 or 3 months ago I wasn't. Secondly, the odd spam is actually managing to get through as ham. This is the first time this has ever happened.

2. Because I obviously mark these as spam, all the randomly generated words in each spam email have their spam likelihood scores increased. The result is that, over time, the spam scores for loads of perfectly non-spam-like words are being gradually increased. The more this goes on, the more these "ham words" are being compromised. I suspect that this is why, to begin with, I only saw a few of these stock-market emails, whereas now I'm seeing loads, and over the last 2 or 3 weeks some have started to come in as ham.

I fear that the long-term effect of this will be to spoil Spambayes big time. I know that Spambayes has a deep-rooted principle of using only the Bayesian algorithm, and I wouldn't suggest changing that. However, I am wondering if it might be possible to analyse these messages and include some parts of the hidden text relating to the picture that are not presently included in the Bayesian statistics. My thesis is this: I rarely get pictures in my email that are not just attachments, and virtually all pictures that are embedded into the mail seem to be spam. So if there is some token or tag in the email that represents the embedded picture and can be included in the Bayesian analysis, this might fix the problem.
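To make the token idea a bit more concrete, here is a rough sketch of what I have in mind. It uses only Python's standard email package and is not based on Spambayes's actual tokenizer; the function names and the token strings ("image:embedded" and so on) are made up by me for illustration, and I am assuming that an image part not explicitly marked as an attachment can be treated as embedded.

```python
from email import message_from_string


def image_tokens(msg):
    """Yield made-up tokens describing the image parts of a parsed message."""
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # get_content_disposition() returns 'inline', 'attachment' or None.
        disposition = part.get_content_disposition()
        # Assumption: anything not explicitly an attachment counts as embedded.
        kind = "attached" if disposition == "attachment" else "embedded"
        yield "image:" + kind
        yield "image-type:" + part.get_content_subtype()


def image_tokens_from_text(raw_text):
    """Parse raw message text and return its image tokens as a list."""
    return list(image_tokens(message_from_string(raw_text)))
```

Tokens like these would simply be trained and scored alongside the ordinary word tokens, so the classifier itself stays purely Bayesian; whether an embedded picture really is a strong spam clue would be learned from each user's own mail rather than hard-coded.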
I hope that this suggestion is useful. I certainly fear for the future of Spambayes if this new spam threat is not dealt with.

Thanks for reading,
James Masters

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev