[Alan Arndt]
> Over the past month or more I have noticed a large increase in the amount of
> spam I receive with the Spam text translated into images. The actual text
> of the message is benign gibberish designed to pass Bayesian filters. They
> have even taken the step of inserting random bits into the image so that no
> two images have the same signature. I've received many messages
> with the same fundamental image.
Yup, and they're learning to avoid other stupid mistakes too; e.g.,
the gibberish /changes/ from one message to the next, and so does the
forged sender address. While randomization isn't new in spam, most
spammers have traditionally done a poor job of it. For example, for a
long time it was very effective to train on the gibberish, since
multiple spammers appeared to use randomization software that produced
the /same/ gibberish time after time. Likewise they tended to forge
the same sender addresses repeatedly. Most spam still does, for that
matter. But some spammers have gotten much smarter.
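
To make the "random bits" point concrete, here is a minimal sketch of why
an exact-match signature stops working once the image is perturbed. It
assumes the signature is a plain SHA-256 digest of the raw bytes, which is
an illustration only, not something SpamBayes computes; the byte string
stands in for real image data.

```python
import hashlib

def signature(data: bytes) -> str:
    """Exact-match signature: hex SHA-256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for image bytes (assumption: any byte string behaves the same).
original = bytes(range(256)) * 4

# Flip the lowest bit of a single byte, as a randomizer might.
tweaked = bytearray(original)
tweaked[100] ^= 0x01
tweaked = bytes(tweaked)

same_signature = signature(original) == signature(tweaked)
differing_bytes = sum(a != b for a, b in zip(original, tweaked))
print(same_signature, differing_bytes)  # prints: False 1
```

One changed byte out of 1024 is enough to give a different digest, so a
blocklist of known image signatures never matches twice.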
> I haven't thought of a decent way to filter these types of things.
Me neither. They're never false negatives for me, but I reliably get
a few unsures every day from what appears to be the same pump-and-dump
scam-spam source (these are messages hard-selling specific penny
stocks -- the scammer hopes to drive up the market price ("pump") by
stimulating demand, and then sell quick at a profit ("dump")).
It's very much in the spirit of SpamBayes to generate tokens for what
the user /sees/, but in these cases we have no idea what the user sees
(except for the gibberish text).
BTW, it's typical of pump-and-dump scams that they're not trying to
extract money /directly/ from you (they're trying to get you to buy a
stock on the open market), so we don't even get a URL or mailing
address to tokenize.
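
Short of reading the image, about all that is left to tokenize is the
message structure. The following is a rough sketch of that idea using only
Python's standard email package. The function name and the token spellings
("image-type:", "image-size:", "image-count:") are hypothetical and are not
taken from the SpamBayes tokenizer; whether such tokens would actually help
is untested here.

```python
from email.message import EmailMessage

def image_structure_tokens(msg):
    """Yield hypothetical tokens describing image parts, without decoding pixels."""
    image_count = 0
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        image_count += 1
        yield "image-type:" + part.get_content_subtype()
        payload = part.get_payload(decode=True) or b""
        # Bucket the size by powers of two, so a few bytes of random
        # padding rarely change the token.
        bucket = len(payload).bit_length() - 1 if payload else 0
        yield "image-size:2**%d" % bucket
    yield "image-count:%d" % image_count

# Build a toy message: gibberish text plus a 5000-byte fake GIF.
msg = EmailMessage()
msg.set_content("benign gibberish text")
msg.add_attachment(b"\x00" * 5000, maintype="image", subtype="gif")

tokens = list(image_structure_tokens(msg))
print(tokens)  # ['image-type:gif', 'image-size:2**12', 'image-count:1']
```

Tokens like these say nothing about what the user sees, so at best they
are weak clues that get stronger only if legitimate mail rarely looks the
same way.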
> I hope someone else can and that it can get implemented into SpamBayes.
It's discussed here (maybe more so on spambayes-dev, the related
developers' mailing list) regularly, but AFAICT extracting readable
text from images is a complicated and expensive job. If someone finds
a programmatic way to do it cheaply and with reasonable accuracy, I'm
sure SB could make excellent use of it.