Re: On bichromatic GIF stock spam
Loren Wilton wrote:

>> No, I was thinking of multipart/alternative where one of the alternative streams is nothing but images. That doesn't strike me as legitimate. Can anyone think of a scenario where images *are* a legitimate alternative representation of text?
>
> Doesn't really help. The actual mails have a tiny gibberish text part, and a tiny to medium html part that has a few words of gibberish (usually the same as the text part) and the rest is calls to images. So there really is an html part.
>
> I did a trivial test for alternative and gif, and it didn't pan out very well. Will need some additional conditions to make it more usable.
>
> Loren

What Perl modules are there that can process GIF files (decode them, perform certain inspections and histogram analysis, etc.)? I'd like to throw something together...

-Philip
Re: On bichromatic GIF stock spam
> No, I was thinking of multipart/alternative where one of the alternative streams is nothing but images. That doesn't strike me as legitimate. Can anyone think of a scenario where images *are* a legitimate alternative representation of text?

Doesn't really help. The actual mails have a tiny gibberish text part, and a tiny to medium html part that has a few words of gibberish (usually the same as the text part) and the rest is calls to images. So there really is an html part.

I did a trivial test for alternative and gif, and it didn't pan out very well. Will need some additional conditions to make it more usable.

Loren
Re: On bichromatic GIF stock spam
On Sat, 24 Jun 2006, Philip Prindeville wrote:

> [...] the text and the images. The spammers send multipart/alternative because they want the text/plain section to confuse the Bayes filters, since they know it won't be rendered...

It seems to me that right there is the spam sign you should be looking for, then, and save all the heavy-duty mathematical analysis of the images themselves.

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 ...every time I sit down in front of a Windows machine I feel as if the computer is just a place for the manufacturers to put their advertising.
 -- fwadling on Y! SCOX
Re: On bichromatic GIF stock spam
John D. Hardin wrote:

> On Sat, 24 Jun 2006, Philip Prindeville wrote:
>
>> [...] the text and the images. The spammers send multipart/alternative because they want the text/plain section to confuse the Bayes filters, since they know it won't be rendered...
>
> It seems to me that right there is the spam sign you should be looking for, then, and save all the heavy-duty mathematical analysis of the images themselves.

A lot of mailers generate multipart/alternative legitimately, though if you ask me sending both text/plain and text/html is bogus and no one should configure their mailer to do that.

-Philip
Re: On bichromatic GIF stock spam
On Sun, 25 Jun 2006, Philip Prindeville wrote:

> John D. Hardin wrote:
>
>> On Sat, 24 Jun 2006, Philip Prindeville wrote:
>>
>>> The spammers send multipart/alternative because they want the text/plain section to confuse the Bayes filters, since they know it won't be rendered...
>>
>> It seems to me that right there is the spam sign you should be looking for, then, and save all the heavy-duty mathematical analysis of the images themselves.
>
> A lot of mailers generate multipart/alternative legitimately,

No, I was thinking of multipart/alternative where one of the alternative streams is nothing but images. That doesn't strike me as legitimate.

Can anyone think of a scenario where images *are* a legitimate alternative representation of text?

--
 John Hardin
Re: On bichromatic GIF stock spam
On Sun, 25 Jun 2006, John D. Hardin wrote:

> On Sun, 25 Jun 2006, Philip Prindeville wrote:
>
>> John D. Hardin wrote:
>>
>>> On Sat, 24 Jun 2006, Philip Prindeville wrote:
>>>
>>>> The spammers send multipart/alternative because they want the text/plain section to confuse the Bayes filters, since they know it won't be rendered...

[snip..]

> No, I was thinking of multipart/alternative where one of the alternative streams is nothing but images. That doesn't strike me as legitimate. Can anyone think of a scenario where images *are* a legitimate alternative representation of text?

Sounds good in theory but difficult to implement. The HTML part is not empty: it contains comments, font control junk, and 'glue' to stitch together those multiple fragment GIFs. So you'd have to run it through an HTML parsing engine (à la lynx or pine) to determine that the textual components render down to nothing.

Here's what works for me: I wrote a collection of custom rules that recognizes that particular HTML structure and gave it a small but sufficient score. (Sufficient in this case is enough to make up the difference between my spam threshold and a BAYES_99 score, but not so large as to cause FPs for legit messages that also have that structure.) So that MIME structure + BAYES_99 == spam.

Then, by keeping Bayes reasonably well fed, those things get hit pretty reliably. That way the network tests (RBLs, Razor, DCC, etc.) are just icing on the cake.

Dave

--
Dave Funk    University of Iowa
dbfunk (at) engineering.uiowa.edu    College of Engineering
319/335-5751   FAX: 319/384-0549    1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include std_disclaimer.h
Better is not better, 'standard' is better. B{
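The scoring arithmetic Dave describes can be sketched as follows. This is only an illustration: the function name, the margin, and the numbers (a threshold of 5.0 and a BAYES_99 score of 3.5) are assumptions chosen for the example, not values taken from his setup.

```python
def structure_rule_score(spam_threshold: float, bayes_99_score: float,
                         margin: float = 0.1) -> float:
    """Return a score just large enough that the structure rule plus
    BAYES_99 reaches the spam threshold (hypothetical helper)."""
    gap = max(spam_threshold - bayes_99_score, 0.0)
    return round(gap + margin, 3)


# Hypothetical numbers, for illustration only.
threshold = 5.0
bayes_99 = 3.5
rule = structure_rule_score(threshold, bayes_99)

# The MIME structure rule plus BAYES_99 crosses the threshold...
assert bayes_99 + rule >= threshold
# ...but the structure rule alone stays well below it, which is what
# keeps legitimate mail with the same structure from being flagged.
assert rule < threshold
```

The point of keeping the score near the gap rather than near the threshold is that the rule only matters when Bayes already agrees.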
Re: On bichromatic GIF stock spam
On Sun, 25 Jun 2006, David B Funk wrote:

> On Sun, 25 Jun 2006, John D. Hardin wrote:
>
>> No, I was thinking of multipart/alternative where one of the alternative streams is nothing but images. That doesn't strike me as legitimate. Can anyone think of a scenario where images *are* a legitimate alternative representation of text?
>
> Sounds good in theory but difficult to implement. The HTML part is not empty, contains comments, font control junk, and 'glue' to stitch together those multiple fragment gifs.

D'oh! I forgot about the HTML glue...

So it degenerates to a standard multipart/alternative text + html message. Rats.

--
 John Hardin
Re: On bichromatic GIF stock spam
On Sun, Jun 25, 2006 at 12:49:17PM -0600, Philip Prindeville wrote:

> No, I was thinking of multipart/alternative where one of the alternative streams is nothing but images. That doesn't strike me as legitimate. Can anyone think of a scenario where images *are* a legitimate alternative representation of text?

Sure, it's the same idea as having PDF as an alternate representation. A picture is worth a thousand words and all that.

However, with that said, the question/answer isn't actually telling you anything useful in this situation. You want to know whether or not m/a parents w/ non-text children is a useful spam sign... Well, let's instrument it and see... run the spam v. ham numbers. It's not bad (taking into account multipart/related children as well):

1.909 2.3080 0.1.000 1.000.01 T_MULTIPART_ALT_NON_TEXT

--
Randomly Generated Tagline:
I'd rather get it right than get it done on Tuesday. - J. Michael Straczynski
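A minimal sketch of the structural test discussed above (a multipart/alternative parent with a non-text child), using Python's standard `email` package. It is not the T_MULTIPART_ALT_NON_TEXT rule itself. The function name is made up, and treating any nested multipart child (such as multipart/related) as acceptable is an assumption about what "taking into account" means here.

```python
from email.message import Message
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def alt_has_non_text_child(msg: Message) -> bool:
    """True if some multipart/alternative part has a direct child that is
    neither text/* nor a nested multipart container (assumed acceptable)."""
    for part in msg.walk():
        if part.get_content_type() != "multipart/alternative":
            continue
        if not part.is_multipart():
            continue  # malformed: declared multipart but has no sub-parts
        for child in part.get_payload():
            if child.get_content_maintype() not in ("text", "multipart"):
                return True
    return False


# An ordinary text + HTML alternative message.
ham = MIMEMultipart("alternative")
ham.attach(MIMEText("hello", "plain"))
ham.attach(MIMEText("<p>hello</p>", "html"))

# An alternative message where one "alternative" is just an image.
suspect = MIMEMultipart("alternative")
suspect.attach(MIMEText("gibberish", "plain"))
suspect.attach(MIMEImage(b"GIF89a", _subtype="gif"))

print(alt_has_non_text_child(ham), alt_has_non_text_child(suspect))
```

As noted elsewhere in the thread, the spam in question hides its images behind an HTML part, so a check like this would only catch the simpler variant.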
RE: On bichromatic GIF stock spam
-----Original Message-----
From: Philip Prindeville [mailto:[EMAIL PROTECTED]
Sent: Saturday, June 24, 2006 2:10 PM
To: users@spamassassin.apache.org
Subject: On bichromatic GIF stock spam

> I get a lot of spam that looks like:
>
> http://pastebin.com/729105
>
> on the alsa-devel mailing list, amongst others... And noticed the following.
>
> If you decompress the GIF file and decode it into a pixmap image, then do a color histogram of the image, you notice two things immediately.

Or feed it through character recognition software and then replace the gif attachment with a plain text attachment and reinject it back into SA.
Re: On bichromatic GIF stock spam
Michael Scheidell wrote:

> -----Original Message-----
> From: Philip Prindeville [mailto:[EMAIL PROTECTED]
> Sent: Saturday, June 24, 2006 2:10 PM
> To: users@spamassassin.apache.org
> Subject: On bichromatic GIF stock spam
>
>> I get a lot of spam that looks like:
>>
>> http://pastebin.com/729105
>>
>> on the alsa-devel mailing list, amongst others... And noticed the following.
>>
>> If you decompress the GIF file and decode it into a pixmap image, then do a color histogram of the image, you notice two things immediately.
>
> Or feed it through character recognition software and then replace the gif attachment with a plain text attachment and reinject it back into SA.

Well, yeah, and that's already been discussed... I wanted an alternative to that that might be less CPU intensive.

-Philip
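One cheap inspection, far lighter than OCR, is to read a GIF's global color table straight from its header without decompressing any pixel data. The sketch below assumes the standard GIF87a/GIF89a header layout; the function name is made up. It ignores local color tables and cannot tell which palette entries are actually used, so it is only a first-pass check and not the histogram described in the thread.

```python
import struct


def gif_global_palette(data: bytes) -> list:
    """Return the global color table of a GIF as a list of (r, g, b)
    tuples, or an empty list if the file has no global color table."""
    if len(data) < 13 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Logical screen descriptor: width, height (little-endian), packed flags.
    _width, _height, packed = struct.unpack_from("<HHB", data, 6)
    if not packed & 0x80:          # bit 7: global color table present
        return []
    entries = 2 << (packed & 0x07)  # low 3 bits N give 2^(N+1) entries
    table = data[13:13 + 3 * entries]
    if len(table) < 3 * entries:
        raise ValueError("truncated color table")
    return [tuple(table[i:i + 3]) for i in range(0, len(table), 3)]


# A hand-built header with a two-entry palette: black and a pastel pink.
sample = (b"GIF89a" + struct.pack("<HHBBB", 1, 1, 0x80, 0, 0)
          + bytes([0, 0, 0, 255, 192, 203]))
print(gif_global_palette(sample))
```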
Re: On bichromatic GIF stock spam
> If, after excluding black, we find that 100% of the color map is that nasty pastel pink or pastel lime green (etc) then it's a spam and we toss it.
>
> Sound reasonable?

I was thinking about this the other day. I think the concept is reasonable, but as stated doesn't go far enough, and would be trivial to bypass.

I think that someone first needs to come up with either a formula or a list of RGB triples that are visually indistinguishable or some such. (I suspect this has been done several times now and the research should exist in the wild.) This can then be used as a fuzz to group colors that are very close down into a common bucket. As it is, trivial 1-bit variations on colors would defeat the simple scheme.

It might also be interesting to accumulate a) total area of each color and b) largest rectangle (or other easily detected shape) of each color. The first case we would have from the pixel counts. The second case could be used to detect large areas of fill color. This might help classify a text message vs a map of the world or a picture of downtown Cameroon.

It also might be interesting to accumulate statistics on the common color distributions for 10K or so legit images sent through email, possibly along with some sort of indication of purpose: picture of me, picture of my dog, billboard I saw, kids at Christmas, Hallmark greeting card, etc.

With that info the color distribution might be able to help classify the image fairly cheaply. I don't know how much of the above would be absolutely necessary, but I suspect at least some of it is. Still, this is a fairly trivial sort of thing to have to accumulate. Especially since all spam (at least currently) uses gifs, which a blind man can decode with a hair comb - no fancy software required.

Loren
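The "fuzz" and per-color area ideas above can be sketched with a simple quantization: drop the low bits of each channel so near-identical colors share a bucket, then count pixels per bucket. The function names and the choice of dropping 3 bits are assumptions for illustration, and the pixel list stands in for an already decoded image.

```python
from collections import Counter


def bucket(rgb, bits_dropped=3):
    """Map an (r, g, b) triple to a coarse bucket by dropping low bits."""
    return tuple(channel >> bits_dropped for channel in rgb)


def color_areas(pixels, bits_dropped=3):
    """Count pixels per coarse color bucket (total area of each color)."""
    return Counter(bucket(p, bits_dropped) for p in pixels)


# Stand-in for decoded image data: two pinks differing by one step per
# channel, plus some black text pixels.
pixels = [(255, 192, 203)] * 90 + [(254, 193, 202)] * 10 + [(0, 0, 0)] * 20

exact = Counter(pixels)      # three distinct colors
areas = color_areas(pixels)  # the two pinks collapse into one bucket
print(len(exact), len(areas), areas.most_common(1))
```

Fixed buckets have edge effects: channel values 7 and 8 differ by one step but land in different buckets when 3 bits are dropped. Grouping by an actual color distance avoids that at some extra cost.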
Re: On bichromatic GIF stock spam
Loren Wilton wrote:

>> If, after excluding black, we find that 100% of the color map is that nasty pastel pink or pastel lime green (etc) then it's a spam and we toss it.
>>
>> Sound reasonable?
>
> I was thinking about this the other day. I think the concept is reasonable, but as stated doesn't go far enough, and would be trivial to bypass.
>
> I think that someone first needs to come up with either a formula or a list of RGB triples that are visually indistinguishable or some such. (I suspect this has been done several times now and the research should exist in the wild.) This can then be used as a fuzz to group colors that are very close down into a common bucket. As it is, trivial 1-bit variations on colors would defeat the simple scheme.

Shh they might be listening... ;-)

Seriously, though, how many people send out 2-color GIFs (besides BW scans of Dilbert and faxes) as email?

The formula is:

    sqrt((r1 - r2)^2 + (g1 - g2)^2 + (b1 - b2)^2)

to generate the RGB vector distance between two pixels.

> It might also be interesting to accumulate a) total area of each color and b) largest rectangle (or other easily detected shape) of each color. The first case we would have from the pixel counts. The second case could be used to detect large areas of fill color. This might help classify a text message vs a map of the world or a picture of downtown Cameroon.

Why? What does downtown Cameroon look like? ;-)

> It also might be interesting to accumulate statistics on the common color distributions for 10K or so legit images sent through email, possibly along with some sort of indication of purpose: picture of me, picture of my dog, billboard I saw, kids at Christmas, Hallmark greeting card, etc.

But those aren't sent as multipart/alternative... because you want to see both the text and the images. The spammers send multipart/alternative because they want the text/plain section to confuse the Bayes filters, since they know it won't be rendered...

> With that info the color distribution might be able to help classify the image fairly cheaply. I don't know how much of the above would be absolutely necessary, but I suspect at least some of it is. Still, this is a fairly trivial sort of thing to have to accumulate. Especially since all spam (at least currently) uses gifs, which a blind man can decode with a hair comb - no fancy software required.
>
> Loren

Yup. Exactly.

-Philip
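The RGB distance formula quoted above, and its use as a fuzz for grouping near-identical colors, might look like the sketch below. The greedy grouping, the tolerance of 8.0, and the function names are assumptions for illustration. Euclidean distance in RGB is also only a rough proxy for how different two colors look to a person.

```python
import math


def rgb_distance(c1, c2):
    """Euclidean distance between two (r, g, b) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def group_colors(colors, tolerance=8.0):
    """Greedy grouping: a color starts a new group unless it lies within
    `tolerance` of an existing group's representative."""
    representatives = []
    for color in colors:
        if not any(rgb_distance(color, rep) <= tolerance
                   for rep in representatives):
            representatives.append(color)
    return representatives


def looks_bichromatic(colors, tolerance=8.0):
    """True if the colors collapse to at most two groups."""
    return len(group_colors(colors, tolerance)) <= 2


# Black text on pink, with 1-step variations meant to dodge exact matching.
palette = [(0, 0, 0), (255, 192, 203), (254, 193, 202), (1, 0, 0)]
print(group_colors(palette), looks_bichromatic(palette))
```

The result depends on input order and on the tolerance, so this is best treated as a starting point for experiments against real spam and ham images.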