Re: On bichromatic GIF stock spam

2006-07-01 Thread Philip Prindeville
Loren Wilton wrote:

  No, I was thinking of multipart/alternative where one of the
  alternative streams is nothing but images. That doesn't strike me as
  legitimate. Can anyone think of a scenario where images *are* a
  legitimate alternative representation of text?

 Doesn't really help.  The actual mails have a tiny gibberish text part, and
 a tiny to medium html part that has a few words of gibberish (usually the
 same as the text part) and the rest is calls to images.  So there really is
 an html part.

 I did a trivial test for alternative and gif, and it didn't pan out very
 well.  Will need some additional conditions to make it more usable.

 Loren

What Perl modules are there that can process GIF files (decode them,
perform certain inspections and histogram analysis, etc.)?

I'd like to throw something together...

-Philip



Re: On bichromatic GIF stock spam

2006-06-26 Thread Loren Wilton
 No, I was thinking of multipart/alternative where one of the
 alternative streams is nothing but images. That doesn't strike me as
 legitimate. Can anyone think of a scenario where images *are* a
 legitimate alternative representation of text?

Doesn't really help.  The actual mails have a tiny gibberish text part, and
a tiny to medium html part that has a few words of gibberish (usually the
same as the text part) and the rest is calls to images.  So there really is
an html part.

I did a trivial test for alternative and gif, and it didn't pan out very
well.  Will need some additional conditions to make it more usable.

Loren



Re: On bichromatic GIF stock spam

2006-06-25 Thread John D. Hardin
On Sat, 24 Jun 2006, Philip Prindeville wrote:

 the text and the images.  The spammers send multipart/alternative
 because they want the text/plain section to confuse the Bayes
 filters, since they know it won't be rendered...

It seems to me that right there is the spam sign you should be looking
for, then, and save all the heavy-duty mathematical analysis of the
images themselves.

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...every time I sit down in front of a Windows machine I feel as
  if the computer is just a place for the manufacturers to put their
  advertising.  -- fwadling on Y! SCOX
--



Re: On bichromatic GIF stock spam

2006-06-25 Thread Philip Prindeville
John D. Hardin wrote:

 On Sat, 24 Jun 2006, Philip Prindeville wrote:

  the text and the images.  The spammers send multipart/alternative
  because they want the text/plain section to confuse the Bayes
  filters, since they know it won't be rendered...

 It seems to me that right there is the spam sign you should be looking
 for, then, and save all the heavy-duty mathematical analysis of the
 images themselves.

A lot of mailers generate multipart/alternative legitimately, though if you
ask me sending both text/plain and text/html is bogus and no one should
configure their mailer to do that.

-Philip



Re: On bichromatic GIF stock spam

2006-06-25 Thread John D. Hardin
On Sun, 25 Jun 2006, Philip Prindeville wrote:

 John D. Hardin wrote:
 
 On Sat, 24 Jun 2006, Philip Prindeville wrote:
 
 The spammers send multipart/alternative
 because they want the text/plain section to confuse the Bayes
 filters, since they know it won't be rendered...
 
 It seems to me that right there is the spam sign you should be looking
 for, then, and save all the heavy-duty mathematical analysis of the
 images themselves.
 
 A lot of mailers generate multipart/alternative legitimately,

No, I was thinking of multipart/alternative where one of the
alternative streams is nothing but images. That doesn't strike me as
legitimate. Can anyone think of a scenario where images *are* a
legitimate alternative representation of text?




Re: On bichromatic GIF stock spam

2006-06-25 Thread David B Funk
On Sun, 25 Jun 2006, John D. Hardin wrote:

 On Sun, 25 Jun 2006, Philip Prindeville wrote:

  John D. Hardin wrote:
 
  On Sat, 24 Jun 2006, Philip Prindeville wrote:
  
  The spammers send multipart/alternative
  because they want the text/plain section to confuse the Bayes
  filters, since they know it won't be rendered...
[snip..]

 No, I was thinking of multipart/alternative where one of the
 alternative streams is nothing but images. That doesn't strike me as
 legitimate. Can anyone think of a scenario where images *are* a
 legitimate alternative representation of text?

Sounds good in theory but difficult to implement. The HTML part is not
empty, contains comments, font control junk, and 'glue' to stitch together
those multiple fragment gifs. So you'd have to run it through an HTML
parsing engine (a la lynx or pine) to determine that the textual
components render down to nothing.
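
To make the "render it down" idea concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is not what lynx or pine do internally; it just counts the words a reader would see versus the number of img tags. The names VisibleText and summarize_html are made up for illustration, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the words a reader would see and count <img> tags."""

    # Assumption: content inside these elements is never displayed as body text.
    SKIP = {"style", "script", "title", "head"}

    def __init__(self):
        super().__init__()
        self.words = []
        self.images = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are ignored for free.
        if not self._skip_depth:
            self.words.extend(data.split())


def summarize_html(html):
    """Return (visible_word_count, image_count) for an HTML part."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words), parser.images
```

A part that comes back with a handful of words and several images would be a candidate for a closer look; where to draw that line is a tuning question this sketch does not answer.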

Here's what works for me: I wrote a collection of custom rules that
recognize that particular HTML structure and gave it a small but
sufficient score. (Sufficient in this case is enough to make up the
difference between my spam threshold and a BAYES_99 score, but not so
large as to cause FPs for legit messages that also have that structure.)
So that MIME structure + BAYES_99 == spam.
Then by keeping Bayes reasonably well fed those things get hit
pretty reliably. That way network tests (RBLs, Razor, DCC, etc.) are
just icing on the cake.

Dave

-- 
Dave Funk                                  University of Iowa
dbfunk (at) engineering.uiowa.edu          College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{


Re: On bichromatic GIF stock spam

2006-06-25 Thread John D. Hardin
On Sun, 25 Jun 2006, David B Funk wrote:

 On Sun, 25 Jun 2006, John D. Hardin wrote:
 
  No, I was thinking of multipart/alternative where one of the
  alternative streams is nothing but images. That doesn't strike me as
  legitimate. Can anyone think of a scenario where images *are* a
  legitimate alternative representation of text?
 
 Sounds good in theory but difficult to implement. The HTML part is not
 empty, contains comments, font control junk, and 'glue' to stitch together
 those multiple fragment gifs.

D'oh! I forgot about the HTML glue... So it degenerates to a
standard multipart/alternative text + html message. Rats.




Re: On bichromatic GIF stock spam

2006-06-25 Thread Theo Van Dinter
On Sun, Jun 25, 2006 at 12:49:17PM -0600, Philip Prindeville wrote:
 No, I was thinking of multipart/alternative where one of the
 alternative streams is nothing but images. That doesn't strike me as
 legitimate. Can anyone think of a scenario where images *are* a
 legitimate alternative representation of text?

Sure, it's the same idea as having PDF as an alternate representation.  A
picture is worth a thousand words and all that.

However, with that said, the question/answer isn't actually telling
you anything useful in this situation.  You want to know whether or not
multipart/alternative parents with non-text children are a useful spam
sign...

 Well, let's instrument it and see... run the spam v. ham numbers.

It's not bad (taking into account multipart/related children as well):

  1.909   2.3080   0.1.000   1.000.01  T_MULTIPART_ALT_NON_TEXT
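
For anyone who wants to try the same structural check outside SA, here is a rough Python sketch using the standard email package. It is only a guess at the shape of the test (a multipart/alternative parent with a non-text descendant, looking through nested multipart/related containers); it is not the definition of T_MULTIPART_ALT_NON_TEXT, which isn't shown here, and the function name alt_has_non_text_child is made up.

```python
from email import message_from_string


def alt_has_non_text_child(msg):
    """True if some multipart/alternative part contains a non-text leaf.

    Nested multipart/* containers (e.g. multipart/related) are looked
    through rather than counted as non-text themselves.
    """
    for part in msg.walk():
        if part.get_content_type() != "multipart/alternative":
            continue
        for child in part.walk():
            if child is part:
                continue  # walk() yields the container itself first
            if child.get_content_maintype() not in ("text", "multipart"):
                return True
    return False
```

Usage would be along the lines of alt_has_non_text_child(message_from_string(raw)), where raw is the full message source. An ordinary text/plain + text/html alternative comes back False.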

-- 
Randomly Generated Tagline:
I'd rather get it right than get it done on Tuesday.
   - J. Michael Straczynski




RE: On bichromatic GIF stock spam

2006-06-24 Thread Michael Scheidell

 -Original Message-
 From: Philip Prindeville [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, June 24, 2006 2:10 PM
 To: users@spamassassin.apache.org
 Subject: On bichromatic GIF stock spam
 
 
 I get a lot of spam that looks like:
 
 http://pastebin.com/729105
 
 on the alsa-devel mailing list, amongst others...  And 
 noticed the following.
 
 If you decompress the GIF file and decode it into a pixmap 
 image, then do a color histogram of the image, you notice two 
 things immediately.

Or feed it through character recognition software and then replace the
gif attachment with a plain text attachment and reinject it back into
SA.



Re: On bichromatic GIF stock spam

2006-06-24 Thread Philip Prindeville
Michael Scheidell wrote:

  -Original Message-
  From: Philip Prindeville [mailto:[EMAIL PROTECTED]
  Sent: Saturday, June 24, 2006 2:10 PM
  To: users@spamassassin.apache.org
  Subject: On bichromatic GIF stock spam

  I get a lot of spam that looks like:

  http://pastebin.com/729105

  on the alsa-devel mailing list, amongst others...  And
  noticed the following.

  If you decompress the GIF file and decode it into a pixmap
  image, then do a color histogram of the image, you notice two
  things immediately.

 Or feed it through character recognition software and then replace the
 gif attachment with a plain text attachment and reinject it back into
 SA.


Well, yeah, and that's already been discussed...  I wanted an alternative
to that that might be less CPU intensive.

-Philip



Re: On bichromatic GIF stock spam

2006-06-24 Thread Loren Wilton
 If, after excluding black, we find that 100% of the color map is that
 nasty pastel pink or pastel lime green (etc) then it's a spam and we
 toss it.

 Sound reasonable?

I was thinking about this the other day.  I think the concept is reasonable,
but as stated doesn't go far enough, and would be trivial to bypass.

I think that someone first needs to come up with either a formula or a list
of RGB triples that are visually indistinguishable or some such.  (I
suspect this has been done several times now and the research should exist
in the wild.)

This can then be used as a fuzz to group colors that are very close down
into a common bucket.  As it is, trivial 1-bit variations on colors would
defeat the simple scheme.

It might also be interesting to accumulate a) total area of each color and
b) largest rectangle (or other easily detected shape) of each color.  The
first case we would have from the pixel counts.  The second case could be
used to detect large areas of fill color.  This might help classify a text
message vs a map of the world or a picture of downtown Cameroon.
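
As a rough sketch of those two statistics in Python, assuming the image has already been decoded into rows of (r, g, b) tuples: the per-color area is just a pixel count, and for the second I've swapped in the longest horizontal run of each color as a much cheaper stand-in for a true largest-rectangle search. The function names are made up for illustration.

```python
from collections import Counter


def color_areas(rows):
    """Pixel count per color; rows is a list of rows of (r, g, b) tuples."""
    return Counter(px for row in rows for px in row)


def longest_runs(rows):
    """Longest horizontal run of each color.

    This is a cheap proxy for 'largest rectangle of each color', not an
    actual rectangle search.
    """
    best = {}
    for row in rows:
        run_color, run_len = None, 0
        for px in row:
            if px == run_color:
                run_len += 1
            else:
                run_color, run_len = px, 1
            if run_len > best.get(px, 0):
                best[px] = run_len
    return best
```

Dividing each count from color_areas by the total pixel count gives the share of the image each color covers, which is the number the "100% of the color map" test would look at after dropping black.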

It also might be interesting to accumulate statistics on the common color
distributions for 10K or so legit images sent through email, possibly along
with some sort of indication of purpose: picture of me, picture of my
dog, billboard I saw, kids at Christmas, Hallmark greeting card, etc.

With that info the color distribution might be able to help classify the
image fairly cheaply.

I don't know how much of the above would be absolutely necessary, but I
suspect at least some of it is.  Still, this is a fairly trivial sort of
thing to have to accumulate.  Especially since all spam (at least currently)
uses gifs, which a blind man can decode with a hair comb - no fancy software
required.
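
To show how little is needed to get at the color map, here is a minimal Python sketch that pulls the global color table out of a GIF using nothing but struct. It follows the GIF header layout (6-byte signature, 7-byte logical screen descriptor, then the table), but it only reads the palette: local color tables are ignored, and per-pixel counts would still need an LZW decoder. The function name gif_global_palette is made up.

```python
import struct


def gif_global_palette(data):
    """Return the global color table of a GIF as a list of (r, g, b) tuples."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    # Logical screen descriptor: width, height, packed flags,
    # background color index, pixel aspect ratio.
    width, height, packed, bg_index, aspect = struct.unpack("<HHBBB", data[6:13])
    if not packed & 0x80:
        return []  # no global color table present
    size = 2 << (packed & 0x07)  # table holds 2^(N+1) entries
    table = data[13:13 + 3 * size]
    if len(table) < 3 * size:
        raise ValueError("truncated color table")
    return [tuple(table[i:i + 3]) for i in range(0, len(table), 3)]
```

Note that a two-entry palette only says the image can use at most two colors; it says nothing about how many pixels each one covers.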

Loren



Re: On bichromatic GIF stock spam

2006-06-24 Thread Philip Prindeville
Loren Wilton wrote:

  If, after excluding black, we find that 100% of the color map is that
  nasty pastel pink or pastel lime green (etc) then it's a spam and we
  toss it.

  Sound reasonable?

 I was thinking about this the other day.  I think the concept is reasonable,
 but as stated doesn't go far enough, and would be trivial to bypass.

 I think that someone first needs to come up with either a formula or a list
 of RGB triples that are visually indistinguishable or some such.  (I
 suspect this has been done several times now and the research should exist
 in the wild.)

 This can then be used as a fuzz to group colors that are very close down
 into a common bucket.  As it is, trivial 1-bit variations on colors would
 defeat the simple scheme.


Shh they might be listening... ;-)

Seriously, though, how many people send out 2-color GIFs (besides
B&W scans of Dilbert and faxes) as email?

The formula is:

sqrt((r1 - r2)^2 + (g1 - g2)^2 + (b1 - b2)^2)

to generate the RGB vector distance between two pixels.
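
In Python that distance, plus a simple way to use it as the "fuzz", might look like the sketch below. The greedy bucketing and the threshold of 12.0 are my own assumptions for illustration, not a tested value, and plain RGB distance is only a rough proxy for what the eye can or can't tell apart.

```python
import math


def rgb_distance(c1, c2):
    """Euclidean distance between two (r, g, b) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def bucket_colors(colors, threshold=12.0):
    """Greedily group colors that lie within threshold of a bucket's first member.

    Returns a list of (representative, members) pairs. The threshold is a
    hypothetical value and would need tuning against real images.
    """
    buckets = []
    for color in colors:
        for rep, members in buckets:
            if rgb_distance(rep, color) <= threshold:
                members.append(color)
                break
        else:
            buckets.append((color, [color]))
    return buckets
```

With this, a palette of pastel pink plus a few 1-bit variations of it collapses into a single bucket, so the two-color test still fires.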


 It might also be interesting to accumulate a) total area of each color and
 b) largest rectangle (or other easily detected shape) of each color.  The
 first case we would have from the pixel counts.  The second case could be
 used to detect large areas of fill color.  This might help classify a text
 message vs a map of the world or a picture of downtown Cameroon.


Why?  What does downtown Cameroon look like?  ;-)

 It also might be interesting to accumulate statistics on the common color
 distributions for 10K or so legit images sent through email, possibly along
 with some sort of indication of purpose: picture of me, picture of my
 dog, billboard I saw, kids at Christmas, Hallmark greeting card, etc.


But those aren't sent as multipart/alternative... because you want to
see both the text and the images.  The spammers send multipart/alternative
because they want the text/plain section to confuse the Bayes filters,
since they know it won't be rendered...

 With that info the color distribution might be able to help classify the
 image fairly cheaply.

 I don't know how much of the above would be absolutely necessary, but I
 suspect at least some of it is.  Still, this is a fairly trivial sort of
 thing to have to accumulate.  Especially since all spam (at least currently)
 uses gifs, which a blind man can decode with a hair comb - no fancy software
 required.

 Loren



Yup.  Exactly.

-Philip