Loren Wilton wrote:

>>If, after excluding black, we find that 100% of the color map is that
>>nasty pastel pink or pastel lime green (etc) then it's a spam and we
>>toss it.
>>
>>Sound reasonable?
>>    
>>
>
>I was thinking about this the other day.  I think the concept is reasonable,
>but as stated doesn't go far enough, and would be trivial to bypass.
>
>I think that someone first needs to come up with either a formula or a list
>of RGB triples that are "visually indistinguishable" or some such.  (I
>suspect this has been done several times now and the research should exist
>in the wild.)
>
>This can then be used as a fuzz to group colors that are very close down
>into a common bucket.  As it is, trivial 1-bit variations on colors would
>defeat the simple scheme.
>  
>

Shhhhhh.... they might be listening... ;-)

Seriously, though, how many people send out 2-color GIFs (besides
B&W scans of Dilbert and faxes) as email?

The formula is:

sqrt((r1 - r2)^2 + (g1 - g2)^2 + (b1 - b2)^2)

to generate the RGB vector distance between two pixels.
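
For what it's worth, here is a rough Python sketch of how that distance
could be used as the "fuzz" Loren describes.  The function names, the
greedy bucketing, and the fuzz value of 16 are all my own assumptions for
illustration, not anything that exists in a filter today, and the
threshold would need tuning against real mail:

```python
import math

def rgb_distance(c1, c2):
    """Euclidean distance between two (r, g, b) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def bucket_colors(pixel_counts, fuzz=16.0):
    """Greedily merge colors that lie within `fuzz` of a bucket's
    representative color.  `pixel_counts` maps (r, g, b) -> pixel count.
    The most common colors are visited first and become representatives."""
    buckets = {}
    for color, count in sorted(pixel_counts.items(), key=lambda kv: -kv[1]):
        for rep in buckets:
            if rgb_distance(color, rep) <= fuzz:
                buckets[rep] += count
                break
        else:
            # No existing bucket is close enough; start a new one.
            buckets[color] = count
    return buckets

def dominant_share(pixel_counts, fuzz=16.0, ignore=(0, 0, 0)):
    """Fraction of pixels in the largest bucket, after dropping every
    color within `fuzz` of `ignore` (black by default)."""
    kept = {c: n for c, n in pixel_counts.items()
            if rgb_distance(c, ignore) > fuzz}
    if not kept:
        return 0.0
    buckets = bucket_colors(kept, fuzz)
    return max(buckets.values()) / sum(buckets.values())
```

Feed it a pixel count per color map entry; a share close to 1.0 means
that, black aside, the image is effectively one color even when the
sender has sprinkled in 1-bit variations.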


>It might also be interesting to accumulate a) total area of each color and
>b) largest rectangle (or other easily detected shape) of each color.  The
>first case we would have from the pixel counts.  The second case could be
>used to detect large areas of fill color.  This might help classify a text
>message vs a map of the world or a picture of downtown Cameroon.
>  
>

Why?  What does downtown Cameroon look like?  ;-)

>It also might be interesting to accumulate statistics on the common color
>distributions for 10K or so legit images sent through email, possibly along
>with some sort of indication of purpose: "picture of me", "picture of my
>dog", "billboard I saw", "kids at Christmas", "Hallmark greeting card", etc.
>  
>

But those aren't sent as multipart/alternative... because you want to
see both the text and the images.  The spammers send
multipart/alternative because they want the text/plain section to
confuse the Bayes filters, since they know it won't be rendered...

>With that info the color distribution might be able to help classify the
>image fairly cheaply.
>
>I don't know how much of the above would be absolutely necessary, but I
>suspect at least some of it is.  Still, this is a fairly trivial sort of
>thing to have to accumulate.  Especially since all spam (at least currently)
>uses gifs, which a blind man can decode with a hair comb - no fancy software
>required.
>
>        Loren
>  
>


Yup.  Exactly.

-Philip

