Loren Wilton wrote:

>>If, after excluding black, we find that 100% of the color map is that
>>nasty pastel pink or pastel lime green (etc) then it's a spam and we
>>toss it.
>>
>>Sound reasonable?
>>
>
>I was thinking about this the other day. I think the concept is reasonable,
>but as stated doesn't go far enough, and would be trivial to bypass.
>
>I think that someone first needs to come up with either a formula or a list
>of RGB triples that are "visually indistinguishable" or some such. (I
>suspect this has been done several times now and the research should exist
>in the wild.)
>
>This can then be used as a fuzz to group colors that are very close down
>into a common bucket. As it is, trivial 1-bit variations on colors would
>defeat the simple scheme.
>
Shhhhhh.... they might be listening... ;-)

Seriously, though, how many people send out 2-color GIFs (besides B&W scans
of Dilbert and faxes) as email?

The formula is:

    sqrt((r1 - r2)^2 + (g1 - g2)^2 + (b1 - b2)^2)

to generate the RGB vector distance between two pixels.

>It might also be interesting to accumulate a) total area of each color and
>b) largest rectangle (or other easily detected shape) of each color. The
>first case we would have from the pixel counts. The second case could be
>used to detect large areas of fill color. This might help classify a text
>message vs a map of the world or a picture of downtown Cameroon.
>

Why? What does downtown Cameroon look like? ;-)

>It also might be interesting to accumulate statistics on the common color
>distributions for 10K or so legit images sent through email, possibly along
>with some sort of indication of purpose: "picture of me", "picture of my
>dog", "billboard I saw", "kids at Christmas", "Hallmark greeting card", etc.
>

But those aren't sent as multipart/alternative... because you want to see
both the text and the images. The spammers send multipart/alternative
because they want the text/plain section to confuse the Bayes filters, since
they know it won't be rendered...

>With that info the color distribution might be able to help classify the
>image fairly cheaply.
>
>I don't know how much of the above would be absolutely necessary, but I
>suspect at least some of it is. Still, this is a fairly trivial sort of
>thing to have to accumulate. Especially since all spam (at least currently)
>uses gifs, which a blind man can decode with a hair comb - no fancy software
>required.
>
>        Loren
>

Yup. Exactly.

-Philip
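
For anyone who wants to experiment, here is a rough Python sketch of the RGB distance formula above combined with the "fuzz colors into a common bucket" idea from the quoted text. It is only an illustration, not a tested filter: the function names, the greedy bucketing strategy, and the threshold of 12.0 are all assumptions made up for this example, and it expects the per-color pixel counts to have been extracted from the image already.

```python
import math
from collections import Counter


def rgb_distance(c1, c2):
    """Euclidean distance between two (r, g, b) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def bucket_colors(pixel_counts, threshold=12.0):
    """Greedily merge colors lying within `threshold` of a bucket's color.

    pixel_counts maps (r, g, b) -> number of pixels of that color.
    The threshold is an arbitrary, untuned assumption.
    """
    buckets = Counter()
    # Visit the most common colors first so they become the representatives.
    for color, count in sorted(pixel_counts.items(), key=lambda kv: -kv[1]):
        for rep in buckets:
            if rgb_distance(color, rep) <= threshold:
                buckets[rep] += count
                break
        else:
            # No existing bucket is close enough; start a new one.
            buckets[color] += count
    return buckets


def dominant_fraction(pixel_counts, ignore=(0, 0, 0), threshold=12.0):
    """Share of pixels in the largest bucket, after dropping colors near `ignore`."""
    kept = {c: n for c, n in pixel_counts.items()
            if rgb_distance(c, ignore) > threshold}
    total = sum(kept.values())
    if total == 0:
        return 0.0
    return max(bucket_colors(kept, threshold).values()) / total
```

With black excluded, an image whose remaining pixels are all 1-bit variations of one pastel pink yields a dominant fraction of 1.0, which is the case the simple exact-match scheme would miss. Plain RGB distance is not perceptually uniform, so the threshold would need tuning against real mail.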