At 10:26 PM 8/21/2006 -0700, John Rudd wrote:
>I also heard that interlaced gif spam is appearing now.

Yes, I saw that post; however, there wasn't a publicly available sample. Any such would be much appreciated.

>It'd be interesting to see how to counter them.

Should be easy. One approach is "pixel density". What I've been doing is reading JUST enough of the header to calculate the area (just like Dallas' excellent ImageInfo plugin), then dividing by the total raw file size of just the image (i.e. what one gets after base64 decoding just the GIF part), less the size of the obvious parts of the header. Works well, and is blindingly fast.

Ham generally has a much LOWER density, because it's typically clipart, whereas spam is generally text, which compresses extremely well, resulting in a much HIGHER density (more pixels per byte). It's not foolproof, so I use a sliding scale, and have had only one FP this month (from an idiot (redundant) recruiter to one of my testers - the PNG misfiring was only half the points required to reject, and the able idiot managed to do several other things rare in Ham).

The beauty is that the spammer can "easily" foil this by lowering the density, which means adding more complexity, which increases the file size, so more bandwidth is consumed. :)

Some stock spams do use a fancier font which scores lower, so I'm still considering other types of analysis as a backup.

Specifically to address animated GIFs, it would be very easy to "walk" the raw image, calculating each frame's pixel density, simply ignoring the obvious chaff frames. Tomorrow, I'll write some code to decompose the frames and see what sort of numbers I get.

>For interlaced ... I have no idea. Depends a lot on how the interlaced
>images are stored, I guess.

Yes, exactly. Until there are samples, I'm not going to worry about it.

What we also need is a diverse Ham GIF corpus. Does anyone know of one?

- "Chip"

P.S. Dallas: it never occurred to me to _JUST_ score the area. My pixel density approach fails on multi-GIFs, so you saved my bacon there. ;)
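
For anyone who wants to experiment, here is a minimal Python sketch of the pixel density calculation described above, for single-image GIFs only. It is not the code actually in use. The function name gif_pixel_density is hypothetical, and treating the signature, logical screen descriptor, and global color table as the "obvious parts of the header" is an assumption. The input is taken to be the already base64-decoded GIF bytes.

```python
import struct


def gif_pixel_density(data: bytes) -> float:
    """Return pixels per byte of payload for a GIF held in memory.

    Assumption: the "obvious" header is the 6-byte signature, the 7-byte
    logical screen descriptor, and the global color table if one is present.
    """
    if len(data) < 13 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")

    # Logical screen descriptor: width and height are little-endian 16-bit
    # values, followed by a packed flags byte.
    width, height, flags = struct.unpack("<HHB", data[6:11])

    header_size = 13
    if flags & 0x80:
        # Global color table present: 3 bytes per entry,
        # 2^(N+1) entries where N is the low 3 bits of the flags byte.
        header_size += 3 * (2 << (flags & 0x07))

    payload = len(data) - header_size
    if payload <= 0:
        raise ValueError("truncated GIF")

    # Higher values mean the image compressed well (typical of plain text).
    return (width * height) / payload
```

The result only makes sense as one input to a sliding scale; where to put the cutoffs has to come from measuring one's own ham and spam.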