Followup on my earlier message...

On Monday 04 December 2006 11:11, Ian Turner wrote:
> Yup. All of the FPs in my corpus are outlook messages with inline images.
> But it turns out that some of those are also spam; the actual FP rate is
The actual FP rate, after eliminating false false positives (that is, after
corpus cleaning), is 4 messages in 4773, or 0.08%.

> That's what I'm trying to do, but this particular spammer seems to have
> been very careful (or really used outlook to generate the message) -- it
> seems to match exactly, at least at the MIME and RFC822 layers. I'm looking
> into HTML now.

A careful review of HTML messages from this class of spam and HTML messages
from my corpus reveals nothing distinctive about the spam; the message
template was almost certainly generated using Outlook Express itself. The
rule I've already suggested (OE_MULTIPART_RELATED) is the most distinctive
aspect I can find, barring any analysis of the image itself (which I leave to
the ImageInfo or OCR plugins).

Cheers,

--Ian
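
P.S. For anyone who wants to check the arithmetic behind the 0.08% figure, here is a minimal Python sketch. The function and parameter names are illustrative only and are not part of SpamAssassin or its mass-check tools; the only inputs taken from this message are the two counts, 4 and 4773.

```python
def fp_rate_percent(false_positives: int, total_messages: int) -> float:
    """Return the false-positive rate as a percentage of total_messages."""
    if total_messages <= 0:
        raise ValueError("total_messages must be positive")
    return 100.0 * false_positives / total_messages


# Counts quoted in this message: 4 remaining FPs out of 4773 messages.
rate = fp_rate_percent(4, 4773)
print(f"{rate:.2f}%")  # prints 0.08%
```

Unrounded, the value is roughly 0.084%, so the 0.08% above is the figure rounded to two decimal places.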