Followup on my earlier message...

On Monday 04 December 2006 11:11, Ian Turner wrote:
> Yup. All of the FPs in my corpus are outlook messages with inline images.
> But it turns out that some of those are also spam; the actual FP rate is

The actual FP rate, after eliminating false false positives (i.e., after
corpus cleaning), is 4 messages in 4773, or 0.08%.
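
For anyone who wants to check the arithmetic, here is a minimal Python
sketch. The function name fp_rate_percent is made up for illustration; only
the counts (4 and 4773) come from my corpus.

```python
def fp_rate_percent(false_positives: int, total_messages: int) -> float:
    """Return the false-positive rate as a percentage of all messages."""
    if total_messages <= 0:
        raise ValueError("total_messages must be positive")
    return 100.0 * false_positives / total_messages


# Counts from the cleaned corpus: 4 FPs out of 4773 messages.
rate = fp_rate_percent(4, 4773)
print(f"{rate:.2f}%")  # prints "0.08%"
```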

> That's what I'm trying to do, but this particular spammer seems to have
> been very careful (or really used outlook to generate the message) -- it
> seems to match exactly, at least at the MIME and RFC822 layers. I'm looking
> into HTML now.

A careful review of HTML messages from this class of spam and HTML messages 
from my corpus reveals nothing distinctive about the spam; the message 
template was almost certainly generated using Outlook Express itself. The 
rule I've already suggested (OE_MULTIPART_RELATED) is the most distinctive 
aspect I can find, barring any analysis of the image itself (which I leave to 
the ImageInfo or OCR plugins).
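
To make the idea concrete, here is a rough Python sketch of the kind of
check I have in mind. It is not the rule definition itself (that is not
reproduced in this message); it assumes, going only by the rule's name, that
the test combines an Outlook Express X-Mailer header with a top-level
multipart/related content type. The function name and the exact header
matching are hypothetical.

```python
from email import message_from_string


def looks_like_oe_multipart_related(raw_message: str) -> bool:
    """Hypothetical check: OE X-Mailer plus a multipart/related top level."""
    msg = message_from_string(raw_message)
    # Assumption: the mailer identifies itself via X-Mailer.
    mailer = msg.get("X-Mailer", "")
    # get_content_type() returns the lowercased type/subtype.
    return ("Outlook Express" in mailer
            and msg.get_content_type() == "multipart/related")


sample = (
    "X-Mailer: Microsoft Outlook Express 6.00.2900.2869\n"
    'Content-Type: multipart/related; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/html\n"
    "\n"
    "<html></html>\n"
    "--b1--\n"
)
print(looks_like_oe_multipart_related(sample))
```

A check like this would match legitimate Outlook Express mail with inline
images just as readily as the spam, which is where the FPs come from.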

Cheers,

--Ian
