Austin,

> > now hope to do this Thursday/Friday.  I should be able to scan my
> > million or so messages in a day on my cluster.
> 
> Wow, that makes me feel inadequate :)  I'm struggling to clean up my
> little ham sample of 3600 messages, and looking at another couple
> thousand that I'll do if I've got time...

Thanks, that will be nice to have. Since the rulesqa site can break down
results by corpus submitter, even a small but carefully checked
collection is worth having.

I found it valuable to double-check ham samples that fire any of these
rules:
  URIBL_JP_SURBL, URIBL_WS_SURBL, URIBL_OB_SURBL,
  RCVD_IN_PBL, RCVD_IN_XBL, RCVD_IN_PSBL, RCVD_IN_SSBL
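
A check like this could be scripted along the following lines. This is
only a rough Python sketch, not part of any SpamAssassin tooling. It
assumes each log line has the whitespace-separated layout seen in the
line you quoted (marker, score, path, comma-separated rule names, then
key=value metadata) and that paths contain no spaces. The function name
suspicious_ham is made up for illustration.

```python
import sys

# Rules listed above; a hit on any of them in ham deserves a second look.
SUSPECT_RULES = {
    "URIBL_JP_SURBL", "URIBL_WS_SURBL", "URIBL_OB_SURBL",
    "RCVD_IN_PBL", "RCVD_IN_XBL", "RCVD_IN_PSBL", "RCVD_IN_SSBL",
}


def suspicious_ham(lines):
    """Yield (path, hit_rules) for log entries that fire a suspect rule."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Assumed layout: marker, score, path, rules, key=value metadata.
        if len(fields) < 4:
            continue
        path = fields[2]
        rules = set(fields[3].split(","))
        hits = sorted(rules & SUSPECT_RULES)
        if hits:
            yield path, hits


# Run as a script with the ham log file as the only argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as log:
        for path, hits in suspicious_ham(log):
            print(path, ",".join(hits))
```

Each output line names a message file and the suspect rules it hit, so
the flagged messages can be reviewed by hand.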

> Also, I need some advice, if someone can provide it.  I'm looking at a
> message (and I have several like this in my corpus at present) which
> generates the following log line
> 
> .  1 /home/gems/ham//cur/n8500ejj019591:2,S
> MISSING_DATE,MISSING_HEADERS,MISSING_MID,T_FSL_HELO_NON_FQDN_2,__DKIM_DEPEN
> DABLE,__DNS_FROM_RFC_ABUSE,__DOS_DIRECT_TO_MX,__DOS_HAS_ANY_URI,__DOS_RCVD_
> FRI,__DOS_SINGLE_EXT_RELAY,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_RCVD,__HAS_S
> UBJECT,__HAVE_BOUNCE_RELAYS,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_
> RELAY_NO_AUTH,__MISSING_REF,__MISSING_REPLY,__MISSING_THREAD,__NONEMPTY_BOD
> Y,__NUMBERS_IN_SUBJ,__RCVD_IN_2WEEKS,__RFC_IGNORANT_ENVFROM,__TO_NO_ARROWS_
> R,__TVD_BODY learn=ham,time=1252108840,scantime=1,format=f,reuse=no,set=1
> 
> It's clearly a poorly constructed message, but it's also clearly ham
> (it originated from an application that someone somewhere in my
> organization runs).  It had one header: Subject.  Then a body.  Should
> I leave stuff like this in?  I mean, it is ham, but...

I can't offer a definite answer (other comments are welcome), but I'd
suggest keeping a few such samples in your ham collection, just not many
copies of them.

  Mark
