Austin,

> > now hope to do this Thursday/Friday. I should be able to scan my
> > million or so messages in a day on my cluster.
>
> Wow, that makes me feel inadequate :) I'm struggling to clean up my
> little ham sample of 3600 messages, and looking at another couple
> thousand that I'll do if I've got time...

Thanks, that will be nice to have. As the rulesqa site can distinguish
results by corpus submitter, even a small but carefully checked
collection is worth having.

I found it valuable to double-check ham samples which fire any of
these rules:
  URIBL_JP_SURBL, URIBL_WS_SURBL, URIBL_OB_SURBL,
  RCVD_IN_PBL, RCVD_IN_XBL, RCVD_IN_PSBL, RCVD_IN_SSBL

> Also, I need some advice, if someone can provide it. I'm looking at a
> message (and I have several like this in my corpus at present) which
> generates the following log line
>
> . 1 /home/gems/ham//cur/n8500ejj019591:2,S
> MISSING_DATE,MISSING_HEADERS,MISSING_MID,T_FSL_HELO_NON_FQDN_2,__DKIM_DEPEN
> DABLE,__DNS_FROM_RFC_ABUSE,__DOS_DIRECT_TO_MX,__DOS_HAS_ANY_URI,__DOS_RCVD_
> FRI,__DOS_SINGLE_EXT_RELAY,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_RCVD,__HAS_S
> UBJECT,__HAVE_BOUNCE_RELAYS,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_
> RELAY_NO_AUTH,__MISSING_REF,__MISSING_REPLY,__MISSING_THREAD,__NONEMPTY_BOD
> Y,__NUMBERS_IN_SUBJ,__RCVD_IN_2WEEKS,__RFC_IGNORANT_ENVFROM,__TO_NO_ARROWS_
> R,__TVD_BODY learn=ham,time=1252108840,scantime=1,format=f,reuse=no,set=1
>
> It's clearly a poorly constructed message, but it's also clearly ham
> (it originated from an application that someone somewhere in my
> organization runs). It had one header: Subject. Then a body. Should
> I leave stuff like this in? I mean, it is ham, but...

I can't offer a definite answer (other comments are welcome), but
I'd say keep a few such samples in your ham collection, just not
many copies of the same thing.

  Mark
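
P.S. In case it helps with the double-checking, here is a rough sketch of how one might pull out the ham log entries that hit the rules listed above. It is not part of any SpamAssassin tool. It assumes each log entry is a single line laid out like the one quoted above (flag, score, path, comma-separated rule names, comma-separated key=value metadata, all separated by whitespace) and that the path contains no spaces. The function name `suspect_hits` is made up for this example.

```python
# Rules worth a second look when they fire on ham (list taken from this message).
SUSPECT_RULES = {
    "URIBL_JP_SURBL", "URIBL_WS_SURBL", "URIBL_OB_SURBL",
    "RCVD_IN_PBL", "RCVD_IN_XBL", "RCVD_IN_PSBL", "RCVD_IN_SSBL",
}


def suspect_hits(log_line):
    """Return (path, sorted suspect rules hit) for one log line.

    Assumed layout (hypothetical, inferred from the quoted sample):
        <flag> <score> <path> <RULE,RULE,...> <key=value,...>
    Returns None for blank lines, comment lines, and lines that
    have fewer than four whitespace-separated fields.
    """
    line = log_line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        return None
    path, rules = fields[2], fields[3]
    hits = sorted(SUSPECT_RULES.intersection(rules.split(",")))
    return path, hits


def paths_to_review(lines):
    """Yield (path, hits) for every entry that fired at least one suspect rule."""
    for line in lines:
        parsed = suspect_hits(line)
        if parsed is not None and parsed[1]:
            yield parsed
```

Feeding the lines of a ham log through `paths_to_review` would give a list of messages to re-inspect by hand; an entry like the one quoted above, which hits none of the listed rules, would simply be skipped.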