Daniel Quinlan <[EMAIL PROTECTED]> writes:
> We can always just test it...
Okay, I tested it on my last 7 days of spam and ham (which I just
generated today).
OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
    4895     2868     2027  0.586  0.00   0.00  (all messages)
 100.000  58.5904  41.4096  0.586  0.00   0.00  (all messages as %)
  42.451  71.0948   1.9240  0.974  1.00   1.00  URIBL_SBL
   0.204   0.3487   0.0000  1.000  0.97   0.01  T_URIBL_SC_SURBL
   0.756   0.9763   0.4440  0.687  0.22   1.00  URIBL_DSBL
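For anyone reading the columns for the first time: the figures above are consistent with S/O being SPAM% / (SPAM% + HAM%) and OVERALL% being the two rates weighted by the corpus counts. The following is only a sketch of that arithmetic, not the actual hit-frequencies code; the function names `s_o` and `overall_pct` are made up for illustration.

```python
# Corpus sizes, taken from the "(all messages)" row above.
N_SPAM = 2868
N_HAM = 2027


def s_o(spam_pct, ham_pct):
    """Share of the combined hit rate that comes from spam (assumed formula)."""
    return spam_pct / (spam_pct + ham_pct)


def overall_pct(spam_pct, ham_pct, n_spam=N_SPAM, n_ham=N_HAM):
    """Hit rate over the whole corpus, weighting each class by its size."""
    return (spam_pct * n_spam + ham_pct * n_ham) / (n_spam + n_ham)


# (SPAM%, HAM%) pairs copied from the table above.
rows = {
    "URIBL_SBL": (71.0948, 1.9240),
    "T_URIBL_SC_SURBL": (0.3487, 0.0000),
    "URIBL_DSBL": (0.9763, 0.4440),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{name:18s} S/O={s_o(spam_pct, ham_pct):.3f} "
          f"OVERALL%={overall_pct(spam_pct, ham_pct):.3f}")
```

With the three rows above, this reproduces the S/O and OVERALL% columns to three decimals.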
No FPs for T_URIBL_SC_SURBL, but its SPAM% is rather low. I suspect the
reason is that SURBL is a direct listing of URIs, whereas URIBL does the
NS->A->RBL mapping.
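To make the difference concrete, here is a rough sketch of the DNS names each approach would end up querying. It does no network lookups. The zone names (`sc.surbl.org`, `sbl.spamhaus.org`) and both helper functions are assumptions for illustration only, not what the rules necessarily use.

```python
def surbl_query(domain, zone="sc.surbl.org"):
    """Direct listing: the domain from the URI is looked up as-is in the zone."""
    return f"{domain}.{zone}"


def dnsbl_query(ip, zone="sbl.spamhaus.org"):
    """IP-based RBL: the octets are reversed and prepended to the zone.

    In the NS->A->RBL approach, `ip` would be the address of one of the
    domain's nameservers, found by resolving NS and then A records first.
    """
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone


# example.com and 192.0.2.1 are documentation placeholders, not real data.
print(surbl_query("example.com"))
print(dnsbl_query("192.0.2.1"))
```

The direct lookup only hits when that exact domain has been listed, while the nameserver route can hit any domain hosted on a listed nameserver, which would fit the much higher SPAM% for URIBL_SBL.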
Also, as expected, my hits were largely confined to the last 4 days,
even though the corpus includes the last 7 days of my spam:
first message in corpus: Fri Mar 19 23:11:07 2004
last message in corpus: Sun Mar 28 05:16:17 2004
hits:
Sun Mar 21 10:15:04 2004
Sun Mar 21 11:16:25 2004
Wed Mar 24 15:06:53 2004
Thu Mar 25 12:30:52 2004
Thu Mar 25 23:56:50 2004
Fri Mar 26 01:42:13 2004
Fri Mar 26 01:59:56 2004
Fri Mar 26 03:45:22 2004
Fri Mar 26 08:28:00 2004
Sat Mar 27 05:57:20 2004
distribution of messages in corpus:
count  received date
   23  Mar 19
  360  Mar 20
  335  Mar 21
  369  Mar 22
  324  Mar 23
  372  Mar 24
  390  Mar 25
  398  Mar 26
  295  Mar 27
    2  Mar 28
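The per-day counts sum to 2868, the spam total, so the hits can be set against them directly. A small sketch of that tally, using only the numbers listed above; treating Mar 24 through Mar 27 as the "recent" window is my assumption.

```python
from collections import Counter

# Spam received per day, copied from the distribution above.
received = {
    "Mar 19": 23, "Mar 20": 360, "Mar 21": 335, "Mar 22": 369,
    "Mar 23": 324, "Mar 24": 372, "Mar 25": 390, "Mar 26": 398,
    "Mar 27": 295, "Mar 28": 2,
}

# Day of each of the ten hits, copied from the "hits:" list.
hit_days = [
    "Mar 21", "Mar 21", "Mar 24", "Mar 25", "Mar 25",
    "Mar 26", "Mar 26", "Mar 26", "Mar 26", "Mar 27",
]
hits = Counter(hit_days)

# Per-day hit rate: hits divided by spam received that day.
for day, count in received.items():
    n = hits.get(day, 0)
    print(f"{day}  {count:4d} msgs  {n:2d} hits  {100.0 * n / count:5.2f}%")

# Assumed "recent" window: the last four days with a full day of mail.
recent = ["Mar 24", "Mar 25", "Mar 26", "Mar 27"]
recent_hits = sum(hits[d] for d in recent)
total_hits = sum(hits.values())
print(f"{recent_hits} of {total_hits} hits fall in the recent window")
```

On these figures, 8 of the 10 hits fall in that window, although the daily spam volume is roughly flat from Mar 20 onward.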
That short window may or may not help with accuracy, but it will
definitely make delayed testing harder.
Daniel
--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting