The following is an apples-to-apples comparison of DNSBL lastexternal rules against the October 10th, 2009 weekly_mass_check corpora. HOSTKARMA and SEM are new. Hopefully these masscheck results can help identify problems so that list quality can improve over time.

http://ruleqa.spamassassin.org/20091010-r823821-n
128161 Spam
185199 Ham

The results below are only as good as the data submitted by nightly masscheck volunteers. Please join us in nightly masschecks to increase the sample size of the corpora so we can have greater confidence in the nightly statistics.

============================
DNSBL lastexternal by Safety
============================
SPAM%    HAM%    RANK RULE
10.0975% 0.0022% 0.93 RCVD_IN_PSBL
11.4278% 0.0173% 0.91 RCVD_IN_XBL
18.7561% 0.0616% 0.87 RCVD_IN_SEMBLACK
81.8252% 0.1825% 0.83 RCVD_IN_PBL
27.4342% 0.2327% 0.77 RCVD_IN_SORBS_DUL
91.5505% 0.3974% 0.76 RCVD_IN_BRBL_LASTEXT
13.1272% 0.5027% 0.67 RCVD_IN_HOSTKARMA_BL

RANK is heavily influenced by the false positive rate, so it seems to be a rough approximation of safety. RANK alone says little about the effectiveness of a particular rule against spam. These numbers show that Barracuda and PBL are by far the most extensive blacklists, but the false positive rates suggest that Barracuda is aggressive at the expense of safety. Given that zen.spamhaus.org combines XBL and PBL (together with SBL), this data seems to confirm the good reputation of Spamhaus.
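
To make the percentages above easier to compare, here is a minimal Python sketch that converts them back into approximate message counts using the corpus sizes quoted at the top (128161 spam, 185199 ham). The function name and the "spam share of hits" figure are my own illustration, not something ruleqa reports, and that share depends on the spam/ham mix of this particular corpus, so treat it as a rough guide only.

```python
# Corpus sizes quoted at the top of this message.
SPAM_TOTAL = 128161
HAM_TOTAL = 185199

# (SPAM%, HAM%) copied from the "DNSBL lastexternal by Safety" table.
RULES = {
    "RCVD_IN_PSBL": (10.0975, 0.0022),
    "RCVD_IN_XBL": (11.4278, 0.0173),
    "RCVD_IN_SEMBLACK": (18.7561, 0.0616),
    "RCVD_IN_PBL": (81.8252, 0.1825),
    "RCVD_IN_SORBS_DUL": (27.4342, 0.2327),
    "RCVD_IN_BRBL_LASTEXT": (91.5505, 0.3974),
    "RCVD_IN_HOSTKARMA_BL": (13.1272, 0.5027),
}


def estimate_hits(spam_pct, ham_pct):
    """Return (spam_hits, ham_hits, spam_share) estimated from percentages.

    spam_share is the fraction of all hits that were spam in THIS corpus;
    it is an illustrative figure, not the RANK value from ruleqa.
    """
    spam_hits = round(spam_pct / 100 * SPAM_TOTAL)
    ham_hits = round(ham_pct / 100 * HAM_TOTAL)
    total = spam_hits + ham_hits
    spam_share = spam_hits / total if total else 0.0
    return spam_hits, ham_hits, spam_share


if __name__ == "__main__":
    for name, (spam_pct, ham_pct) in RULES.items():
        spam_hits, ham_hits, share = estimate_hits(spam_pct, ham_pct)
        print(f"{name:22s} spam={spam_hits:6d} ham={ham_hits:4d} share={share:.4f}")
```

Seen this way, a HAM% that looks tiny can still mean hundreds of legitimate messages hit, which is why coverage and safety need to be read together.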

Overlap analysis shows that the majority of XBL and PBL hits are also listed by Barracuda. Furthermore, Barracuda's list seems to have a hit % similar to XBL + PBL combined. Is Barracuda known to aggregate Spamhaus data with their own? If so, we might be adding redundant scores in a dangerous and undesirable manner.
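
For anyone unfamiliar with what an overlap figure means, here is a small Python sketch of the idea: the fraction of one rule's hits that another rule also hit. The message IDs below are invented toy data, and the function name is hypothetical; the real overlap numbers come from the ruleqa page, not from this code.

```python
def overlap_fraction(hits_a, hits_b):
    """Fraction of messages hit by rule A that were also hit by rule B."""
    if not hits_a:
        return 0.0
    return len(hits_a & hits_b) / len(hits_a)


# Toy message IDs, made up purely for illustration.
xbl_hits = {1, 2, 3, 4}
brbl_hits = {2, 3, 4, 5, 6, 7, 8, 9}

if __name__ == "__main__":
    # Share of the toy XBL hits that the toy Barracuda set also lists.
    print(overlap_fraction(xbl_hits, brbl_hits))
    # Share of the toy Barracuda hits that the toy XBL set also lists.
    print(overlap_fraction(brbl_hits, xbl_hits))
```

Note that the measure is not symmetric: a small list can be almost entirely contained in a large one while covering only a small part of it. When two rules overlap heavily, scoring them independently counts much of the same evidence twice.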

Adam Katz's sa-update channels contain DNSBL rule overlap adjustments in an attempt to compensate for what he calls "incestuous" blacklists. I am beginning to think this is a good idea to explore for SpamAssassin upstream, if in fact one blacklist is aggregating data from another.

http://ruleqa.spamassassin.org/20091010-r823821-n/
In related news, these results indicate that RCVD_IN_HOSTKARMA_BR and RCVD_IN_SEMBACKSCATTER have so few hits that they are likely not worth the overhead of the extra DNS query to use in production. Unless the list owners object, I will remove them from the sandbox before next Saturday's network masscheck.

=======
Spamcop
=======
SPAM%    HAM%    RANK RULE
16.8663% 2.5994% 0.56 RCVD_IN_BL_SPAMCOP_NET

I did not include SpamCop in the above chart because it is not the same type of lastexternal DNSBL. I'm confused: with such a poor false positive rate, how does it end up with a high score from the GA?

Warren Togami
wtog...@redhat.com
