False positives and Bayes

Justin Lloyd Thu, 24 Aug 2006 14:51:42 -0700

Hello, all.

A couple of months ago I built new mail servers to replace our existing ones, which had aging mail configurations (and disparate OS configurations) and were running sendmail 8.12.6 and SA 3.0.2. Our configuration now consists of two RHEL 4 ES servers that share the load using DNS round-robin, running sendmail 8.13.7 and SpamAssassin 3.1.3. We run sa-update and rulesdujour nightly (though actual updates are rare). We use spamass-milter 0.31, which we have configured to drop spam with scores >= 10, thereby dropping about 75% of the incoming email before it gets to our Exchange servers. These servers do not deliver mail locally; all received mail goes either to internal MS Exchange servers or to Linux helpdesk and mailing list servers. Our company is about 350 people, and we receive a good deal of legitimate international email.

Here is our SA configuration from /etc/mail/spamassassin/local.cf:

required_score 5

rewrite_header Subject *** SPAM [_SCORE_] ***

report_safe 0

dcc_path /usr/local/bin/dccproc

razor_config /etc/mail/spamassassin/.razor/razor-agent.conf

dns_available yes

bayes_path /localhost/home/spamd/bayes

bayes_auto_learn_threshold_spam 30

bayes_auto_learn_threshold_nonspam -0.1

bayes_min_ham_num 100000

bayes_min_spam_num 100000

auto_whitelist_path /localhost/home/spamd/auto-whitelist

include /etc/mail/spamassassin/whitelist

include /etc/mail/spamassassin/blacklist
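As a rough illustration of how the two score thresholds described above interact (required_score 5 for tagging, and the spamass-milter drop at >= 10), here is a minimal Python sketch. The function name and the three labels are hypothetical, chosen only for illustration; this is not how SpamAssassin or spamass-milter are implemented, and it ignores whitelist and blacklist entries as well as the separate autolearn thresholds.

```python
# Thresholds taken from the configuration described in this message.
REQUIRED_SCORE = 5.0      # required_score in local.cf: tag as spam at or above this
MILTER_DROP_SCORE = 10.0  # spamass-milter setting: drop at or above this


def disposition(score: float) -> str:
    """Return a hypothetical label for what happens to a message with this score."""
    if score >= MILTER_DROP_SCORE:
        return "drop"     # never reaches the internal servers
    if score >= REQUIRED_SCORE:
        return "tag"      # delivered with the rewritten Subject header
    return "deliver"      # delivered without a spam tag


if __name__ == "__main__":
    for score in (-2.6, 4.9, 5.0, 7.0, 10.0, 23.4):
        print(f"{score:6.1f} -> {disposition(score)}")
```

Under this reading, anything scoring between 5 and 10 is delivered with the spam tag in the Subject line rather than dropped.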

Here are the statistics from both mail servers for the past 31 days:

Email: 1303815 Autolearn: 608540 AvgScore: 12.23 AvgScanTime: 1.38 sec

Spam: 745609 Autolearn: 139632 AvgScore: 23.36 AvgScanTime: 1.52 sec

Ham: 558206 Autolearn: 468908 AvgScore: -2.63 AvgScanTime: 1.20 sec

Email: 945103 Autolearn: 284139 AvgScore: 15.33 AvgScanTime: 1.46 sec

Spam: 701327 Autolearn: 131994 AvgScore: 22.30 AvgScanTime: 1.46 sec

Ham: 243776 Autolearn: 152145 AvgScore: -4.74 AvgScanTime: 1.44 sec

(We think the disparity in mail counts between the two servers is due to some senders having cached or hard-coded the first one’s IP address and using it rather than doing MX lookups like normal people do.)
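To put the figures above in proportion, the following Python sketch derives the spam share and the autolearn rates for each server from the reported counts. The labels "first" and "second" simply follow the order in which the two sets of statistics are listed, which is an assumption; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    total: int           # "Email" count
    spam: int            # "Spam" count
    ham: int             # "Ham" count
    spam_autolearn: int  # "Autolearn" value on the Spam line
    ham_autolearn: int   # "Autolearn" value on the Ham line


# Counts copied from the 31-day statistics above; the keys are assumed labels.
SERVERS = {
    "first": ServerStats(1303815, 745609, 558206, 139632, 468908),
    "second": ServerStats(945103, 701327, 243776, 131994, 152145),
}


def spam_share(s: ServerStats) -> float:
    """Fraction of all scanned mail that was classified as spam."""
    return s.spam / s.total


def autolearn_rate(learned: int, count: int) -> float:
    """Fraction of a class (spam or ham) that was autolearned."""
    return learned / count


if __name__ == "__main__":
    for name, s in SERVERS.items():
        print(
            f"{name}: spam share {spam_share(s):.1%}, "
            f"spam autolearned {autolearn_rate(s.spam_autolearn, s.spam):.1%}, "
            f"ham autolearned {autolearn_rate(s.ham_autolearn, s.ham):.1%}"
        )
```

The sketch only restates the reported numbers as ratios; it makes no claim about why the two servers differ.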

The major problem we are seeing is a number of false positives in the 6-8 point range, where 3.5 of the points come from BAYES_99 on messages that, from what we can see, should not be hitting that rule. One thing we’ve noticed is that many such messages are from mailing lists and newsletters and from ISPs that shall remain nameless, though many of these also score high because they hit several rfc-ignorant rules.
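The arithmetic behind these false positives can be sketched in a few lines of Python: subtracting the 3.5 points from BAYES_99 from a total of 6 to 8 leaves a score below required_score 5. The function names are hypothetical, and the sketch assumes every other rule score stays the same, which would not strictly hold if Bayes were disabled entirely, since SpamAssassin then uses a different score set.

```python
REQUIRED_SCORE = 5.0   # required_score from local.cf
BAYES_99_POINTS = 3.5  # points contributed by BAYES_99, as reported above


def score_without_bayes99(total: float) -> float:
    """Score the message would have if BAYES_99 had not fired (all else equal)."""
    return total - BAYES_99_POINTS


def is_tagged(score: float) -> bool:
    """True if the score meets the tagging threshold."""
    return score >= REQUIRED_SCORE


if __name__ == "__main__":
    # The two ends of the 6-8 point range described above.
    for total in (6.0, 8.0):
        remaining = score_without_bayes99(total)
        print(f"total {total}: tagged={is_tagged(total)}, "
              f"without BAYES_99 {remaining}: tagged={is_tagged(remaining)}")
```

In other words, for messages in this range the BAYES_99 hit alone is what pushes them over the tagging threshold.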

We have turned off Bayes in the past (before the upgrade) and are debating doing so again, but first we decided to see what constructive criticism and advice the SA community may have regarding our configuration. Please let me know if any additional information would be useful.

Thanks,

Justin C. Lloyd

Senior Engineer and System Administrator

303-684-4166 Office

720-480-0380 Cell

303-684-4100 Fax

[EMAIL PROTECTED]

DigitalGlobe ®, An Imaging and Information Company
