RE: False positives and Bayes

Justin Lloyd Fri, 25 Aug 2006 07:33:37 -0700

We have an Exchange SpamAssassin folder that our users can drop false
negatives into. Then I periodically run a Perl script (using
Mail::IMAPClient) to retrieve the messages and retrain both mail servers
with those (not just the mail server through which the message arrived).
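
A minimal sketch of this kind of retrieve-and-retrain step follows. It is written in Python with the standard imaplib module, not the Perl Mail::IMAPClient script described above, so treat it as an illustration rather than the script actually in use. The host, credentials, and folder name are placeholders, and the function names are made up for the example. Each message is written to a temporary file and handed to sa-learn with --spam or --ham.

```python
import imaplib
import subprocess
import tempfile


def sa_learn_command(kind, path):
    """Build the sa-learn command line for one message file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, path]


def fetch_messages(host, user, password, folder):
    """Return the raw bytes of every message in an IMAP folder.

    host, user, password and folder are placeholders for local values.
    """
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select(folder, readonly=True)  # do not change message flags
        _status, data = conn.search(None, "ALL")
        messages = []
        for num in data[0].split():
            _status, parts = conn.fetch(num, "(RFC822)")
            messages.append(parts[0][1])  # (envelope, raw message) tuple
        return messages
    finally:
        conn.logout()


def retrain(messages, kind, run=subprocess.run):
    """Feed each raw message to sa-learn as spam or ham."""
    for raw in messages:
        with tempfile.NamedTemporaryFile(suffix=".eml") as tmp:
            tmp.write(raw)
            tmp.flush()
            run(sa_learn_command(kind, tmp.name), check=True)
```

To mirror the setup described here, something like retrain(fetch_messages(...), "spam") would be run on each mail server for the false-negative folder, and with "ham" for the false-positive folder.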


Whenever I receive a report of a false positive, I generally visit the
user and review the message, in case there is some other problem that
could be resolved or to determine if whitelisting would be appropriate,
before having them put it in another Exchange folder, which I then use
to retrain both mail servers as well.

As for the mailing lists, I generally have been avoiding whitelisting
those and instead trying to rely on retraining to get such messages to
not get tagged on their own merits. So far it seems to be working. False
positives on personal emails are more of an issue for us than those from
mailing lists.

Justin

-----Original Message-----
From: Anthony Peacock [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 25, 2006 2:25 AM
To: users@spamassassin.apache.org
Subject: Re: False positives and Bayes

Hi,

Justin Lloyd wrote:
> Hello, all.
> 
> A couple of months ago I built new mail servers to replace our existing
> ones that had aging mail configurations (and disparate OS
> configurations), running sendmail 8.12.6 and SA 3.0.2. Our configuration
> now consists of 2 RHEL 4 ES servers that share the load using DNS
> round-robin, running sendmail 8.13.7 and SpamAssassin 3.1.3, and we are
> running sa-update and rulesdujour nightly (though actual updates are
> rare). We use spamass-milter 0.31, which we have configured to drop
> spams with scores >= 10, thereby dropping about 75% of the incoming
> email before it gets to our Exchange servers. Speaking of which, these
> servers do not deliver mail locally, rather all received mail either
> goes to internal MS Exchange servers or Linux helpdesk and mailing list
> servers. Also, our company is about 350 people and we receive a good
> deal of legitimate international email.
> 
> Here is our SA configuration from /etc/mail/spamassassin/local.cf:
> 
> required_score 5
> rewrite_header Subject *** SPAM [_SCORE_] ***
> report_safe 0
> dcc_path /usr/local/bin/dccproc
> razor_config /etc/mail/spamassassin/.razor/razor-agent.conf
> dns_available yes
> bayes_path /localhost/home/spamd/bayes
> bayes_auto_learn_threshold_spam      30
> bayes_auto_learn_threshold_nonspam   -0.1
> bayes_min_ham_num  100000
> bayes_min_spam_num 100000
> auto_whitelist_path /localhost/home/spamd/auto-whitelist
> include /etc/mail/spamassassin/whitelist
> include /etc/mail/spamassassin/blacklist
> 
> Here are the statistics from both mail servers for the past 31 days:
>       
> Email:  1303815  Autolearn: 608540  AvgScore:  12.23  AvgScanTime: 1.38 sec
> Spam:    745609  Autolearn: 139632  AvgScore:  23.36  AvgScanTime: 1.52 sec
> Ham:     558206  Autolearn: 468908  AvgScore:  -2.63  AvgScanTime: 1.20 sec
> 
> Email:   945103  Autolearn: 284139  AvgScore:  15.33  AvgScanTime: 1.46 sec
> Spam:    701327  Autolearn: 131994  AvgScore:  22.30  AvgScanTime: 1.46 sec
> Ham:     243776  Autolearn: 152145  AvgScore:  -4.74  AvgScanTime: 1.44 sec
> 
> (We think the disparity in mail counts between the two is due to some
> senders having cached or hard-coded the first one's IP address and using
> it rather than MX lookups like normal people do.)
> 
> The major problem we are seeing is a number of false positives in the
> 6-8 point range due to 3.5 points from BAYES_99 on messages that should
> not be hitting that rule from what we can see. One thing we've noticed
> is that many such messages are from mailing lists and newsletters and
> from ISPs that shall remain nameless, though many of these also score
> high due to several rfc-ignorant rules being hit.
> 
> We have turned off Bayes in the past (before the upgrade) and are
> debating doing so again, but first we decided to see what constructive
> criticism and advice the SA community may have regarding our
> configuration. Please let me know if any additional information would be
> useful.

How do you train your Bayes database?

You should be feeding the false positives back using sa-learn as ham, so
that the Bayes scorer learns that these are not spam.  I manually train
Bayes with false positives and false negatives on a regular basis.

You probably should also be looking at whitelisting some of the mailing
lists.  When the manual training really doesn't convince Bayes that the
spammy-looking mailing list messages are ham, I add those lists to one of
the whitelists.
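
As an illustration only, a whitelist entry for a mailing list might look like the fragment below. It could live in a file such as the /etc/mail/spamassassin/whitelist that the quoted local.cf includes. The list address and relay domain are hypothetical placeholders, not lists mentioned in this thread. whitelist_from_rcvd takes an address pattern followed by the domain of the relay that handed the message over, which makes it harder to abuse with a forged From address than a plain whitelist_from.

```
# Hypothetical list address and relay domain; substitute the real ones.
whitelist_from_rcvd  *@lists.example.org  example.org
```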

-- 
Anthony Peacock
CHIME, Royal Free & University College Medical School
WWW:    http://www.chime.ucl.ac.uk/~rmhiajp/
"If you have an apple and I have  an apple and we  exchange apples
then you and I will still each have  one apple. But  if you have an
idea and I have an idea and we exchange these ideas, then each of us
will have two ideas." -- George Bernard Shaw
