Re: false positives and negatives

jdow Mon, 30 May 2005 20:21:30 -0700

From: "Chavdar Videff" <[EMAIL PROTECTED]>

> Dear List,
>
> I know these are subject of the FAQ and the documentation, yet after I read
> all of it I didn't get an answer to the following questions:
>
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0
> points and gets through (even if we set required hits to 3 and 2 for certain
> mailboxes).
>
> 2. Mail composed as HTML is rated as spam for the above reason.
>
> What can we do to improve the situation and boost the performance of SA.
>
> I assume that if we set required hits below 5.0, ham messages composed as HTML
> will be rated as spam. However, the overwhelming number of spam rated below
> 4, 3, 2 and even 1 points that we receive renders spamassassin useless for
> our mail-server.
>
> We sort ham and spam and run sa-learn daily in order to train SA, we feed the
> low-rated spam and ham that is not rated correctly to sa-learn without any
> success: most messages (that are repeated) continue to go through.
>
> Please help.
>
> Why doesn't sa-learn help. We thought that if we submit to sa-learn a messages
> that was mistaken, the next time a message that is the same or from the same
> address will be sorted correctly.
>
>
> Following is the configuration file (debian sid, sendmail, sitewide
> configuration of SA).
>
> mail1:/home/chavdar# cat /etc/mail/spamassassin/local.cf
> # This is the right place to customize your installation of SpamAssassin.
> #
> # See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
> # tweaked.
> #
> ###########################################################################
> #
> # rewrite_header Subject *****SPAM*****
> # report_safe 1
> # trusted_networks 10.50
> # lock_method flock
>
> required_hits 3
> rewrite_subject 1
> report_header 1
> use_terse_report 1
> defang_mime 0
> report_safe 0
> use_bayes 1
> auto_learn 1
  ^^^^^^^^^^


IMAO that is an utterly darned fool thing to use when coming up from a
cold SpamAssassin start. I've found that a raw SpamAssassin install is
wretched at filtering spam. Using autolearn at that time leads to the
Bayes filter being very poorly trained. There, I got that off my
chest. TPTB designing SpamAssassin disagree with me, obviously.
My opinion comes from watching this list for a couple of years or so now.
Either auto_learn needs to default to off or the spam/ham autolearn
thresholds need to be dramatically changed. (I also note that a fluke
in the scores configuration resulted in BAYES_99 having an absurdly
low score in spite of its being nearly a perfect spam sign on a well
trained database. I take that as an indication that a poorly trained
"autolearned" database was in use when they set up their entire scoring
set.)
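
A minimal sketch of why autolearn on a cold install can poison Bayes. It is
a deliberately simplified model, not SpamAssassin's real logic (which, as far
as I know, scores the message without the Bayes rules and applies extra
conditions). The defaults of 0.1 and 12.0 are what I believe SpamAssassin 3.x
documents for bayes_auto_learn_threshold_nonspam and
bayes_auto_learn_threshold_spam; check 'perldoc Mail::SpamAssassin::Conf' for
your version. The sample scores and the extended thresholds are made up for
illustration.

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Simplified model: return 'ham', 'spam', or None (message not learned)."""
    if score < nonspam_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None


def count_mislearned(spam_scores, nonspam_threshold, spam_threshold):
    """Count real spam messages that this model would autolearn as ham."""
    return sum(
        1
        for score in spam_scores
        if autolearn_decision(score, nonspam_threshold, spam_threshold) == "ham"
    )


# Hypothetical scores for spam that slipped through a poorly tuned install.
missed_spam = [0.0, -0.5, 1.2, 1.9]

# Assumed default thresholds: the two lowest-scoring spams are learned as ham.
print(count_mislearned(missed_spam, 0.1, 12.0))

# Thresholds moved farther from the spam threshold: nothing is mislearned.
print(count_mislearned(missed_spam, -2.0, 20.0))
```

The point is only the shape of the problem: the weaker the rule set, the more
spam lands under the ham threshold and gets learned the wrong way round.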

My personal suggestions follow.

1) Nuke your current Bayes.
2) Install carefully selected SARE rule sets. Update them weekly or more
   often, and review your selection against the current SARE offerings
   at least once a month.
3) Turn off auto_learn or move the ham and spam thresholds for autolearn
   farther away from your spam threshold. (I just fail to see the logic.
   I consider it to be a tool for killing Bayes.)
4) Grit your teeth and use SURBL. (I don't like many black list policies.
   SURBL is quite honorable about theirs.)
5) Manually train on ham and spam per user with per user Bayes. (Shared
   Bayes is often less than useless. One person's desired porno mail is
   another person's extreme spam.)
6) NEVER delete spam. Forward it to the user marked as spam so that it
   can be eliminated after their review. An ISP should include in the
   default email setup a spam folder with a rule that places the spam
   into that folder. Explain to users why you did this and how to change
   the rule from filing spam in the folder to simply deleting it.
7) If you later re-enable auto_learn, do so with extended thresholds.
8) If you manually train, do it rigorously for the first few weeks. After
   that only train Bayes on spam if you happen to notice a low scoring spam
   that is not BAYES_99 and includes more than one or two lines of text.
9) Save all the ham and spam you used for training in case you have to
   rebuild the Bayes in the future. It saves time.
10) If you're in a multi-user environment make it as easy as possible for
   users to move an email from the incoming area to the spam or ham folders
   you provide for them. Then have a script that performs automatic
   training at sane intervals. As part of training divert the spam, at
   least, to a spam database that can be used to retrain Bayes as quickly
   as possible in case of a glitch.
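
Items 9 and 10 could be sketched roughly as below. This is only a sketch
under stated assumptions: the folder names "spam-to-learn" and "ham-to-learn"
and the corpus layout are hypothetical, each folder is assumed to exist and
to hold one message per file, and the script is assumed to run as the user
whose Bayes is being trained (for example from that user's crontab). The only
SpamAssassin commands used are "sa-learn --spam <dir>" and
"sa-learn --ham <dir>".

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical folder names; use whatever your mail setup provides.
SPAM_FOLDER = "spam-to-learn"
HAM_FOLDER = "ham-to-learn"


def learn_commands(mail_root):
    """Build the sa-learn command lines for one user's review folders."""
    root = Path(mail_root)
    return [
        ["sa-learn", "--spam", str(root / SPAM_FOLDER)],
        ["sa-learn", "--ham", str(root / HAM_FOLDER)],
    ]


def archive_messages(folder, corpus_dir):
    """Copy each message file into the corpus; return how many were copied."""
    src, dst = Path(folder), Path(corpus_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for msg in sorted(src.iterdir()):
        if msg.is_file():
            shutil.copy2(msg, dst / msg.name)
            copied += 1
    return copied


def train_user(mail_root, corpus_root, dry_run=True):
    """Archive both review folders, then run (or only return) the sa-learn calls."""
    root, corpus = Path(mail_root), Path(corpus_root)
    counts = {
        "spam": archive_messages(root / SPAM_FOLDER, corpus / "spam"),
        "ham": archive_messages(root / HAM_FOLDER, corpus / "ham"),
    }
    commands = learn_commands(root)
    if not dry_run:
        for cmd in commands:
            # Runs as the current user, so it trains that user's own Bayes.
            subprocess.run(cmd, check=True)
    return counts, commands
```

Calling train_user("/home/someuser/Mail", "/home/someuser/bayes-corpus",
dry_run=False) from cron would be one way to get the "sane intervals" (both
paths are placeholders). The archived copies are what you feed back to
sa-learn if the Bayes database ever has to be rebuilt.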

This is the road to very low false negatives and false positives. I tend
to get 700 to 1300 emails per day. Of that about 1/3 is spam. (Hey, the
Linux Kernel Mailing List and the Mandriva lists tend to be busy, as does
this one. It adds up to a lot of ham in a hurry for me. {^_-}) My FP and
FN levels are on the order of one per thousand. FNs come chiefly when a new
spam address appears and the spammer uses new techniques to hide the
spamminess. Until SURBL catches it I build a quick rule for it. My FP rate
lives between 0 and 10 per thousand, almost all from either patches or bug
reports on LKML or the occasional AOL email that comes through a relay
that is new and not yet in my test for legitimate AOL mail, the test I use
to catch bogus spam containing AOL addresses.

What I do about the FPs is simple. I sort all the spam into a spam folder.
Then I sort by subject. Since the subject markup gives a three digit
score it is easy for me to look at the first dozen or so entries to see
if any of the low scoring spam was ham. Above about "12" here is never
never land - I never see ham up there. (Or if it's ham I don't want to
see it. {^_-})
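
The same review can be sketched in a few lines for anyone who would rather
script it than sort in a mail client. The subject format is an assumption:
the sketch expects a tag such as "*****SPAM***** 012.3" in front of the
original subject, which may not match your markup, and the sample subjects
are invented. The cutoff of 12 is the "never never land" figure above.

```python
import re

# Assumed markup: a starred SPAM tag followed by a zero-padded score.
SCORE_RE = re.compile(r"^\*+SPAM\*+\s+(\d{1,3}(?:\.\d+)?)")


def subject_score(subject):
    """Return the score embedded in a tagged subject, or None if untagged."""
    match = SCORE_RE.match(subject)
    return float(match.group(1)) if match else None


def review_queue(subjects, never_ham_above=12.0):
    """List tagged mail scoring below the cutoff, lowest score first."""
    scored = ((subject_score(s), s) for s in subjects)
    return sorted(
        (score, s)
        for score, s in scored
        if score is not None and score < never_ham_above
    )


# Invented examples of what a spam folder might hold.
spam_folder = [
    "*****SPAM***** 045.0 cheap pills",
    "*****SPAM***** 003.1 [PATCH] fix oops in driver",
    "*****SPAM***** 011.9 bug report: boot hang",
    "an untagged subject",
]

for score, subject in review_queue(spam_folder):
    print(score, subject)
```

Only the low scorers come back, lowest first, so the likely false positives
are the first things you look at.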
