From: "Chavdar Videff" <[EMAIL PROTECTED]>
> Dear List,
>
> I know these are subject of the FAQ and the documentation, yet after I read
> all of it I didn't get an answer to the following questions:
>
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0
> points and gets through (even if we set required hits to 3 and 2 for certain
> mailboxes).
>
> 2. Mail composed as HTML is rated as spam for the above reason.
>
> What can we do to improve the situation and boost the performance of SA.
>
> I assume that if we set required hits below 5.0, ham messages composed as HTML
> will be rated as spam. However, the overwhelming number of spam rated below
> 4, 3, 2 and even 1 points that we receive renders spamassassin useless for
> our mail-server.
>
> We sort ham and spam and run sa-learn daily in order to train SA, we feed the
> low-rated spam and ham that is not rated correctly to sa-learn without any
> success: most messages (that are repeated) continue to go through.
>
> Please help.
>
> Why doesn't sa-learn help. We thought that if we submit to sa-learn a messages
> that was mistaken, the next time a message that is the same or from the same
> address will be sorted correctly.
>
>
> Following is the configuration file (debian sid, sendmail, sitewide
> configuration of SA).
>
> mail1:/home/chavdar# cat /etc/mail/spamassassin/local.cf
> # This is the right place to customize your installation of SpamAssassin.
> #
> # See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
> # tweaked.
> #
> ###########################################################################
> #
> # rewrite_header Subject *****SPAM*****
> # report_safe 1
> # trusted_networks 10.50
> # lock_method flock
>
> required_hits 3
> rewrite_subject 1
> report_header 1
> use_terse_report 1
> defang_mime 0
> report_safe 0
> use_bayes 1
> auto_learn 1
  ^^^^^^^^^^
IMAO that is an utterly darned fool thing to use when coming up from a cold SpamAssassin start. I've found that a raw SpamAssassin install is wretched at filtering spam. Using autolearn at that stage leaves the Bayes filter very poorly trained.

There, I got that off my chest. TPTB designing SpamAssassin disagree with me, obviously. My opinion comes from watching this list for a couple of years or so now. Either auto_learn needs to default to off or the spam/ham autolearn thresholds need to be dramatically changed. (I also note that a fluke in the scores configuration resulted in BAYES_99 having an absurdly low score in spite of its being a nearly perfect spam sign on a well-trained database. I take that as an indication that a poorly trained "autolearned" database was in use when they set up their entire scoring set.)

My personal suggestions follow.

1) Nuke your current Bayes.

2) Install carefully selected SARE rule sets. Review and update your SARE rules regularly. Update weekly or more often. Review your rule sets against the current offerings at least once a month.

3) Turn off auto_learn or move the ham and spam thresholds for autolearn farther away from your spam threshold. (I just fail to see the logic of autolearn. I consider it to be a tool for killing Bayes.)

4) Grit your teeth and use SURBL. (I don't like many blacklist policies. SURBL is quite honorable about theirs.)

5) Manually train on ham and spam per user with per-user Bayes. (Shared Bayes is often less than useless. One person's desired porno mail is another person's extreme spam.)

6) NEVER delete spam. Forward it to the user marked as spam so that it can be eliminated after their review. An ISP should include in the default email setup a spam folder with a rule that places the spam into that folder. Explain to users why you did this and how to change the rule from filing spam in that folder to simply deleting it.

7) If you later reenable autolearn, do so with extended thresholds.
8) If you manually train, do it rigorously for the first few weeks. After that, train Bayes on spam only when you happen to notice a low-scoring spam that did not hit BAYES_99 and includes more than one or two lines of text.

9) Save all the ham and spam you used for training in case you have to rebuild the Bayes in the future. It saves time.

10) If you're in a multi-user environment, make it as easy as possible for users to move an email from the incoming area to the spam or ham folders you provide for them. Then have a script that performs automatic training at sane intervals. As part of training, divert the spam, at least, to a spam database that can be used to retrain Bayes as quickly as possible in case of a glitch.

This is the road to very low false negatives and false positives. I tend to get 700 to 1300 emails per day. Of that, about 1/3 is spam. (Hey, the Linux Kernel Mailing List and the Mandriva lists tend to be busy, as does this one. It adds up to a lot of ham in a hurry for me. {^_-}) My FP and FN levels are on the order of one per thousand. FNs happen chiefly when a new spam address appears and the spammer uses new techniques to hide the spamminess. Until SURBL catches it, I build a quick rule for it. My FP rate lives between 0 and 10 per thousand, almost all from either patches or bug reports on LKML or the occasional AOL email sent through a relay that is new and not yet covered by the check for legitimate AOL mail, which exists to catch spam carrying bogus AOL addresses.

What I do about the FPs is simple. I sort all the spam into a spam folder. Then I sort by subject. Since the subject markup gives a three-digit score, it is easy for me to look at the first dozen or so entries to see if any of the low-scoring spam was ham. Above about "12" here is never never land - I never see ham up there. (Or if it's ham I don't want to see it. {^_-})
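
For anyone who would rather script that review step than sort in a mail client, here is a rough Python sketch. It is not what I run. It assumes the score is read from the X-Spam-Status header ("hits=" in SpamAssassin 2.x, "score=" in 3.x) instead of the subject markup, whose exact format depends on the site configuration. The function names and the default cutoff of a dozen messages are made up for illustration.

```python
import re
from email import message_from_string

# Assumption: the score appears in X-Spam-Status as "hits=7.3" (2.x)
# or "score=7.3" (3.x). Adjust the pattern if your headers differ.
SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the score from X-Spam-Status, or None if it cannot be found."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = SCORE_RE.search(status)
    return float(match.group(1)) if match else None


def lowest_scoring(raw_messages, limit=12):
    """Return (score, subject) pairs for the `limit` lowest-scoring messages.

    These are the ones worth a human glance for false positives.
    Messages without a parsable score are skipped.
    """
    scored = []
    for raw in raw_messages:
        score = spam_score(raw)
        if score is None:
            continue
        subject = message_from_string(raw).get("Subject", "(no subject)")
        scored.append((score, subject))
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]
```

Feed it the raw text of each message in the spam folder (Python's standard mailbox module can read mbox and Maildir folders) and eyeball the subjects that come back, lowest score first.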