Re: spam samples for bayes training

Christopher Bort Thu, 31 May 2007 14:05:41 -0700

On 05/31/07 11:31, [EMAIL PROTECTED] (Lewis Butler) wrote:

On 30-May-2007, at 13:16, Christopher Bort wrote:
It is highly recommended that you train your Bayes database only with messages that have actually been received at your own installation. Using someone else's spam and ham is likely to skew your database and result in inaccuracies. In other words, SpamAssassin needs to know what _you_ see as spam and ham, not what someone else sees.
Spam AND ham, yes, but there is nothing wrong with using someone else's spam archive to help train up your bayes. While definitions of ham vary widely, the same cannot be said for spam.

Sure it can. I've seen a lot of different definitions of 'spam,' 'UCE,' 'UBE,' etc. They don't all agree on everything, starting with what to call it. Even within a given definition, what is or is not spam can be subjective. As a case in point, I work for a company that publishes real estate advertising magazines. Our advertisers routinely send us ad copy by e-mail. In many contexts, much of it would look quite spammy. In our context it is not and I need to make certain that such messages are not classified as spam. Also, our sales reps get a lot of e-mail advertisements and 'newsletters' all about the latest greatest things happening in real estate, mortgages and generally bilking people out of as much money as possible for houses that are several times bigger than what they need. To me, a great deal of it is absolutely spam. To our sales people who are receiving it, it's keeping up with trends in the market to which they sell (i.e. realtors). As much as the stuff might turn my stomach, a lot of it I cannot have SA classifying as spam.

The point is still that the spam you train SpamAssassin with must reflect the nature of spam that your installation receives. Can you get away with using someone else's 'generic' spam archive for training? Sure, but you would be well advised to first review it thoroughly to make sure that it fairly closely matches the spam you need to catch on your own server(s).

Another thing that has been implied in this thread, but not stated explicitly, is that Bayes training needs to continue on a regular basis beyond the initial training period. The mix of spam and ham that hits a given server now may not be exactly the same as what that server will see six months from now, so training needs to keep up with what is currently being received.
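
As a rough sketch of what regular retraining could look like, here is a small Python wrapper around sa-learn that might be run from cron. It assumes hand-sorted spam and ham from your own server are kept in two local directories; the paths and function names below are hypothetical, and only the sa-learn --spam and --ham options are taken from SpamAssassin itself.

```python
import subprocess
from pathlib import Path

# Hypothetical locations for hand-sorted mail received on this server.
SPAM_DIR = Path("/var/spool/sa-train/spam")
HAM_DIR = Path("/var/spool/sa-train/ham")


def build_learn_commands(spam_dir, ham_dir):
    """Return the sa-learn command lines for one training pass."""
    return [
        ["sa-learn", "--spam", str(spam_dir)],
        ["sa-learn", "--ham", str(ham_dir)],
    ]


def retrain(spam_dir=SPAM_DIR, ham_dir=HAM_DIR, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_learn_commands(spam_dir, ham_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if sa-learn fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing touches the Bayes database by accident.
    retrain(dry_run=True)
```

Training on both directories in the same pass keeps spam and ham current together, which is the point of ongoing training; how often to run it depends on how quickly your mail mix changes.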

The best thing for accuracy is keeping SpamAssassin up to date (I am right now updating to 3.2), keeping its rules updated, and running some of the other rules sets. I use RulesDuJour myself, but read up on the wiki:
<http://wiki.apache.org/spamassassin/CustomRulesets>
<http://www.exit0.us/index.php?pagename=RulesDuJour>

Yes, I recommended using additional rulesets and RulesDuJour in my previous post, as well as regular use of sa-update to keep SA's default rules up to date. Good point about keeping SA itself up to date. Subscribing to the SpamAssassin-Announce list is a good way to know when updates are available.

- Don't necessarily just accept the default spam threshold of 5.0. If you're getting too many false results, adjust the threshold in small increments and wait a while between adjustments to make sure that you don't get more false results than you can tolerate.
I have to disagree. I never adjust my threshold. I throw things at bayes until they 'stick' and I may, on rare occasions, adjust the score for a rule.

Note that I said 'don't necessarily.' By that I meant that 5.0 may work for a lot of installations, but it is something to keep track of because adjusting it might be beneficial. This just points out once again that everyone's installation is different enough that there's no one-size-fits-all strategy. When I first installed SpamAssassin, before I had a well-trained Bayes database and before adding any custom rulesets, I got a lot of false positives with the threshold at 5.0 points, at least partly due to the circumstances described above. I raised the threshold to, IIRC, 7.5 or so to eliminate the false positives. Of course, that introduced a lot of false negatives. As I trained my Bayes database so that it became more finely tuned to my server's view of both spam and ham, and added a few rulesets, I gradually ratcheted the threshold downward. It's currently at 4.9. I get a fair amount of spam that scores around 5.0 or so, but only the occasional piece that scores at 4.9, so I've kept the threshold steady there for a while now.
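
To make the trade-off concrete, here is a minimal Python sketch that counts false positives and false negatives at a few candidate thresholds. The scores and labels are made up purely for illustration, the function name is hypothetical, and the sketch assumes a message is tagged as spam when its score is greater than or equal to the threshold.

```python
def count_errors(scored, threshold):
    """Count (false positives, false negatives) at a given threshold.

    scored is a list of (score, is_spam) pairs, where is_spam is the
    human judgement of the message for this particular installation.
    Assumption: a message is tagged as spam when score >= threshold.
    """
    false_pos = sum(1 for score, is_spam in scored
                    if score >= threshold and not is_spam)
    false_neg = sum(1 for score, is_spam in scored
                    if score < threshold and is_spam)
    return false_pos, false_neg


# Made-up sample: spammy-looking ham (e.g. ad copy) next to real spam.
SAMPLE = [
    (0.3, False), (1.2, False), (5.6, False), (6.8, False),
    (4.95, True), (5.1, True), (7.9, True), (12.3, True),
]

if __name__ == "__main__":
    for threshold in (7.5, 6.0, 5.0, 4.9):
        fp, fn = count_errors(SAMPLE, threshold)
        print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Run against a hand-classified sample of your own mail, a sweep like this shows whether a small change to the threshold (the required_score setting in the SpamAssassin configuration) buys fewer false negatives without more false positives than you can tolerate.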


--
Christopher Bort
Homes Magazine
email: <[EMAIL PROTECTED]>
website: <http://www.homesmagazine.com/>
FAX: 775-284-1298
Phone: 775-284-1294

Real Estate Advertising/ Web Products/ Digital Printing Services

Serving: Wine Country Napa & Sonoma County, Marin County, San Francisco Bay Area, Santa Cruz County, Monterey County, San Luis Obispo County & Santa Barbara County, Reno/Sparks & Carson Valley, North Lake Tahoe & Truckee & South Lake Tahoe

