Re: bayes learn best practice

John Hardin Thu, 09 Apr 2009 08:05:42 -0700

On Thu, 9 Apr 2009, Arthur Kerpician wrote:

I tried to manually keep both spam and ham at the same level in the bayes db but it seems that spamassassin is learning spam twice as fast as ham.

Not surprising, as raw email traffic has a very skewed spam:ham ratio. Surely you've heard the stats that "90% of all email is spam"?

The docs mention that after 5000 spam and ham learned, spamassassin doesn't improve spam detection much. What is the best practice to optimize the bayes detection? Should I stop auto-learning after reaching the 5000 mark and then re-train from time to time from scratch?

I'll let others comment on issues like disk space and scan time w/r/t bayes database size. For myself, I have a _very_ small userbase and do purely manual training with a small corpus. I have under 3000 tokens total and get good results.
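
To see how many spam, ham and tokens your own database holds, "sa-learn --dump magic" prints the counters. Here is a rough Python sketch that pulls the three numbers out of that output. The column layout in SAMPLE_DUMP is an assumption about what the command prints, the numbers in it are made up for illustration, and parse_magic is a hypothetical helper name, so check it against your own output before relying on it.

```python
import re

# Made-up sample in the layout "sa-learn --dump magic" is assumed to use.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       1200          0  non-token data: nspam
0.000          0        600          0  non-token data: nham
0.000          0       2950          0  non-token data: ntokens
"""

# Third column is assumed to carry the count for each non-token entry.
MAGIC_LINE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham|ntokens)\s*$"
)


def parse_magic(dump_text):
    """Return a dict with the nspam, nham and ntokens counters found in dump_text."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_magic(SAMPLE_DUMP)
# Spam-to-ham ratio of learned messages; a lopsided value is normal for raw traffic.
ratio = counts["nspam"] / counts["nham"]
print(counts, ratio)
```

To run it against a live database, feed it the stdout of subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True) instead of the sample.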

Build good representative ham and spam corpora, and train any misses (FPs and FNs) going forward. Retain those messages. Unfortunately autolearn doesn't let you retain those messages.
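
One way to "train and retain" is to file each miss into a corpus directory before learning it. The Python sketch below assumes a layout of corpus_root/spam and corpus_root/ham, which is my own choice and not anything SpamAssassin requires; file_and_learn_command is a hypothetical helper. It copies the message into the corpus and returns the sa-learn command to run, without running it.

```python
import shutil
from pathlib import Path


def file_and_learn_command(message_path, corpus_root, kind):
    """Copy a misclassified message into the corpus and build its sa-learn command.

    kind is "spam" for a false negative, "ham" for a false positive.
    The corpus_root/spam and corpus_root/ham layout is an assumption.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    dest_dir = Path(corpus_root) / kind
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(message_path).name
    # Keep the message so the database can be rebuilt from the corpus later.
    shutil.copy2(message_path, dest)
    # sa-learn takes --spam or --ham followed by the file to learn from.
    return ["sa-learn", f"--{kind}", str(dest)]
```

Pass the returned list to subprocess.run(cmd, check=True) as the user who owns the bayes database.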

Retraining from scratch is only really necessary if things have gone completely out of whack, and at that point you review your corpora carefully for misclassified messages, wipe and retrain. Bayes should only go bonkers if you have people manually training messages incorrectly, or (not too likely) if autolearn has taken a slightly-poor configuration and magnified the errors.
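
If it does come to a wipe and retrain, the sequence is short once the corpora have been reviewed. A minimal Python sketch, again assuming the corpus_root/spam and corpus_root/ham layout (my assumption) and using hypothetical helper names; it only builds the commands unless you call wipe_and_retrain yourself.

```python
import subprocess
from pathlib import Path


def retrain_commands(corpus_root):
    """Return the sa-learn commands for a wipe followed by a full retrain."""
    root = Path(corpus_root)
    return [
        ["sa-learn", "--clear"],                    # wipe the existing bayes database
        ["sa-learn", "--spam", str(root / "spam")], # relearn the reviewed spam corpus
        ["sa-learn", "--ham", str(root / "ham")],   # relearn the reviewed ham corpus
    ]


def wipe_and_retrain(corpus_root):
    """Run the commands in order, stopping at the first failure."""
    for cmd in retrain_commands(corpus_root):
        subprocess.run(cmd, check=True)
```

Review the corpus directories for misfiled messages before calling wipe_and_retrain, since the wipe cannot be undone without a backup.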


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Gun Control enables genocide while doing little to reduce crime.
-----------------------------------------------------------------------
 4 days until Thomas Jefferson's 266th Birthday
