On Thu, 9 Apr 2009, Arthur Kerpician wrote:

I tried to manually keep both spam and ham at the same level in the bayes db but it seems that spamassassin is learning spam twice as fast as ham.

Not surprising, as raw email traffic has a very skewed spam:ham ratio. Surely you've heard the stats that "90% of all email is spam"?

The docs mention that after 5000 spam and ham learned, spamassassin doesn't improve spam detection much. What is the best practice to optimize the bayes detection? Should I stop auto-learning after reaching the 5000 mark and then re-train from scratch from time to time?

I'll let others comment on issues like disk space and scan time w/r/t bayes database size. For myself, I have a _very_ small userbase and do purely manual training with a small corpus. I have under 3000 tokens total and get good results.
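For anyone wanting to check how lopsided their own database is: "sa-learn --dump magic" reports the learned message and token counts. Below is a minimal sketch that pulls nspam, nham and ntokens out of that report. The sample text is made up for illustration, and the column layout (count in the third column, label after "non-token data:") is an assumption based on typical output, so verify it against your own install.

```python
import re

# Illustrative sample only, not real data. The layout is assumed to match
# typical `sa-learn --dump magic` output.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       1200          0  non-token data: nspam
0.000          0        600          0  non-token data: nham
0.000          0       2900          0  non-token data: ntokens
"""

# Four whitespace-separated columns, then the "non-token data:" label.
MAGIC_LINE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+)$")


def parse_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    counts = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2).strip()] = int(match.group(1))
    return counts


def spam_ham_ratio(counts):
    """Learned spam messages per learned ham message."""
    nham = counts.get("nham", 0)
    if nham == 0:
        return float("inf")
    return counts.get("nspam", 0) / nham
```

To run it against a live database, feed parse_magic() the stdout of subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True) executed as the user that owns the Bayes database.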

Build good, representative ham and spam corpora, and train any misses (FPs and FNs) going forward. Retain those messages. Unfortunately, autolearn doesn't let you retain them.
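One way to keep the "train it and retain it" habit is to file each miss into the corpus before learning it. The sketch below assumes a hypothetical layout of corpus_root/ham and corpus_root/spam with one message per file; the function name and layout are illustrative, not anything SpamAssassin prescribes. It copies the message and returns the sa-learn command to run, rather than running it for you.

```python
import shutil
from pathlib import Path


def file_miss(message_path, true_class, corpus_root):
    """Copy a misclassified message into corpus_root/<true_class>/.

    true_class is what the message really is: "ham" for a false positive,
    "spam" for a false negative. Returns the retained copy's path and the
    sa-learn command that would train on it.
    """
    if true_class not in ("spam", "ham"):
        raise ValueError("true_class must be 'spam' or 'ham'")

    dest_dir = Path(corpus_root) / true_class
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Keep a copy so the message is still there for any later retrain.
    dest = dest_dir / Path(message_path).name
    shutil.copy2(message_path, dest)

    return dest, ["sa-learn", "--" + true_class, str(dest)]
```

The returned list can be passed to subprocess.run(cmd, check=True) once you are happy with it. Keeping the copy step separate from the learn step means the corpus stays the source of truth even if the learn fails.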

Retraining from scratch is only really necessary if things have gone completely out of whack; at that point, review your corpora carefully for misclassified messages, then wipe and retrain. Bayes should only go bonkers if people are manually training messages incorrectly, or (not too likely) if autolearn has taken a slightly poor configuration and magnified the errors.
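A wipe-and-retrain, once the corpora have been reviewed, comes down to three sa-learn calls: --clear, then --ham and --spam over the retained messages. Here is a rough sketch that only builds that command sequence. The corpus_root/ham and corpus_root/spam directory layout is an assumption for illustration, as are the function names.

```python
import subprocess
from pathlib import Path


def rebuild_plan(corpus_root):
    """Return the sa-learn commands for a wipe and retrain, in order."""
    root = Path(corpus_root)
    return [
        ["sa-learn", "--clear"],                      # wipe the Bayes database
        ["sa-learn", "--ham", str(root / "ham")],     # relearn retained ham
        ["sa-learn", "--spam", str(root / "spam")],   # relearn retained spam
    ]


def run_plan(plan, runner=subprocess.run):
    """Execute each command in order, stopping at the first failure."""
    for cmd in plan:
        runner(cmd, check=True)
```

Print the plan and read it before calling run_plan(), since --clear is not reversible. Run it as the same user whose Bayes database you mean to rebuild.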

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Gun Control enables genocide while doing little to reduce crime.
-----------------------------------------------------------------------
 4 days until Thomas Jefferson's 266th Birthday
