Q1: I have a cron job that runs sb_imapfilter.py to train periodically from my ham/spam corpus folders.
AFAICT, that will train only as-yet-untrained messages. I know there's supposed to be something about keeping ham and spam balanced. If I start out with 1000 messages in each folder, then dump 10 into just the ham folder, the next training run will train 10 hams and no spams. Is that very bad for future performance, or is that temporary imbalance strongly mitigated by the overall size of the two folders?

Q2: I notice that the incremental training of sb_imapfilter trains all (as-yet-untrained) hams, then all (as-yet-untrained) spams. However, Skip's train-to-exhaustion script tries to interleave training of hams and spams. Is that interleaving only important for train-to-exhaustion, or should all methods use it?

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com

_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html