[Spambayes] A Couple of Training Questions

David Abrahams Tue, 08 May 2007 08:15:01 -0700

Q1:

I have a cron job that runs sb_imapfilter.py to train periodically
from my ham/spam corpus folders.


AFAICT, each run trains only the messages that have not been trained
yet.  I know the ham and spam counts are supposed to be kept roughly
balanced.  If I start out with 1000 messages in each folder, then dump
10 into just the ham folder, the next training run will train 10 hams
and no spams.  Is that very bad for future performance, or is that
temporary imbalance strongly mitigated by the overall size of the two
folders?
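
To put numbers on that example, here is a minimal sketch in plain
Python.  The ham_spam_ratio helper is hypothetical and is not part of
SpamBayes; it only does the arithmetic on the trained-message counts
and says nothing about how the classifier actually reacts.

```python
def ham_spam_ratio(nham, nspam):
    """Return the ratio of trained hams to trained spams (hypothetical helper)."""
    if nspam == 0:
        raise ValueError("no spam trained yet; ratio is undefined")
    return nham / nspam

# Counts taken from the example in the question.
before = ham_spam_ratio(1000, 1000)       # both folders start with 1000
after = ham_spam_ratio(1000 + 10, 1000)   # 10 new hams, no new spams

print("ratio before: %.3f" % before)
print("ratio after:  %.3f" % after)
```

So the run itself is completely one-sided (10 hams, 0 spams), but the
overall counts shift by only about 1%.  Whether that is small enough
not to matter is exactly what I am asking.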

Q2:

I notice that the incremental training in sb_imapfilter trains all
(as-yet-untrained) hams, then all (as-yet-untrained) spams.  However,
Skip's train-to-exhaustion script tries to interleave the training of
hams and spams.  Is that interleaving important only for
train-to-exhaustion, or should all training methods use it?
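
To be clear about what I mean by interleaving, here is a rough sketch
in plain Python.  It is not Skip's actual code and not anything from
sb_imapfilter; the interleave function and the (is_spam, message)
pairs are assumptions made up for illustration.

```python
from itertools import zip_longest

def interleave(hams, spams):
    """Yield (is_spam, message) pairs alternating ham, spam, ham, spam, ...

    When one sequence runs out, the rest of the other follows in order.
    """
    missing = object()  # sentinel marking the exhausted side
    for ham, spam in zip_longest(hams, spams, fillvalue=missing):
        if ham is not missing:
            yield False, ham
        if spam is not missing:
            yield True, spam

# Batch order, as I understand the current behaviour: all hams, then all spams.
hams = ["h1", "h2", "h3"]
spams = ["s1"]
batch_order = [(False, h) for h in hams] + [(True, s) for s in spams]

# Interleaved order over the same messages.
interleaved_order = list(interleave(hams, spams))
```

Both orders feed the same messages to the trainer; only the sequence
differs, which is why I wonder whether it matters outside
train-to-exhaustion.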

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com

_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
