on Fri Apr 27 2007, David Abrahams <dave-AT-boost-consulting.com> wrote:
> OK, this is really weird. I have reasonably balanced spam and ham
> folders (within about 15 messages of one another). I just used tte.py
> to train them, without any ratio option, so it should have actually
> been a balanced number of messages. Yet, when I run sb_imapfilter.py,
> I see:
>
> $ sb_imapfilter.py -v
> Loading state from /home/dave/spambayes/hammie.fs database
>
> /home/dave/spambayes/hammie.fs is an existing database,
> with 282 spam and 76 ham
>      ^^^^^^^^^^^^^^^^^^^
>
> What do those ham/spam numbers really mean?

I have at least part of an answer:

  $ tte.py ...
  Loading state from /home/dave/spambayes/hammie.new.fs database
  /home/dave/spambayes/hammie.new.fs is a new database
  round:  1, msgs:  822, ham misses:  68, spam misses: 222, 73.4s
  round:  2, msgs:  822, ham misses:   8, spam misses:  56, 24.5s
  round:  3, msgs:  822, ham misses:   0, spam misses:   4, 20.6s
  round:  4, msgs:  822, ham misses:   0, spam misses:   0, 19.7s
  ************************************16 untrained spams

  68+8 = 76
  222+56+4 = 282

So, somehow, the number of hams or spams "in the database" really has
to do with the number that are found to be misclassified and thus
influence the training data? It's hard to understand the importance
of keeping ham and spam balanced if one or the other can ultimately
influence training so much more than the other.

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com
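As a sanity check on the arithmetic above, here is a minimal Python sketch that sums the per-round miss counts and compares them with the totals sb_imapfilter.py reported. The miss counts are copied from the tte.py output; the `rounds` list and the `total_misses` helper are names made up for illustration and are not part of SpamBayes. It only checks the observation that the database counts equal the accumulated misses; it does not reproduce what tte.py does internally.

```python
# Per-round miss counts, copied from the tte.py output in this message.
# The data structure and helper below are illustrative, not SpamBayes code.
rounds = [
    {"ham_misses": 68, "spam_misses": 222},
    {"ham_misses": 8, "spam_misses": 56},
    {"ham_misses": 0, "spam_misses": 4},
    {"ham_misses": 0, "spam_misses": 0},
]


def total_misses(rounds, key):
    """Sum one kind of miss (ham or spam) over all training rounds."""
    return sum(r[key] for r in rounds)


ham_trained = total_misses(rounds, "ham_misses")
spam_trained = total_misses(rounds, "spam_misses")

# Totals reported by sb_imapfilter.py for the earlier database.
reported_ham, reported_spam = 76, 282

print("ham:  summed misses =", ham_trained, "reported =", reported_ham)
print("spam: summed misses =", spam_trained, "reported =", reported_spam)
```

If the hypothesis is right, each pair of numbers matches, which would mean the "N spam and M ham" line counts training events rather than messages in the folders.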
