On Thursday, February 02, 2006 10:35 PM -0600, Bob Posert wrote:

> Back in
> http://mail.python.org/pipermail/spambayes/2006-January/018702.html
> , Tim Peters and I had a dialog about training on unusual ham -
> monthly messages from http://www.boldtype.com. I just got another
> one and it scored 50% on the spam scale. The clues follow - I'd
> really appreciate any help. Thanks, Bob
>
> Combined Score: 50% (0.5)
> Internal ham score (*H*): 1
> Internal spam score (*S*): 1
>
> # ham trained on: 1229
> # spam trained on: 20331
>
> 150 Significant Tokens

I couldn't help but notice that the ratio of trained spam to trained ham is very high (roughly 16.5:1). While the statistics _should_ still work properly in these cases, a number of people have observed difficulties when the numbers of trained ham and spam are very different. I don't think anyone has a good explanation as to why, nor is there any guaranteed "safe" ratio. As a start, I'd suggest no more than 2:1 in either direction, with maybe 5:1 as an outer bound, but that's just a SWAG (sophisticated wild-ass guess).

To test this, you'd have to retrain, unfortunately. Save your current databases first, so you can revert if you don't like the results.

--
Seth Goodman
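
For what it's worth, the arithmetic behind those numbers can be sketched in a few lines of Python. The counts come from the clues quoted above. The helper names (`spam_to_ham_ratio`, `max_spam_for_ratio`) are made up for illustration and are not part of SpamBayes, and the 2:1 and 5:1 targets are only the rough guesses mentioned above, not tested thresholds.

```python
# Training counts taken from the clues quoted in the original message.
ham_trained = 1229
spam_trained = 20331


def spam_to_ham_ratio(n_spam, n_ham):
    """Return how many spam messages were trained per ham message."""
    return n_spam / n_ham


def max_spam_for_ratio(n_ham, target_ratio):
    """Largest spam count that keeps spam:ham at or below target_ratio."""
    return int(n_ham * target_ratio)


current = spam_to_ham_ratio(spam_trained, ham_trained)
print(f"current spam:ham ratio is about {current:.1f}:1")

# Hypothetical targets: the 2:1 starting point and the 5:1 outer bound.
for target in (2, 5):
    keep = max_spam_for_ratio(ham_trained, target)
    excess = spam_trained - keep
    print(f"{target}:1 means at most {keep} spam ({excess} fewer than now)")
```

This only shows how far the current training set is from those ratios. Training on more ham would narrow the gap just as well as training on less spam when you retrain.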
