[Brendon Whateley]
> I've just started using spambayes again after a while away from it.
> Now, 3 days in, I notice that I've trained on far more spam than ham.
> (Total emails trained: Spam: 432 Ham: 64) I seem to remember that
> this was previously my experience in the past.
>
> My question is; has anybody really tested the assertion that leads to
> the message: "Warning: you have much more spam than ham - SpamBayes
> works best with approximately even numbers of ham and spam."?

Yes, but by the time you and Tony wrote your paper, serious multi-corpus testing had long since essentially stopped. The results with large imbalances were so dramatically worse that I introduced the infamous "experimental ham spam imbalance adjustment" switch, which tried to stop "the math" from drawing absurdly confident conclusions from wildly unbalanced data (see the thread Mark pointed out). The results of that were a mixed bag, helping some people a little but hurting others more, so we dropped it.

As I'm sure one of the text files in the project says, /all/ decisions "should be" reevaluated periodically. Alas, a one-corpus test is essentially useless, and it was hard even some years ago to arrange for multi-corpus tests.

When the original testing was done, almost all spam was text-heavy, meaning lots of tokens were generated. The paucity of tokens generated for more recent image-based spam, and spam hiding in attachments, makes SB's basic /approach/ less useful for that kind of spam. No real idea how imbalance affects scoring spam of that kind.

The only thing I've done in response to it is lower my "spam threshold", down to 70 now, with ham at 5. My unsure rate is about 6%, most of which are spam. Every now and again I add the 10 most recent ham to my ham training data, but even so I've got about a 3:1 spam:ham training ratio. I do expect my stats would improve if I added more ham (I'm one of the ones the old imbalance option helped), but I spend so little time looking at unsures it's just not worth even tiny efforts to improve it.
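
To make the point about "absurdly confident conclusions from wildly unbalanced data" concrete, here is a minimal sketch of a Robinson-style per-token spam probability. It is an illustration under stated assumptions, not the SpamBayes source: the function name `token_spamprob` is hypothetical, and the formula and the constants `s = 0.45` and `x = 0.5` are assumed here rather than taken from the project. It compares one token, seen in exactly one ham and one spam, under balanced training and under the 432:64 split quoted above.

```python
def token_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a single token looks (hypothetical helper).

    spam_count, ham_count: number of trained spam/ham messages containing the token
    nspam, nham:           total number of trained spam/ham messages
    s, x:                  assumed prior strength and prior probability for
                           rarely seen tokens (assumptions, see the lead-in)
    """
    if nspam <= 0 or nham <= 0:
        raise ValueError("need at least one trained message of each kind")

    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: fall back to the prior.
        return x

    # Counts are normalised by corpus size, so the same raw count means
    # very different things when one corpus is much smaller than the other.
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)

    # Pull the raw estimate toward the prior x; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)


if __name__ == "__main__":
    balanced = token_spamprob(1, 1, nspam=100, nham=100)
    skewed = token_spamprob(1, 1, nspam=432, nham=64)
    print(f"balanced training: {balanced:.3f}")
    print(f"432 spam, 64 ham:  {skewed:.3f}")
```

Under these assumptions the balanced case comes out neutral at 0.5, while the skewed case comes out near 0.2: a token with identical raw counts in both classes now reads as fairly strong ham evidence, purely because the ham corpus is so much smaller. Real messages combine many such tokens, so the net effect on a given mailbox can go either way, which is consistent with the mixed results described above.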