Tim Peters wrote:
> [Brendon Whateley]
>> I've just started using spambayes again after a while away from it.
>> Now, 3 days in, I notice that I've trained on far more spam than ham
>> (total emails trained: Spam: 432, Ham: 64). I seem to remember that
>> this was my experience in the past as well.
>>
>> My question is: has anybody really tested the assertion that leads to
>> the message "Warning: you have much more spam than ham - SpamBayes
>> works best with approximately even numbers of ham and spam."?
>
> Yes, but by the time you and Tony wrote your paper, serious
> multi-corpus testing had long since essentially stopped. The results
> with large imbalances were so dramatically worse that I introduced the
> infamous "experimental ham spam imbalance adjustment" switch, which
> tried to stop "the math" from drawing absurdly confident conclusions
> from wildly unbalanced data (see the thread Mark pointed out). The
> results of that were a mixed bag, helping some people a little but
> hurting others more, so we dropped it.

Yes, I remember that. I can also guess why serious multi-corpus testing
stopped... as I recall, putting the corpora together is not for the
faint of heart :)

> As I'm sure one of the text files in the project says, /all/ decisions
> "should be" reevaluated periodically. Alas, a one-corpus test is
> essentially useless, and it was hard even some years ago to arrange
> for multi-corpus tests.

In the worst case, I can satisfy my own curiosity and possibly provide
some insight. I may be able to gather several different corpora for
testing. How many separate corpora would you consider a valid test?

> When the original testing was done, almost all spam was text-heavy,
> meaning lots of tokens were generated. The paucity of tokens
> generated for more recent image-based spam, and spam hiding in
> attachments, makes SB's basic /approach/ less useful for that kind of
> spam. No real idea how imbalance affects scoring spam of that kind.

That is the thinking that led to my question about the imbalance effect.
Perhaps some method of generating tokens from images would restore order
to our world.

> The only thing I've done in response to it is lower my "spam
> threshold", down to 70 now, with ham at 5. My unsure rate is about
> 6%, most of which are spam. Every now and again I add the 10 most
> recent ham to my ham training data, but even so I've got about a 3:1
> spam:ham training ratio. I do expect my stats would improve if I
> added more ham (I'm one of the ones the old imbalance option helped),
> but I spend so little time looking at unsures it's just not worth even
> tiny efforts to improve it.

At the very least I can test your approach against what I've been doing,
which is to let the imbalance grow until some ham gets pulled into
unsure, then train on that unsure ham and continue. That answer may be
of some help to those who find their training leads to large imbalances.
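For my own reference before I start testing, here is a rough sketch of
the Robinson-style per-word probability as I understand it, just to show
where the ham/spam training totals enter the score. This is from memory,
not lifted from classifier.py, so the function name and the 0.45/0.5
defaults (my recollection of unknown_word_strength/unknown_word_prob)
are assumptions on my part:

    # Sketch of the per-word spam probability, from memory.
    def word_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
        """Score a token seen hamcount times in nham trained ham and
        spamcount times in nspam trained spam."""
        hamratio = hamcount / float(nham or 1)
        spamratio = spamcount / float(nspam or 1)
        prob = spamratio / (hamratio + spamratio)
        # Smooth toward the unknown-word prior x with strength s, so
        # rarely seen tokens don't get extreme scores.
        n = hamcount + spamcount
        return (s * x + n * prob) / (s + n)

    if __name__ == "__main__":
        # Same raw token counts, balanced vs. my 432:64 imbalance.
        print(word_spamprob(hamcount=2, spamcount=2, nham=250, nspam=250))
        print(word_spamprob(hamcount=2, spamcount=2, nham=64, nspam=432))

If that sketch is roughly right, the same raw token counts drift toward
the ham end as the spam side of the training data grows, which at least
fits the wording of the warning message; whether that helps or hurts
real-world scoring is exactly what I'd like to measure.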
When I get back, I'll start playing with this and see if anything useful
develops.

Brendon.