Skip Montanaro wrote:
> When we tell people not to let their ham/spam imbalance get too bad,
> we are referring to the number of messages trained. There is another
> way to look at this imbalance though: number of tokens generated from
> each stream. For me, ham messages are much larger on average than
> spam messages. Consequently, for roughly the same number of tokens to
> come from each stream, I need more spams than hams. Is there some
> way to tell how this might affect scoring? Is it relevant to the
> scoring?

Mathematically, the total number of tokens should have no effect on the probabilities. We count a token only once per message, and we divide the number of messages that contained the token by the total number of messages. The total number of tokens never figures into the calculation at all.

It would be interesting to know, though, whether this type of imbalance might skew the selection of the significant tokens that figure into the calculation of the final score. If there are significantly more ham tokens in the training data, is it more likely that the 150 significant tokens chosen will also include a higher percentage of ham tokens?

-- 
Kenny Pitt

_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
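
As a rough illustration of the counting argument in the reply above, here is a minimal Python sketch. It is not the SpamBayes code: the function names (train, token_spam_prob), the toy messages, and the 0.5 fallback for unseen tokens are all assumptions made for the example, and the Bayesian adjustment that a real classifier would apply to the raw ratio is left out. It only shows that when a token is counted once per message and divided by the number of messages, padding the ham messages with extra tokens leaves a token's probability unchanged.

```python
from collections import Counter


def train(messages):
    """Return (messages-containing-token counts, number of messages)."""
    counts = Counter()
    for tokens in messages:
        # set() ensures a token is counted at most once per message.
        counts.update(set(tokens))
    return counts, len(messages)


def token_spam_prob(token, ham_counts, nham, spam_counts, nspam):
    """Raw per-token spam probability from message-level ratios."""
    ham_ratio = ham_counts[token] / nham
    spam_ratio = spam_counts[token] / nspam
    total = ham_ratio + spam_ratio
    if total == 0:
        return 0.5  # assumed neutral value for a token never seen in training
    return spam_ratio / total


# Hypothetical toy data: the same hams, once short and once heavily padded.
short_ham = [["meeting", "lunch"], ["meeting", "report"]]
long_ham = [m + ["filler%d" % i for i in range(50)] + m for m in short_ham]
spam = [["viagra", "meeting"], ["viagra", "free"],
        ["free", "offer"], ["offer", "now"]]

spam_counts, nspam = train(spam)
p_short = token_spam_prob("meeting", *train(short_ham), spam_counts, nspam)
p_long = token_spam_prob("meeting", *train(long_ham), spam_counts, nspam)

# The padded hams produce many more tokens, yet the probability is identical.
print(p_short, p_long)
```

The sketch says nothing about the second question, namely whether a larger ham vocabulary changes which tokens end up among the most significant ones for a given message. That would have to be measured on real training data.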
