Skip Montanaro wrote:
> When we tell people not to let their ham/spam imbalance get too bad,
> we are referring to the number of messages trained.  There is another
> way to look at this imbalance though: number of tokens generated from
> each stream.  For me, ham messages are much larger on average than
> spam messages. Consequently, for roughly the same number of tokens to
> come from each stream, I need more spams than hams.  Is there some
> way to tell how this might affect scoring?  Is it relevant to the
> scoring? 

Mathematically, the total number of tokens should have no effect on the
per-token probabilities.  We count a token only once per message, and we
divide the number of messages that contained the token by the total number
of messages trained.  The total number of tokens never figures into the
calculation at all.
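
To make that concrete, here is a rough sketch of the counting.  It is not
the actual SpamBayes code: the function names are made up for illustration,
and I am assuming each count is divided by the number of messages trained
for its own class (ham or spam) and that the two ratios are then combined
into one probability, with none of the adjustments the real classifier
applies.

```python
from collections import Counter

def train(messages):
    """Count, for each token, how many messages contain it at least once."""
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))  # set(): a token counts once per message
    return counts

def token_spam_prob(token, ham_counts, spam_counts, nham, nspam):
    """Simplified spam probability for one token (assumed formula)."""
    ham_ratio = ham_counts[token] / nham      # fraction of hams with token
    spam_ratio = spam_counts[token] / nspam   # fraction of spams with token
    total = ham_ratio + spam_ratio
    if total == 0:
        return 0.5  # never-seen token: treat as neutral
    return spam_ratio / total

hams = [["meeting", "agenda", "meeting", "python"], ["python", "patch"]]
spams = [["viagra", "python"], ["viagra"]]

# Same hams, each padded with 100 extra tokens, so the ham stream is much
# larger in tokens but identical in number of messages.
long_hams = [h + ["filler%d" % i for i in range(100)] for h in hams]

spam_counts = train(spams)
p_short = token_spam_prob("python", train(hams), spam_counts,
                          len(hams), len(spams))
p_long = token_spam_prob("python", train(long_hams), spam_counts,
                         len(long_hams), len(spams))
```

Both p_short and p_long come out the same, because only message counts
enter the division.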

It would be interesting to know, though, whether this type of imbalance
might skew the selection of the significant tokens that figure into the
calculation of the final score.  If the training data contains significantly
more ham tokens, is it more likely that the 150 significant tokens chosen
will also include a higher percentage of ham tokens?
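
One way to check would be to measure it directly.  The sketch below is only
an illustration under stated assumptions: I am assuming the significant
tokens are the ones whose probability lies farthest from a neutral 0.5, and
the helper names and the example probabilities are invented for the
purpose.

```python
def significant_tokens(probs, max_tokens=150):
    """Return (token, prob) pairs farthest from the neutral value 0.5."""
    ranked = sorted(probs.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:max_tokens]

def ham_fraction(selected):
    """Fraction of the selected tokens that lean toward ham (prob < 0.5)."""
    if not selected:
        return 0.0
    ham_leaning = sum(1 for _, prob in selected if prob < 0.5)
    return ham_leaning / len(selected)

# Invented per-token probabilities for a single message.
example = {"agenda": 0.02, "patch": 0.10, "python": 0.33,
           "the": 0.50, "viagra": 0.99}

top3 = significant_tokens(example, max_tokens=3)
share = ham_fraction(top3)
```

Running this over a test corpus for classifiers trained with different
ham/spam token balances, and comparing the resulting ham fractions, would
show whether the selection is actually skewed.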

-- 
Kenny Pitt

_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
