Meyer, Tony wrote:
>>>If so, then this should happen - if you train the message twice,
>>>then all the tokens for the message will be incremented twice and
>>>the total count should be incremented twice.
>>
>>If so then how can one tell how many *distinct* messages are
>>actually trained? It may be a little confusing if people try
>>to use this information to follow the recommendation of
>>"number of ham and spam of equal order".
>
> SpamBayes doesn't care whether the ham or spam you train on are distinct
> or not. It's the total number of messages, not distinct messages, that
> counts. If you train on 500 copies of the same 2 ham and spam messages,
> then the math will work fine (but of course, it'll only be any good at
> recognising those two messages).
>
> =Tony.Meyer
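To check my reading of the counting described above, here is a minimal sketch. It assumes a toy whitespace tokenizer and a hypothetical ToyCounts class; neither is actual SpamBayes code, and the real tokenizer and database are certainly more involved.

```python
from collections import Counter


class ToyCounts:
    """Hypothetical stand-in for a trained database (not a SpamBayes class)."""

    def __init__(self):
        self.nham = 0                 # total ham training events
        self.nspam = 0                # total spam training events
        self.ham_tokens = Counter()   # token -> number of ham trainings containing it
        self.spam_tokens = Counter()  # token -> number of spam trainings containing it

    def train(self, text, is_spam):
        # Assumption: each distinct token counts once per training call.
        tokens = set(text.lower().split())
        if is_spam:
            self.nspam += 1
            self.spam_tokens.update(tokens)
        else:
            self.nham += 1
            self.ham_tokens.update(tokens)


db = ToyCounts()

# Train the same spam message twice and one ham message once.
for _ in range(2):
    db.train("cheap pills now", is_spam=True)
db.train("meeting notes attached", is_spam=False)

print(db.nspam, db.nham)        # message totals count training events, not distinct messages
print(db.spam_tokens["cheap"])  # the token count rises once per training of the same message
```

In this sketch nothing records whether two trained messages were identical, so the totals alone cannot tell you how many distinct messages were trained.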
Now I begin to understand. Your example translates to the highly unlikely situation where the user receives only the same one ham and one spam message all the time. To go further by paraphrasing your example, accidentally training on 500 copies of 1 spam and 500 (roughly distinct) ordinary ham mails will result in highly unbalanced "knowledge" of the mails that one receives. (Of course, in real situations, you do get repeated mails.) Even so, it is useful to know whether you have got into this kind of extreme situation when you train.

On the other hand, you can tweak/weight the filter by training either ham or spam a couple of times more.

Thanks very much again. I still have much to learn.

Regards,
ST

--
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
