On Tue, 13 Feb 2018 21:02:46 +0000
Horváth Szabolcs wrote:
> One more question: is there a recommended ham to spam ratio? 1:1?
No, this is a myth. Bayes computes token probabilities from a token's
relative frequencies in spam and ham (its counts divided by the number
of spams and hams trained), so the ratio scales out. If you have
2000 ham and 200 spam, the problem is too few spams, not a bad ratio.
Theoretically there is a case for new training to match the ratio that's
already in the database, because then a new token gets a probability
that reflects its frequencies in recent mail. But I wouldn't worry about
that: it's hard to stick to, and the effect is probably minor.
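
To make the scaling point concrete, here is a rough Python sketch. It
uses the plain relative-frequency formula only, which is a
simplification and not the actual Bayes implementation; the function
name and all the counts are made up for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability (illustrative assumption).

    spam_hits / ham_hits: trained spams / hams that contain the token.
    n_spam / n_ham: total spams / hams trained.
    """
    spam_freq = spam_hits / n_spam  # fraction of spam containing the token
    ham_freq = ham_hits / n_ham     # fraction of ham containing the token
    return spam_freq / (spam_freq + ham_freq)


# Same token behaviour (25% of spam, 2.5% of ham), two corpus ratios.
balanced = token_spam_probability(50, 5, n_spam=200, n_ham=200)    # 1:1
skewed = token_spam_probability(50, 50, n_spam=200, n_ham=2000)    # 10:1 ham

print(f"1:1 corpus:  {balanced:.4f}")
print(f"10:1 corpus: {skewed:.4f}")
```

Both calls give the same probability, because each count is divided by
the size of its own corpus before the two are compared. What a small
spam corpus does cost you is coverage: with only 200 spams, many tokens
have been seen too few times to estimate their frequency reliably.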