Matt Kettler wrote:
At 11:15 AM 2/11/2005, Matías López Bergero wrote:

The sa-learn man page says that for good training of the Bayesian
filter, you need to train it with equal amounts of spam and ham, or more
ham if possible. So if I sa-learn the spam folder, the spam tokens
are going to grow a lot compared to the ham tokens.

Would this increase in the spam data have adverse effects on the Bayes
filter's classification of spam and ham messages?

The manpage is describing an ideal situation. Really, you can be pretty wildly off and Bayes will still work reasonably well.


Training on a lot more spam than ham makes Bayes more likely to misclassify a nonspam email as spam, but in practice my training is VERY off balance and I've had no problems with it at all. The difference between spam and nonspam here is just too great; even a massive imbalance isn't causing FPs.

Look at my stats:

0.000          0          2          0  non-token data: bayes db version
0.000          0     565896          0  non-token data: nspam
0.000          0      24693          0  non-token data: nham
0.000          0     180900          0  non-token data: ntokens

My spam training outnumbers my ham training by 22:1. That's pretty far from the ideal 1:1 (or ham-heavy) ratio. I've got more FN (missed spam) problems than FP problems with my Bayes DB, but I rarely have trouble with either.
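You can check that ratio yourself from the nspam/nham counters in the `--dump magic` output above; a quick shell sketch using the numbers quoted here:

```shell
# Compute the spam:ham training ratio from the sa-learn --dump magic
# counters quoted above (nspam and nham "non-token data" lines).
nspam=565896   # "non-token data: nspam"
nham=24693     # "non-token data: nham"
# Integer division is fine for a rough ratio check.
echo "spam:ham ratio is roughly $((nspam / nham)):1"
```

This prints a ratio of roughly 22:1, matching the figure above.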

Wow!
Thank you very much for your answer, Matt! I'm going to sa-learn those spam messages right now.
I was too worried about the spam/ham data balance; now it's going to be much easier to train the Bayesian filter :)
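For reference, learning a stored spam folder and a ham folder might look something like the following; the `--spam`, `--ham`, `--mbox`, and `--dump magic` options are standard sa-learn options, but the folder paths and maildir layout are assumptions about a typical setup:

```shell
# Sketch: training Bayes from stored mail folders (paths are assumptions).
# sa-learn accepts maildir directories directly, or mbox files with --mbox.
sa-learn --spam ~/Maildir/.Junk/cur        # learn messages filed as spam
sa-learn --ham  ~/Maildir/cur              # learn known-good mail
sa-learn --dump magic                      # inspect the nspam/nham counters
```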


I'm also storing the undetected spam messages and the ham data, and from time to time I re-run sa-learn on those files. sa-learn does not report any learning, but I have read somewhere, probably in the sa-learn man page, that this is a good thing to do because it helps reinforce the Bayes data and the Bayesian filter's decisions. Did I get this right?

BR,
Matías.
