-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Saturday, 8 January 2005 22:55, Fajar Priyanto wrote:
> On Sunday 09 January 2005 04:47 am, Matt Kettler wrote:
[..]
> > Train spam as spam, train ham as ham. Let the statistics deal with the
> > overlap. By trying to avoid training "spamish" ham or "hamish" spam
> > you're just doing your training a big disservice by making it
> > unrealistic.
>
> Thanks Matt,
> So talking statistically, does it mean I have to train SA about 'ham' as
> many as 'spam'? Right now, I train SA mostly on spams.

You must train both ham and spam. How should the Bayes filter know what
ham is if you never trained it?

As far as I understand it, the Bayes filter searches for tokens in the
email. If a token was found in 30 spam and 10 ham mails, then the
probability of a message with that token being spam is 30 / (30 + 10) = 75%.
But if you train only spam, the Bayes filter would say: I have learned
30 spam mails containing this token but no ham, so the probability of
being spam is 100%.
(The Bayes calculation is done with several ham/spam tokens. How many
tokens are taken into account I don't know.)

If you train only or mostly spam, this will poison your database and the
false positives will grow. To keep false positives low, you should teach
SA all your ham.

It's unlikely that you will train as much ham as spam, because there is
more spam. But that does no harm. The Bayesian filter works on the tokens
it finds. Let's assume you have taught it 200 spam and 100 ham. 100 spam
and 100 ham contained the token x. If x is found in a new message, then
the spam probability is 100 / (100 + 100) = 50%, even though x appears in
100% of the ham messages. If you teach only half of the ham messages, the
spam-ham ratio for x would be 100 to 50, which gives a probability of
100 / (100 + 50), about 67%, of being spam.

Regards
Thomas
- --
icq:133073900
http://www.t-arend.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFB4RLeHe2ZLU3NgHsRAjgSAKCHYwQWLMJExHdtrgb0OLXHHy00XwCeKIyw
Y7oZeRBZ22sOlpZFmc5Ln7M=
=i9Cw
-----END PGP SIGNATURE-----
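
The token arithmetic in the message above can be checked with a small
Python sketch. It assumes the simplified single-token model described
there (spam count divided by total count for one token). The name
spam_probability is a hypothetical helper chosen for illustration, and
the sketch is not meant to reproduce SpamAssassin's actual Bayes
computation, which combines several tokens.

```python
def spam_probability(spam_count: int, ham_count: int) -> float:
    """Fraction of trained messages containing a token that were spam.

    Simplified model from the message above (an assumption, not
    SpamAssassin's real formula): spam / (spam + ham) for one token.
    """
    total = spam_count + ham_count
    if total == 0:
        # The token never appeared in training, so there is no evidence.
        raise ValueError("token was never seen in training")
    return spam_count / total


if __name__ == "__main__":
    # Token seen in 30 spam and 10 ham mails.
    print(spam_probability(30, 10))
    # Same token, but no ham was ever trained.
    print(spam_probability(30, 0))
    # Token x seen in 100 spam and 100 ham mails.
    print(spam_probability(100, 100))
    # Only half of the ham containing x was trained.
    print(spam_probability(100, 50))
```

Under this model, leaving ham out of training can only push a token's
score toward spam, which is the false-positive effect described above.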