Re: Bayes values for common words

Joe Emenaker 8 Jun 2004 01:23:48 -0000

Ole Nomann Thomsen wrote:

Hi, I noticed that words common to all mails seem to get a spam value
close to zero, as in:
0.035      47222     446614 1086615228  Subject
Why isn't it close to 0.500? As far as I can see, the word "Subject" should
have exactly no influence on spammishness, since it is always there.

Well, I'm quite a newbie as far as SA goes, but what I've been gathering about the Bayes component is that the score does *not* compare how often the word turns up within spam messages, as a group, against how often it turns up within ham messages, as a group. If it did, many common words would have Bayes scores near 0.500.

Instead, I think Bayes actually takes into account how the messages you've learned are split between spam and ham. So, for example, if you *learn* 99 ham messages and 1 spam message, then your score for the word "Subject" would probably be somewhere around 0.010.

When you think about it, this makes a little more sense. You want to be able to scan a message, take a word from the message and, using that word, estimate the odds that the message is spam. If you receive 99 hams for each spam you get, then looking up "Subject" would tell you that there's a 1% chance that the message is spam.
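To put numbers on the two readings, here is a rough Python sketch. It only illustrates the distinction I'm describing and is not claimed to be the formula SA actually uses; the function names `per_class_ratio` and `raw_share` are made up for the example, and it assumes "Subject" appears in every one of the 99 hams and the 1 spam.

```python
def per_class_ratio(spam_with, ham_with, n_spam, n_ham):
    """Reading 1: compare the word's frequency within each class.

    A word present in every message has frequency 1.0 in both
    classes, so it lands at 0.5 no matter how many hams you learned.
    """
    spam_freq = spam_with / n_spam
    ham_freq = ham_with / n_ham
    return spam_freq / (spam_freq + ham_freq)


def raw_share(spam_with, ham_with):
    """Reading 2: the share of messages containing the word that are spam.

    This depends on how many spams and hams were learned overall.
    """
    return spam_with / (spam_with + ham_with)


# Assumed corpus: 1 spam and 99 hams, every one containing "Subject".
n_spam, n_ham = 1, 99
balanced = per_class_ratio(1, 99, n_spam, n_ham)
skewed = raw_share(1, 99)

print(f"per-class ratio: {balanced:.3f}")
print(f"raw share:       {skewed:.3f}")
```

The first number comes out at 0.500 and the second at 0.010, which is the gap between what Ole expected and what I'm guessing the database is reporting.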

Get it?

- Joe
