Hello all,
I have a simple idea for implementing a Bayesian classifier that uses Bayes factors.

Suppose we have the word "viagra" in the following situation:

The word was found in 10 ham mails and was not seen in 20 ham mails (30 ham mails in total).
The word was found in 50 spam mails and was not seen in 30 spam mails (80 spam mails in total).

The current procedure is to calculate

g(w) = 10/(10+20)
b(w) = 50/(50+30)

and then

p(w) = b(w)/(b(w)+g(w))

I suggest the following calculation instead. First add a prior value of 1 to each cell (so that unobserved words cause no problems), then calculate the log odds:

LogOdds = log((11*31) / (21*51))

The standard deviation of the log odds is given by

stdev = sqrt(1/11 + 1/21 + 1/51 + 1/31)

Next, calculate the Bayes factor that the word is a spam indicator versus that it is not a spam indicator:

help = pNorm(0, LogOdds, stdev)
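where pNorm is, in the words of Gary, "the inverse normal function, used to derive a p-value from a normal-distributed random variable". That is, help is the probability that a normally distributed value with mean LogOdds and standard deviation stdev falls below 0.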
The Bayes factor is then given by

BF = help/(1-help)
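
The per-word calculation above can be sketched in Python as follows. This is only an illustration under some assumptions: the function and parameter names are mine, pNorm is taken to be the normal cumulative distribution function called as (point, mean, standard deviation), and statistics.NormalDist (Python 3.8+) stands in for it.

```python
from math import log, sqrt
from statistics import NormalDist


def current_p(ham_with, ham_without, spam_with, spam_without):
    """The existing g(w), b(w), p(w) calculation, for comparison."""
    g = ham_with / (ham_with + ham_without)
    b = spam_with / (spam_with + spam_without)
    return b / (b + g)


def word_bayes_factor(ham_with, ham_without, spam_with, spam_without, prior=1.0):
    """Bayes factor that a word is a spam indicator (hypothetical helper).

    The default prior of 1 is added to every cell, as proposed in the mail.
    With prior=0 a zero count raises an error (division by zero or log of 0).
    """
    a = ham_with + prior       # ham mails containing the word
    b = ham_without + prior    # ham mails without the word
    c = spam_with + prior      # spam mails containing the word
    d = spam_without + prior   # spam mails without the word

    log_odds = log((a * d) / (b * c))
    stdev = sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    # Assumption: pNorm(0, LogOdds, stdev) is the normal CDF evaluated at 0.
    help_ = NormalDist(mu=log_odds, sigma=stdev).cdf(0.0)

    # Note: for very extreme counts help_ can round to exactly 1.0,
    # in which case this division fails.
    return help_ / (1.0 - help_)


if __name__ == "__main__":
    print(current_p(10, 20, 50, 30))
    print(word_bayes_factor(10, 20, 50, 30))
```

For the "viagra" counts the Bayes factor comes out far above 1, so the word counts as a spam indicator. Reading the one-word values quoted further down in this mail as found/not-found pairs, the sketch with prior=0 gives roughly 1.6 for H: 1/2 S: 3/4 and roughly 4.3 for H: 10/20 S: 30/40.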
The interpretation is simple: if the value is larger than 1, the mail is more likely to be spam. The number can be given a finer interpretation, but for the moment the criterion is: larger than 1 = spam, smaller than 1 = ham.

For Bayes factors the product rule applies: the total Bayes factor is the product of the Bayes factors of all the individual words in the email to be classified.

BF_total = BF(word_1) * BF(word_2) * ... * BF(word_n)

Some values using 1 word:
H: 10/10    S: 50/50    BF=1
H: 100/100  S: 500/500  BF=1
-----------------------------------
H: 1/2      S: 3/4      BF=1.5
H: 10/20    S: 30/40    BF=4.3
-----------------------------------
H: 3/10     S: 50/10    BF=very small
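
The product rule for BF_total could be sketched like this. The function names are hypothetical, and summing logarithms instead of multiplying the factors directly is my own choice, made so that a long mail with many extreme factors does not overflow or underflow. The mail does not say what to do when BF_total is exactly 1; the sketch arbitrarily calls that ham.

```python
from math import fsum, log


def log_total_bayes_factor(word_bfs):
    """Return log(BF_total) for an iterable of per-word Bayes factors.

    Assumes every factor is strictly positive; a factor of 0 raises ValueError.
    """
    return fsum(log(bf) for bf in word_bfs)


def classify(word_bfs):
    """Apply the rule from the mail: BF_total > 1 means spam, otherwise ham."""
    # log(BF_total) > 0 is equivalent to BF_total > 1.
    # Assumption: BF_total == 1 is treated as ham, since the mail leaves it open.
    return "spam" if log_total_bayes_factor(word_bfs) > 0.0 else "ham"


if __name__ == "__main__":
    print(classify([1.5, 4.3]))   # product 6.45, larger than 1
    print(classify([0.5, 1.5]))   # product 0.75, smaller than 1
```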
Any suggestions?
All the best,
Olav Laudy
_______________________________________________ spambayes-dev mailing list [email protected] http://mail.python.org/mailman/listinfo/spambayes-dev
