Hello all,


I have a simple idea for implementing a Bayesian classifier that uses Bayes factors.

Suppose we have the word "viagra" in the following situation:

The word was found in 10 ham mails and was not seen in 20 ham mails (30 ham mails in total).
The word was found in 50 spam mails and was not seen in 30 spam mails (80 spam mails in total).

The current procedure is to calculate

g(w) = 10/(10+20)
b(w) = 50/(50+30)

and then

p(w) = b(w)/(b(w)+g(w))
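
For concreteness, a rough Python sketch of this calculation (word_spamprob and the count names are just illustrative, not identifiers from the SpamBayes source):

def word_spamprob(ham_with, ham_without, spam_with, spam_without):
    # g(w): fraction of ham mails that contain the word
    g = ham_with / (ham_with + ham_without)
    # b(w): fraction of spam mails that contain the word
    b = spam_with / (spam_with + spam_without)
    # p(w) = b(w) / (b(w) + g(w))
    return b / (b + g)

# The "viagra" example above: g = 10/30, b = 50/80, p(w) is about 0.65
print(word_spamprob(10, 20, 50, 30))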

I suggest the following calculation: first add a prior value of 1 to each cell (so there is no problem with unobserved words), then calculate the log odds:

LogOdds = log((11*31) / (21*51))

(using the smoothed counts 11 and 21 for ham, 51 and 31 for spam). The standard deviation is given by stdev = sqrt(1/11 + 1/21 + 1/51 + 1/31)
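
A small Python sketch of the smoothed log odds and its standard deviation (again, the names are just illustrative):

from math import log, sqrt

def log_odds_and_stdev(ham_with, ham_without, spam_with, spam_without, prior=1):
    # add the prior count to every cell of the 2x2 table
    a = ham_with + prior       # 11 in the example
    b = ham_without + prior    # 21
    c = spam_with + prior      # 51
    d = spam_without + prior   # 31
    # log odds ratio of the smoothed table
    log_odds = log((a * d) / (b * c))
    # standard deviation = sqrt of the sum of the reciprocal cell counts
    stdev = sqrt(1/a + 1/b + 1/c + 1/d)
    return log_odds, stdev

# the "viagra" example: log((11*31)/(21*51)) is about -1.14, stdev about 0.44
print(log_odds_and_stdev(10, 20, 50, 30))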

Next, calculate the Bayes factor that a word is a spam indicator versus not a spam indicator:

help = pNorm(0, LogOdds, stdev)
 
where pNorm is, in the words of Gary, "the inverse normal function, used to derive a p-value from a normal-distributed random variable"; here, that is the cumulative normal probability of falling below 0 for a normal distribution with mean LogOdds and standard deviation stdev. The Bayes factor is then given by

BF = help / (1 - help)
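
A sketch of this step in Python, using math.erf for the normal CDF so nothing outside the standard library is needed (pnorm and bayes_factor are my names, not Gary's code):

from math import erf, sqrt

def pnorm(x, mean, stdev):
    # cumulative normal probability P(X < x) for X ~ Normal(mean, stdev)
    return 0.5 * (1.0 + erf((x - mean) / (stdev * sqrt(2.0))))

def bayes_factor(log_odds, stdev):
    # help = pNorm(0, LogOdds, stdev);  BF = help / (1 - help)
    h = pnorm(0.0, log_odds, stdev)
    return h / (1.0 - h)

# with the "viagra" values from above (LogOdds about -1.14, stdev about 0.44)
# the Bayes factor comes out well above 1, i.e. the word points towards spam
print(bayes_factor(-1.14, 0.44))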
 

The interpretation is simple: if the value is larger than 1, the mail is more
likely to be spam. The number can be given a more precise interpretation, but
for the moment the criterion is: larger than 1 = spam, smaller than 1 = ham.

For Bayes factors, the product rule applies: the total Bayes factor is the
product of the Bayes factors of all the individual words in the email to be
classified:

BF_total = BF(word_1) * BF(word_2) * ... * BF(word_n)
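
A sketch of the combination step; summing the logs of the factors is equivalent to the straight product and avoids overflow/underflow when an email contains many words (the 0.2 in the usage line is just a made-up ham-leaning factor):

from math import exp, log

def combine_bayes_factors(word_bfs):
    # BF_total = BF(word_1) * BF(word_2) * ... * BF(word_n),
    # computed as exp(sum of logs) so many small/large factors behave numerically
    return exp(sum(log(bf) for bf in word_bfs))

def classify(word_bfs, threshold=1.0):
    # larger than the threshold (1 for now) = spam, smaller = ham
    return "spam" if combine_bayes_factors(word_bfs) > threshold else "ham"

# three per-word factors: 1.5 * 4.3 * 0.2 = 1.29 > 1, so "spam"
print(classify([1.5, 4.3, 0.2]))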
 
 
Some values using a single word (counts given as found / not found):

H: 10/10     S: 50/50     BF = 1
H: 100/100   S: 500/500   BF = 1
-----------------------------------
H: 1/2       S: 3/4       BF = 1.5
H: 10/20     S: 30/40     BF = 4.3
-----------------------------------
H: 3/10      S: 50/10     BF = very large (a strong spam indicator)
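
To make the rows easy to check, here is a self-contained script that runs each one end to end. Note it applies the +1 prior, so the small-count rows may come out a little different from the figures quoted here:

from math import erf, log, sqrt

def word_bayes_factor(ham_with, ham_without, spam_with, spam_without, prior=1):
    # smoothed 2x2 table -> log odds -> standard deviation -> Bayes factor
    a, b = ham_with + prior, ham_without + prior
    c, d = spam_with + prior, spam_without + prior
    log_odds = log((a * d) / (b * c))
    stdev = sqrt(1/a + 1/b + 1/c + 1/d)
    h = 0.5 * (1.0 + erf((0.0 - log_odds) / (stdev * sqrt(2.0))))
    return h / (1.0 - h)

# the rows above, as (ham found, ham not found, spam found, spam not found)
rows = [(10, 10, 50, 50), (100, 100, 500, 500),
        (1, 2, 3, 4), (10, 20, 30, 40), (3, 10, 50, 10)]
for counts in rows:
    print(counts, round(word_bayes_factor(*counts), 2))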
 
 
Any suggestions?
 
 
 
All the best,
 
Olav Laudy
 

 
 
 