After analyzing the Mahout Bayes code, I found that class priors are not taken into account; Mahout implements a different variant of Naive Bayes. Today I evaluated Mallet, a Java machine learning library from http://mallet.cs.umass.edu. For the trivial test data below, it gives the results I was expecting to see: all records are classified as T.

csvline:1 T 0.8709677419354839 F 0.12903225806451615
csvline:2 T 0.8709677419354839 F 0.12903225806451615
csvline:3 T 0.8709677419354839 F 0.12903225806451615
csvline:4 T 0.6923076923076923 F 0.30769230769230765
csvline:5 T 0.6923076923076923 F 0.30769230769230765
csvline:6 T 0.6923076923076923 F 0.30769230769230765

2012/1/18 Daniel Korzekwa <[email protected]>

> Hello,
>
> I'm training bayes classifier against this data (6 records):
>
> target, words
> T A A A
> T A A A
> T A A A
> T A A B
> T A A B
> F A A B
>
> with a command:
> ./mahout trainclassifier -i /mnt/hgfs/C/daniel/my_fav_data/test -o model -type bayes -ng 1 -source hdfs
>
> then I test this classifier against the same data with:
> ./mahout testclassifier -d /mnt/hgfs/C/daniel/my_fav_data/test -m model -type bayes -ng 1 -source hdfs -method sequential -v
>
> and I'm getting classification I cannot understand. All records are classified as F, why is that?, shouldn't they be all classified as T?
> 12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 0 Line(30): T A A A Expected Label: T Classified Label: F Correct: false
> 12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 1 Line(30): T A A A Expected Label: T Classified Label: F Correct: false
> 12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 2 Line(30): T A A A Expected Label: T Classified Label: F Correct: false
> 12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 3 Line(30): T A A B Expected Label: T Classified Label: F Correct: false
> 12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 4 Line(30): T A A B Expected Label: T Classified Label: F Correct: false
> 12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 5 Line(30): F A A B Expected Label: F Classified Label: F Correct: true
>
> My reasoning (no smoothing applied):
> Prior:
> P(T) = 5/6
> P(F) = 1/6
>
> P(A/T) = 13/15
> P(A/F) = 2/3
>
> P(B/T) = 2/15
> P(B/F) = 1/3
>
> Then I calculate posterior probability, e.g.
> P(T|A,A,B) = 0.7717 - record classified as T.
>
> What is the reasoning behind classifying all records above as F?
>
> Any help much appreciated.
>
> PS. I was using mahout trunk from 16.01.2012.
>
> Regards.
> Daniel
>
> --
> Daniel Korzekwa
> Software Engineer
> priv: http://danmachine.com
> blog: http://blog.danmachine.com

--
Daniel Korzekwa
Software Engineer
priv: http://danmachine.com
blog: http://blog.danmachine.com
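
For reference, the hand calculation in the quoted message (priors included, no smoothing) can be reproduced with a short Python sketch. It assumes plain multinomial Naive Bayes exactly as in that reasoning; it is not Mahout's or Mallet's implementation, and the name `posterior` is only illustrative.

```python
from collections import Counter
from fractions import Fraction

# The six training records from the quoted message: (label, words).
records = [
    ("T", "A A A"), ("T", "A A A"), ("T", "A A A"),
    ("T", "A A B"), ("T", "A A B"),
    ("F", "A A B"),
]

# Count documents per label and word occurrences per label.
label_counts = Counter(label for label, _ in records)
word_counts = {label: Counter() for label in label_counts}
for label, words in records:
    word_counts[label].update(words.split())


def posterior(words):
    """Return P(label | words) per label, using priors and no smoothing."""
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = Fraction(n_docs, len(records))  # prior P(label)
        for word in words.split():
            # Likelihood P(word | label) from raw counts.
            score *= Fraction(word_counts[label][word], total_words)
        scores[label] = score
    norm = sum(scores.values())
    return {label: float(score / norm) for label, score in scores.items()}


print(posterior("A A B"))  # T comes out at about 0.7717
```

With the priors in place, T wins for both "A A A" and "A A B", which matches the expectation in the original question.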
