Benjamin, can you post your actual training data on Dropbox or some other place so that we can replicate the problem?
On Fri, Sep 16, 2011 at 3:38 PM, Benjamin Rey <[email protected]> wrote:

> Unfortunately CNB gives me the same 66% accuracy.
>
> I pasted the commands for Mahout and Weka below.
>
> I also tried to remove the biggest class; it helps, but then it's the 2nd
> biggest class that is overwhelmingly predicted. Mahout bayes seems to
> favor the biggest class a lot (more than the prior would suggest),
> contrary to Weka's implementation. Is there any choice in the parameters,
> or in the way the weights are computed, that could be causing this?
>
> Thanks.
>
> Benjamin
>
> Here are the commands.
>
> On Mahout:
> # training set: the usual PrepareTwentyNewsgroups step, followed by
> # subsampling down to just a few classes
> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
>   -p examples/bin/work/20news-bydate/20news-bydate-train \
>   -o examples/bin/work/20news-bydate/bayes-train-input \
>   -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
> mkdir examples/bin/work/20news_ss/bayes-train-input/
> head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
> head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
> head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
> hdput examples/bin/work/20news_ss/bayes-train-input \
>   examples/bin/work/20news_ss/bayes-train-input
>
> Then the same exact thing for testing.
>
> # actual training:
> bin/mahout trainclassifier -i examples/bin/work/20news_ss/bayes-train-input \
>   -o examples/bin/work/20news-bydate/cbayes-model_ss \
>   -type cbayes -ng 1 -source hdfs
>
> # testing:
> bin/mahout testclassifier -d examples/bin/work/20news_ss/bayes-test-input \
>   -m examples/bin/work/20news-bydate/cbayes-model_ss \
>   -type cbayes -ng 1 -source hdfs
>
> => 66% accuracy
>
> And for Weka:
> # create the .arff files from the 20news_ss train and test sets;
> # start each file with the appropriate header:
> -----
> @relation _home_benjamin_Data_BY_weka
>
> @attribute text string
> @attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
>
> @data
> -----
> # then paste the data:
> cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* \
>   | perl mh2arff.pl >> 20news_ss_test.arff
> # with mh2arff.pl:
> ----
> use strict;
> while (<STDIN>) {
>     chomp;
>     $_ =~ s/\'/\\\'/g;
>     $_ =~ s/ $//;
>     my ($c, $t) = split("\t", $_);
>     print "'$t',$c\n";
> }
> ----
> # and the train/test command:
> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
>   -T 20news_ss_test.arff \
>   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>   -W weka.classifiers.bayes.NaiveBayesMultinomial
>
> => 92% accuracy
>
> 2011/9/16 Robin Anil <[email protected]>
>
> > Did you try complement naive Bayes (CNB)? I am guessing the multinomial
> > naive Bayes mentioned here is a CNB-like implementation and not NB.
> >
> > On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'm trying out different classifiers for a classical text
> > > classification problem very close to the 20newsgroups one.
> > > I end up with much better results with Weka's NaiveBayesMultinomial
> > > than with Mahout bayes.
> > > The main problem comes from the fact that my data is unbalanced. I
> > > know bayes has difficulties with that, yet I'm surprised by the
> > > difference between Weka and Mahout.
> > >
> > > I went back to the 20newsgroups example, picked 5 classes only, and
> > > subsampled them to get 5 classes with 400, 200, 100, 100, and 30
> > > examples, and pretty much the same for the test set.
> > > On Mahout with bayes 1-gram, I'm getting 66% correctly classified
> > > (see below for the confusion matrix).
> > > On Weka, on the same exact data, without any tuning, I'm getting 92%
> > > correctly classified.
> > >
> > > Would anyone know where the difference comes from, and if there are
> > > ways I could tune Mahout to get better results? My data is small
> > > enough for Weka for now, but this won't last.
> > >
> > > Many thanks,
> > >
> > > Benjamin.
> > >
> > >
> > > MAHOUT:
> > > -------------------------------------------------------
> > > Correctly Classified Instances   :  491  65.5541%
> > > Incorrectly Classified Instances :  258  34.4459%
> > > Total Classified Instances       :  749
> > >
> > > =======================================================
> > > Confusion Matrix
> > > -------------------------------------------------------
> > >   a    b    c    d    e    f   <--Classified as
> > >  14   82    0    4    0    0  | 100 a = rec.sport.hockey
> > >   0  319    0    0    0    0  | 319 b = alt.atheism
> > >   0   88    3    9    0    0  | 100 c = rec.autos
> > >   0   45    0  155    0    0  | 200 d = comp.graphics
> > >   0   25    0    5    0    0  |  30 e = sci.med
> > >   0    0    0    0    0    0  |   0 f = unknown
> > > Default Category: unknown: 5
> > >
> > >
> > > WEKA:
> > > java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
> > >   -T 20news_ss_test.arff \
> > >   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
> > >   -W weka.classifiers.bayes.NaiveBayesMultinomial
> > >
> > > === Error on test data ===
> > >
> > > Correctly Classified Instances        688    91.8558 %
> > > Incorrectly Classified Instances       61     8.1442 %
> > > Kappa statistic                         0.8836
> > > Mean absolute error                     0.0334
> > > Root mean squared error                 0.1706
> > > Relative absolute error                11.9863 %
> > > Root relative squared error            45.151  %
> > > Total Number of Instances             749
> > >
> > > === Confusion Matrix ===
> > >
> > >    a   b   c   d   e  <-- classified as
> > >  308   9   2   0   0 |  a = alt.atheism
> > >    5 195   0   0   0 |  b = comp.graphics
> > >    3  11  84   2   0 |  c = rec.autos
> > >    3   3   0  94   0 |  d = rec.sport.hockey
> > >    6  11   6   0   7 |  e = sci.med
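[Editor's note] A quick back-of-the-envelope check on the numbers in the thread, with class sizes read off the row totals of the Mahout confusion matrix, shows how each result compares to the trivial majority-class baseline, and quantifies how heavily Mahout's predictions pile into the biggest class:

```python
# Test-set class sizes, taken from the Mahout confusion matrix row totals.
test_sizes = {"alt.atheism": 319, "comp.graphics": 200, "rec.autos": 100,
              "rec.sport.hockey": 100, "sci.med": 30}
total = sum(test_sizes.values())                  # 749, matching both reports
majority_baseline = max(test_sizes.values()) / total
print(round(majority_baseline, 3))                # -> 0.426

# Column b of the Mahout matrix: how many test docs were labelled alt.atheism.
predicted_alt = 82 + 319 + 88 + 45 + 25
print(round(predicted_alt / total, 3))            # -> 0.746
```

So Mahout's 66% sits well above the 42.6% "always predict alt.atheism" baseline, but roughly 75% of all test documents are still being sent to the biggest class, which is exactly the skew Benjamin describes.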
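[Editor's note] Robin's CNB suggestion is worth unpacking. The sketch below is illustrative only — it is not Mahout's or Weka's actual code, and the "big"/"tiny" classes and their toy documents are made up. It shows the mechanism at stake: standard multinomial NB combines the class prior with per-class smoothed likelihoods, both of which favor a heavily represented class, while complement NB (Rennie et al. 2003, the idea behind Mahout's cbayes) estimates each class's weights from the pooled counts of all *other* classes, which damps that bias:

```python
import math
from collections import Counter

def train_mnb(docs):
    """Standard multinomial NB. docs: dict class -> list of token lists."""
    total_docs = sum(len(ds) for ds in docs.values())
    vocab = {t for ds in docs.values() for d in ds for t in d}
    log_prior, log_lik = {}, {}
    for c, ds in docs.items():
        log_prior[c] = math.log(len(ds) / total_docs)
        counts = Counter(t for d in ds for t in d)
        denom = sum(counts.values()) + len(vocab)       # Laplace smoothing
        log_lik[c] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
    return log_prior, log_lik

def train_cnb(docs):
    """Complement NB: weight each class by the NEGATED log-likelihoods
    estimated from all OTHER classes' counts (no prior term here)."""
    vocab = {t for ds in docs.values() for d in ds for t in d}
    weights = {}
    for c in docs:
        counts = Counter(t for oc, ds in docs.items() if oc != c
                         for d in ds for t in d)
        denom = sum(counts.values()) + len(vocab)
        weights[c] = {t: -math.log((counts[t] + 1) / denom) for t in vocab}
    return weights

def predict_mnb(model, doc):
    log_prior, log_lik = model
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_lik[c].get(t, 0.0) for t in doc))

def predict_cnb(weights, doc):
    return max(weights, key=lambda c: sum(weights[c].get(t, 0.0) for t in doc))

# Toy imbalanced corpus: 20 docs in "big", 1 doc in "tiny".
docs = {
    "big":  [["puck", "ice"], ["ice", "goal"]] * 10,
    "tiny": [["engine", "car"]],
}
test_doc = ["engine", "ice"]  # half "tiny" vocabulary, half "big"
print(predict_mnb(train_mnb(docs), test_doc))   # -> big
print(predict_cnb(train_cnb(docs), test_doc))   # -> tiny
```

On this toy doc, the prior and the better-smoothed likelihoods of the big class win under standard NB, while CNB flips the decision. Since Benjamin reports that cbayes gave him the same 66%, why Mahout's implementation still collapses onto the biggest class remains the open question in this thread.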
