Benjamin, can you post your actual training data on Dropbox or some other place so that we can replicate the problem?
On Fri, Sep 16, 2011 at 3:38 PM, Benjamin Rey <[email protected]> wrote:

> Unfortunately CNB gives me the same 66% accuracy.
>
> I pasted the commands for Mahout and Weka below.
>
> I also tried to remove the biggest class; it helps, but then it's the 2nd
> biggest class that is overwhelmingly predicted. Mahout bayes seems to
> favor the biggest class a lot (more than the prior would suggest),
> contrary to Weka's implementation. Is there any choice in the parameters,
> or in the way the weights are computed, that could be causing this?
>
> Thanks.
>
> Benjamin
>
> Here are the commands.
>
> On Mahout:
> # training set: the usual PrepareTwentyNewsgroups step, followed by
> # subsampling down to just a few classes
> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
>   -p examples/bin/work/20news-bydate/20news-bydate-train \
>   -o examples/bin/work/20news-bydate/bayes-train-input \
>   -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
> mkdir examples/bin/work/20news_ss/bayes-train-input/
> head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
> head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
> head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt \
>   > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
> hdput examples/bin/work/20news_ss/bayes-train-input \
>   examples/bin/work/20news_ss/bayes-train-input
>
> Then the same exact thing for testing.
>
> # actual training:
> bin/mahout trainclassifier -i examples/bin/work/20news_ss/bayes-train-input \
>   -o examples/bin/work/20news-bydate/cbayes-model_ss \
>   -type cbayes -ng 1 -source hdfs
>
> # testing:
> bin/mahout testclassifier -d examples/bin/work/20news_ss/bayes-test-input \
>   -m examples/bin/work/20news-bydate/cbayes-model_ss \
>   -type cbayes -ng 1 -source hdfs
>
> => 66% accuracy
>
> And for Weka:
> # create the .arff files from the 20news_ss train and test sets;
> # start each file with the appropriate header:
> -----
> @relation _home_benjamin_Data_BY_weka
>
> @attribute text string
> @attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
>
> @data
> -----
> # then paste the data:
> cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* \
>   | perl mh2arff.pl >> 20news_ss_test.arff
> # with mh2arff.pl:
> ----
> use strict;
> while (<STDIN>) {
>     chomp;
>     $_ =~ s/\'/\\\'/g;
>     $_ =~ s/ $//;
>     my ($c, $t) = split("\t", $_);
>     print "'$t',$c\n";
> }
> ----
> # and the train/test command:
> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
>   -T 20news_ss_test.arff \
>   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>   -W weka.classifiers.bayes.NaiveBayesMultinomial
>
> => 92% accuracy
>
> 2011/9/16 Robin Anil <[email protected]>
>
> > Did you try complement naive Bayes (CNB)? I am guessing the multinomial
> > naive Bayes mentioned here is a CNB-like implementation and not NB.
> >
> > On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'm trying out different classifiers for a classical text
> > > classification problem very close to the 20newsgroups one.
> > > I end up with much better results with Weka's NaiveBayesMultinomial
> > > than with Mahout bayes.
> > > The main problem comes from the fact that my data is unbalanced. I
> > > know bayes has difficulties with that, yet I'm surprised by the
> > > difference between Weka and Mahout.
> > >
> > > I went back to the 20newsgroups example, picked 5 classes only, and
> > > subsampled them to get 5 classes with 400, 200, 100, 100, and 30
> > > examples, and pretty much the same for the test set.
> > > On Mahout with bayes 1-gram, I'm getting 66% correctly classified
> > > (see below for the confusion matrix).
> > > On Weka, on the same exact data, without any tuning, I'm getting 92%
> > > correctly classified.
> > >
> > > Would anyone know where the difference comes from, and if there are
> > > ways I could tune Mahout to get better results? My data is small
> > > enough for Weka for now, but this won't last.
> > >
> > > Many thanks,
> > >
> > > Benjamin.
> > >
> > >
> > > MAHOUT:
> > > -------------------------------------------------------
> > > Correctly Classified Instances   :  491  65.5541%
> > > Incorrectly Classified Instances :  258  34.4459%
> > > Total Classified Instances       :  749
> > >
> > > =======================================================
> > > Confusion Matrix
> > > -------------------------------------------------------
> > >   a    b    c    d    e    f   <--Classified as
> > >  14   82    0    4    0    0  | 100 a = rec.sport.hockey
> > >   0  319    0    0    0    0  | 319 b = alt.atheism
> > >   0   88    3    9    0    0  | 100 c = rec.autos
> > >   0   45    0  155    0    0  | 200 d = comp.graphics
> > >   0   25    0    5    0    0  |  30 e = sci.med
> > >   0    0    0    0    0    0  |   0 f = unknown
> > > Default Category: unknown: 5
> > >
> > >
> > > WEKA:
> > > java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
> > >   -T 20news_ss_test.arff \
> > >   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
> > >   -W weka.classifiers.bayes.NaiveBayesMultinomial
> > >
> > > === Error on test data ===
> > >
> > > Correctly Classified Instances        688    91.8558 %
> > > Incorrectly Classified Instances       61     8.1442 %
> > > Kappa statistic                         0.8836
> > > Mean absolute error                     0.0334
> > > Root mean squared error                 0.1706
> > > Relative absolute error                11.9863 %
> > > Root relative squared error            45.151  %
> > > Total Number of Instances             749
> > >
> > > === Confusion Matrix ===
> > >
> > >    a   b   c   d   e  <-- classified as
> > >  308   9   2   0   0 |  a = alt.atheism
> > >    5 195   0   0   0 |  b = comp.graphics
> > >    3  11  84   2   0 |  c = rec.autos
> > >    3   3   0  94   0 |  d = rec.sport.hockey
> > >    6  11   6   0   7 |  e = sci.med
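[Editor's note] A quick back-of-the-envelope check on the numbers in the thread, with class sizes read off the row totals of the Mahout confusion matrix, shows how each result compares to the trivial majority-class baseline, and quantifies how heavily Mahout's predictions pile into the biggest class:

```python
# Test-set class sizes, taken from the Mahout confusion matrix row totals.
test_sizes = {"alt.atheism": 319, "comp.graphics": 200, "rec.autos": 100,
              "rec.sport.hockey": 100, "sci.med": 30}
total = sum(test_sizes.values())                  # 749, matching both reports
majority_baseline = max(test_sizes.values()) / total
print(round(majority_baseline, 3))                # -> 0.426

# Column b of the Mahout matrix: how many test docs were labelled alt.atheism.
predicted_alt = 82 + 319 + 88 + 45 + 25
print(round(predicted_alt / total, 3))            # -> 0.746
```

So Mahout's 66% sits well above the 42.6% "always predict alt.atheism" baseline, but roughly 75% of all test documents are still being sent to the biggest class, which is exactly the skew Benjamin describes.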
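[Editor's note] Robin's CNB suggestion is worth unpacking. The sketch below is illustrative only — it is not Mahout's or Weka's actual code, and the "big"/"tiny" classes and their toy documents are made up. It shows the mechanism at stake: standard multinomial NB combines the class prior with per-class smoothed likelihoods, both of which favor a heavily represented class, while complement NB (Rennie et al. 2003, the idea behind Mahout's cbayes) estimates each class's weights from the pooled counts of all *other* classes, which damps that bias:

```python
import math
from collections import Counter

def train_mnb(docs):
    """Standard multinomial NB. docs: dict class -> list of token lists."""
    total_docs = sum(len(ds) for ds in docs.values())
    vocab = {t for ds in docs.values() for d in ds for t in d}
    log_prior, log_lik = {}, {}
    for c, ds in docs.items():
        log_prior[c] = math.log(len(ds) / total_docs)
        counts = Counter(t for d in ds for t in d)
        denom = sum(counts.values()) + len(vocab)       # Laplace smoothing
        log_lik[c] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
    return log_prior, log_lik

def train_cnb(docs):
    """Complement NB: weight each class by the NEGATED log-likelihoods
    estimated from all OTHER classes' counts (no prior term here)."""
    vocab = {t for ds in docs.values() for d in ds for t in d}
    weights = {}
    for c in docs:
        counts = Counter(t for oc, ds in docs.items() if oc != c
                         for d in ds for t in d)
        denom = sum(counts.values()) + len(vocab)
        weights[c] = {t: -math.log((counts[t] + 1) / denom) for t in vocab}
    return weights

def predict_mnb(model, doc):
    log_prior, log_lik = model
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_lik[c].get(t, 0.0) for t in doc))

def predict_cnb(weights, doc):
    return max(weights, key=lambda c: sum(weights[c].get(t, 0.0) for t in doc))

# Toy imbalanced corpus: 20 docs in "big", 1 doc in "tiny".
docs = {
    "big":  [["puck", "ice"], ["ice", "goal"]] * 10,
    "tiny": [["engine", "car"]],
}
test_doc = ["engine", "ice"]  # half "tiny" vocabulary, half "big"
print(predict_mnb(train_mnb(docs), test_doc))   # -> big
print(predict_cnb(train_cnb(docs), test_doc))   # -> tiny
```

On this toy doc, the prior and the better-smoothed likelihoods of the big class win under standard NB, while CNB flips the decision. Since Benjamin reports that cbayes gave him the same 66%, why Mahout's implementation still collapses onto the biggest class remains the open question in this thread.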
