Unfortunately CNB gives me the same 66% accuracy.

I've pasted the commands for Mahout and Weka below.

I also tried removing the biggest class; that helps, but then the
second-biggest class is the one that gets overwhelmingly predicted. Mahout
bayes seems to favor the biggest class much more heavily (beyond what the
prior alone would explain), unlike Weka's implementation. Is there any
choice of parameters, or in the way the weights are computed, that could be
causing this?
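
For concreteness, here is the head start that the log-prior alone gives the
biggest class under my 400/200/100/100/30 split (a quick standalone sketch,
not Mahout's actual code):
-----
public class PriorGap {
    public static void main(String[] args) {
        int[] counts = {400, 200, 100, 100, 30};  // my subsampled class sizes
        int total = 0;
        for (int c : counts) total += c;          // 830 training docs
        for (int c : counts) {
            // log P(class) is added to every document's score for that class
            System.out.printf("n=%-3d  log-prior=%.3f%n",
                              c, Math.log((double) c / total));
        }
        // biggest vs smallest class: log(400/30) ~= 2.59 nats of head start
    }
}
-----
Both tools see the same priors, though, so the extra pull toward the big
class that I'm seeing must come from how the term weights are computed.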

thanks.

benjamin

Here are the commands:
On Mahout:
# training set: the usual PrepareTwentyNewsgroups step, followed by
# subsampling down to just a few classes
bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
mkdir examples/bin/work/20news_ss/bayes-train-input/
head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
# hdput: local alias for "hadoop fs -put" (copies the subsampled input to HDFS)
hdput examples/bin/work/20news_ss/bayes-train-input examples/bin/work/20news_ss/bayes-train-input

   then the exact same thing for the test set

# actual training:
bin/mahout trainclassifier \
  -i examples/bin/work/20news_ss/bayes-train-input \
  -o examples/bin/work/20news-bydate/cbayes-model_ss \
  -type cbayes -ng 1 -source hdfs

# testing
bin/mahout testclassifier \
  -d examples/bin/work/20news_ss/bayes-test-input \
  -m examples/bin/work/20news-bydate/cbayes-model_ss \
  -type cbayes -ng 1 -source hdfs

=> 66% accuracy

And for Weka:
# create the .arff files from the 20news_ss train and test data:
Start each file with the appropriate header:
-----
@relation _home_benjamin_Data_BY_weka

@attribute text string
@attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}

@data
-----
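Each @data line is then the single-quoted document text followed by its
class label, e.g. (made-up text):
-----
'nhl playoffs were great last night ...',rec.sport.hockey
-----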
# then paste the data:
cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* | perl mh2arff.pl >> 20news_ss_test.arff
# where mh2arff.pl is:
----
use strict;
use warnings;

# Convert Mahout's prepared "label<TAB>text" lines into ARFF data rows.
while (<STDIN>) {
    chomp;
    s/'/\\'/g;                        # escape single quotes for the ARFF string field
    s/ $//;                           # drop a trailing space, if present
    my ($c, $t) = split("\t", $_, 2); # class label, then document text
    print "'$t',$c\n";
}
----
# and the train/test command:
java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T 20news_ss_test.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
  -W weka.classifiers.bayes.NaiveBayesMultinomial

=> 92% accuracy
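
For reference, the same run can also be driven from the Weka Java API; a
minimal sketch assuming the two .arff files above (these are standard Weka
classes, but treat it as illustrative rather than a verified reproduction):
-----
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NBRepro {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("20news_ss_train.arff");
        Instances test = DataSource.read("20news_ss_test.arff");
        train.setClassIndex(train.numAttributes() - 1); // class is the last attribute
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOptions(new String[] {"-S"});         // same -S flag as the command line

        FilteredClassifier fc = new FilteredClassifier(); // filter is fit on training data only
        fc.setFilter(filter);
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}
-----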

2011/9/16 Robin Anil <[email protected]>

> Did you try complementary naive Bayes (CNB)? I am guessing the multinomial
> naive Bayes mentioned here is a CNB-like implementation, not plain NB.
>
>
> On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <[email protected]> wrote:
>
> > Hello,
> >
> > I'm trying out different classifiers on a classic text classification
> > problem, very close to the 20newsgroups one.
> > I end up with much better results from Weka's NaiveBayesMultinomial than
> > from Mahout bayes.
> > The main problem comes from the fact that my data is unbalanced. I know
> > bayes has difficulties with that, yet I'm surprised by the size of the
> > difference between Weka and Mahout.
> >
> > I went back to the 20newsgroup example, picked only 5 classes, and
> > subsampled those to get classes with 400, 200, 100, 100, and 30 examples,
> > and pretty much the same for the test set.
> > On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see
> > below for the confusion matrix).
> > On Weka, on the exact same data, without any tuning, I'm getting 92%
> > correctly classified.
> >
> > Would anyone know where the difference comes from, and whether there are
> > ways I could tune Mahout to get better results? My data is small enough
> > for Weka for now, but this won't last.
> >
> > Many thanks
> >
> > Benjamin.
> >
> >
> >
> > MAHOUT:
> > -------------------------------------------------------
> > Correctly Classified Instances          :        491       65.5541%
> > Incorrectly Classified Instances        :        258       34.4459%
> > Total Classified Instances              :        749
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a        b        c        d        e        f        <--Classified as
> > 14       82       0        4        0        0         |  100       a = rec.sport.hockey
> > 0        319      0        0        0        0         |  319       b = alt.atheism
> > 0        88       3        9        0        0         |  100       c = rec.autos
> > 0        45       0        155      0        0         |  200       d = comp.graphics
> > 0        25       0        5        0        0         |  30        e = sci.med
> > 0        0        0        0        0        0         |  0         f = unknown
> > Default Category: unknown: 5
> >
> >
> > WEKA:
> > java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T 20news_ss_test.arff \
> >   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
> >   -W weka.classifiers.bayes.NaiveBayesMultinomial
> >
> > === Error on test data ===
> >
> > Correctly Classified Instances         688               91.8558 %
> > Incorrectly Classified Instances        61                8.1442 %
> > Kappa statistic                          0.8836
> > Mean absolute error                      0.0334
> > Root mean squared error                  0.1706
> > Relative absolute error                 11.9863 %
> > Root relative squared error             45.151  %
> > Total Number of Instances              749
> >
> >
> > === Confusion Matrix ===
> >
> >   a   b   c   d   e   <-- classified as
> >  308   9   2   0   0 |   a = alt.atheism
> >   5 195   0   0   0 |   b = comp.graphics
> >   3  11  84   2   0 |   c = rec.autos
> >   3   3   0  94   0 |   d = rec.sport.hockey
> >   6  11   6   0   7 |   e = sci.med
> >
>
