Unfortunately CNB gives me the same 66% accuracy.
I've pasted the commands for Mahout and Weka below.
I also tried removing the biggest class; that helps, but then it's the 2nd
biggest class that gets overwhelmingly predicted. Mahout's bayes seems to
strongly favor the biggest class (more than its prior would justify),
contrary to Weka's implementation. Is there a choice of parameters, or of
how the weights are computed, that could be causing this?
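(For scale, a quick baseline: always predicting the biggest test class,
alt.atheism with 319 of the 749 test documents, already gives
echo "scale=4; 319/749" | bc
# -> .4259
i.e. ~42.6% accuracy, so 66% is not all that far above prior-only guessing.)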
thanks.
benjamin
here are the commands:
On Mahout:
# training set: the usual prepare20newsgroups step, followed by subsampling
# down to just a few classes
bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
mkdir -p examples/bin/work/20news_ss/bayes-train-input/
head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt \
  > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt \
  > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt \
  > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt \
  > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt \
  > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
# put the subsampled training set into HDFS
hadoop fs -put examples/bin/work/20news_ss/bayes-train-input \
  examples/bin/work/20news_ss/bayes-train-input
then the exact same thing for the test set
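# quick sanity check that the subsample really ended up with the intended
# 400/200/100/100/30 class sizes:
wc -l examples/bin/work/20news_ss/bayes-train-input/*.txt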
# actual training:
bin/mahout trainclassifier \
  -i examples/bin/work/20news_ss/bayes-train-input \
  -o examples/bin/work/20news-bydate/cbayes-model_ss \
  -type cbayes -ng 1 -source hdfs
# testing
bin/mahout testclassifier \
  -d examples/bin/work/20news_ss/bayes-test-input \
  -m examples/bin/work/20news-bydate/cbayes-model_ss \
  -type cbayes -ng 1 -source hdfs
=> 66% accuracy
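For reference, the plain-NB run from my first mail (quoted below) was the same
pair of commands with -type bayes, e.g.:
bin/mahout trainclassifier \
  -i examples/bin/work/20news_ss/bayes-train-input \
  -o examples/bin/work/20news-bydate/bayes-model_ss \
  -type bayes -ng 1 -source hdfs
(bayes-model_ss is just a name I picked here to keep the two models apart; if
trainclassifier also has the alpha_i smoothing option I seem to remember, that
would be another parameter to play with, but I haven't checked the exact flag.)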
and for Weka:
# create the .arff files from the 20news_ss train and test sets;
# start each file with the appropriate header:
-----
@relation _home_benjamin_Data_BY_weka
@attribute text string
@attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
@data
-----
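(each appended @data line then looks like this, the text single-quoted with
the label last; the text here is made up:)
-----
'does anyone know a good opengl tutorial',comp.graphics
-----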
# then append the data:
cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* | \
  perl mh2arff.pl >> 20news_ss_test.arff
# with mh2arff.pl:
----
# mh2arff.pl: turn Mahout's "label<TAB>text" lines into ARFF data lines
use strict;
use warnings;

while (<STDIN>) {
    chomp;
    s/'/\\'/g;    # escape single quotes inside the ARFF string value
    s/ $//;       # drop a single trailing space
    # split on the first tab only, in case the text itself contains tabs
    my ($c, $t) = split(/\t/, $_, 2);
    print "'$t',$c\n";
}
----
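# and the same for the training set (after writing the same header first):
cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-train-input/* | \
  perl mh2arff.pl >> 20news_ss_train.arff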
# and the train/test command:
java weka.classifiers.meta.FilteredClassifier \
  -t 20news_ss_train.arff -T 20news_ss_test.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
  -W weka.classifiers.bayes.NaiveBayesMultinomial
=> 92% accuracy
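A cross-check I could still run on the Weka side: swap in Weka's own
complementary variant (assuming weka.classifiers.bayes.ComplementNaiveBayes
is available in this Weka version):
java weka.classifiers.meta.FilteredClassifier \
  -t 20news_ss_train.arff -T 20news_ss_test.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
  -W weka.classifiers.bayes.ComplementNaiveBayes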
2011/9/16 Robin Anil <[email protected]>
> Did you try complementary naive bayes (CNB)? I am guessing the multinomial
> naive bayes mentioned here is a CNB-like implementation, not plain NB.
>
>
> On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <[email protected]
> >wrote:
>
> > Hello,
> >
> > I'm trying out different classifiers on a classic text classification
> > problem, very close to the 20newsgroups one.
> > I end up with much better results from Weka's NaiveBayesMultinomial than
> > from Mahout's bayes.
> > The main problem is that my data is unbalanced. I know naive Bayes has
> > difficulties with that, yet I'm surprised by the size of the difference
> > between Weka and Mahout.
> >
> > I went back to the 20newsgroups example, picked only 5 classes, and
> > subsampled them to get 5 classes with 400, 200, 100, 100, and 30 examples,
> > and pretty much the same for the test set.
> > On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see
> > the confusion matrix below).
> > On Weka, on the exact same data, without any tuning, I'm getting 92%
> > correctly classified.
> >
> > Would anyone know where the difference comes from, and whether there are
> > ways I could tune Mahout to get better results? My data is small enough
> > for Weka for now, but that won't last.
> >
> > Many thanks
> >
> > Benjamin.
> >
> >
> >
> > MAHOUT:
> > -------------------------------------------------------
> > Correctly Classified Instances : 491 65.5541%
> > Incorrectly Classified Instances : 258 34.4459%
> > Total Classified Instances : 749
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a    b    c    d    e    f    <--Classified as
> > 14   82   0    4    0    0    | 100 a = rec.sport.hockey
> > 0    319  0    0    0    0    | 319 b = alt.atheism
> > 0    88   3    9    0    0    | 100 c = rec.autos
> > 0    45   0    155  0    0    | 200 d = comp.graphics
> > 0    25   0    5    0    0    | 30  e = sci.med
> > 0    0    0    0    0    0    | 0   f = unknown
> > Default Category: unknown: 5
> >
> >
> > WEKA:
> > java weka.classifiers.meta.FilteredClassifier \
> >   -t 20news_ss_train.arff -T 20news_ss_test.arff \
> >   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
> >   -W weka.classifiers.bayes.NaiveBayesMultinomial
> >
> > === Error on test data ===
> >
> > Correctly Classified Instances 688 91.8558 %
> > Incorrectly Classified Instances 61 8.1442 %
> > Kappa statistic 0.8836
> > Mean absolute error 0.0334
> > Root mean squared error 0.1706
> > Relative absolute error 11.9863 %
> > Root relative squared error 45.151 %
> > Total Number of Instances 749
> >
> >
> > === Confusion Matrix ===
> >
> > a b c d e <-- classified as
> > 308 9 2 0 0 | a = alt.atheism
> > 5 195 0 0 0 | b = comp.graphics
> > 3 11 84 2 0 | c = rec.autos
> > 3 3 0 94 0 | d = rec.sport.hockey
> > 6 11 6 0 7 | e = sci.med
> >
>