Hello,

I'm trying out different classifiers on a classic text classification
problem very close to the 20 Newsgroups one.
I end up with much better results with Weka NaiveBayesMultinomial than with
Mahout bayes.
The main problem comes from the fact that my data is unbalanced. I know
Bayes has difficulties with that, yet I'm surprised by the difference between
Weka and Mahout.

I went back to the 20 Newsgroups example, picked only 5 classes and subsampled
them to get 400, 200, 100, 100 and 30 examples per class, with pretty much
the same distribution for the test set.
On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see below
for the confusion matrix).
On Weka, on the exact same data, without any tuning, I'm getting 92%
correctly classified.
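
For anyone who wants to reproduce the setup, the subsampling step can be sketched roughly as follows. This is a minimal illustration in Python, not the exact script used: the `subsample` helper and the `TARGET_SIZES` name are hypothetical, and the pairing of counts to class names is an assumption inferred from the test-set row totals in the confusion matrices.

```python
import random

# Target training-set sizes per class. Which count goes with which class is
# an assumption, inferred from the test-set row totals.
TARGET_SIZES = {
    "alt.atheism": 400,
    "comp.graphics": 200,
    "rec.autos": 100,
    "rec.sport.hockey": 100,
    "sci.med": 30,
}


def subsample(docs_by_class, target_sizes, seed=0):
    """Return a dict mapping each class to a random subset of its documents."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = {}
    for label, size in target_sizes.items():
        docs = docs_by_class[label]
        # Never request more documents than the class actually has.
        sampled[label] = rng.sample(docs, min(size, len(docs)))
    return sampled
```

Here `docs_by_class` would map each newsgroup name to a list of document paths or ids; only the classes listed in `target_sizes` are kept.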

Would anyone know where the difference comes from, and whether there are ways I
could tune Mahout to get better results? My data is small enough for Weka for
now, but this won't last.

Many thanks

Benjamin.



MAHOUT:
-------------------------------------------------------
Correctly Classified Instances          :        491       65.5541%
Incorrectly Classified Instances        :        258       34.4459%
Total Classified Instances              :        749

=======================================================
Confusion Matrix
-------------------------------------------------------
a        b        c        d        e        f        <--Classified as
14       82       0        4        0        0         |  100       a     = rec.sport.hockey
0        319      0        0        0        0         |  319       b     = alt.atheism
0        88       3        9        0        0         |  100       c     = rec.autos
0        45       0        155      0        0         |  200       d     = comp.graphics
0        25       0        5        0        0         |  30        e     = sci.med
0        0        0        0        0        0         |  0         f     = unknown
Default Category: unknown: 5


WEKA:
java weka.classifiers.meta.FilteredClassifier \
  -t 20news_ss_train.arff -T 20news_ss_test.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
  -W weka.classifiers.bayes.NaiveBayesMultinomial

=== Error on test data ===

Correctly Classified Instances         688               91.8558 %
Incorrectly Classified Instances        61                8.1442 %
Kappa statistic                          0.8836
Mean absolute error                      0.0334
Root mean squared error                  0.1706
Relative absolute error                 11.9863 %
Root relative squared error             45.151  %
Total Number of Instances              749


=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 308   9   2   0   0 |   a = alt.atheism
   5 195   0   0   0 |   b = comp.graphics
   3  11  84   2   0 |   c = rec.autos
   3   3   0  94   0 |   d = rec.sport.hockey
   6  11   6   0   7 |   e = sci.med
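
To make the two matrices easier to compare, overall accuracy and per-class recall can be computed directly from the rows above. In the Mahout matrix, 240 of the 258 errors are predictions of alt.atheism, the largest class, which is what the per-class figures bring out. The following is a minimal sketch in Python; the function and variable names are illustrative, both matrices are reordered to a common label order, and the all-zero `unknown` row and column of the Mahout matrix are left out.

```python
LABELS = ["alt.atheism", "comp.graphics", "rec.autos",
          "rec.sport.hockey", "sci.med"]

# Rows are true classes, columns are predicted classes, both in LABELS order.
MAHOUT = [
    [319,   0,  0,  0, 0],   # alt.atheism
    [ 45, 155,  0,  0, 0],   # comp.graphics
    [ 88,   9,  3,  0, 0],   # rec.autos
    [ 82,   4,  0, 14, 0],   # rec.sport.hockey
    [ 25,   5,  0,  0, 0],   # sci.med
]
WEKA = [
    [308,   9,  2,  0, 0],   # alt.atheism
    [  5, 195,  0,  0, 0],   # comp.graphics
    [  3,  11, 84,  2, 0],   # rec.autos
    [  3,   3,  0, 94, 0],   # rec.sport.hockey
    [  6,  11,  6,  0, 7],   # sci.med
]


def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix, labels):
    """For each true class, the fraction of its instances predicted correctly."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls


if __name__ == "__main__":
    for name, matrix in (("Mahout", MAHOUT), ("Weka", WEKA)):
        print(f"{name}: accuracy = {accuracy(matrix):.4f}")
        for label, recall in per_class_recall(matrix, LABELS).items():
            print(f"  {label:18s} recall = {recall:.3f}")
```

Overall accuracy is dominated by the largest class, so per-class recall is the more telling number on unbalanced data like this.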
