Did you try Complementary Naive Bayes (CNB)? I am guessing the multinomial
Naive Bayes mentioned here is a CNB-like implementation and not plain NB.
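
To make the distinction concrete, here is a minimal sketch of the two ways of estimating word weights. It is not the Mahout or Weka code: the function names, the smoothing value alpha=1.0, and the omission of class priors and weight normalization are all assumptions made for illustration. Plain multinomial NB estimates a class's weights from that class's own documents, while the complement variant estimates them from the documents of all the other classes, so a small class still gets its weights from a large pool of counts.

```python
import math
from collections import Counter


def train_counts(docs):
    """docs: iterable of (label, tokens). Returns per-class token counts and the vocabulary."""
    counts = {}
    vocab = set()
    for label, tokens in docs:
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, sorted(vocab)


def mnb_weights(counts, vocab, alpha=1.0):
    """Multinomial NB: log P(word | class), estimated from the class's own counts."""
    weights = {}
    for label, cnt in counts.items():
        total = sum(cnt.values()) + alpha * len(vocab)
        weights[label] = {w: math.log((cnt[w] + alpha) / total) for w in vocab}
    return weights


def cnb_weights(counts, vocab, alpha=1.0):
    """Complement NB: weights estimated from every class except the target one.

    The log is negated so that, as with mnb_weights, a higher score is better.
    """
    weights = {}
    for label in counts:
        complement = Counter()
        for other, cnt in counts.items():
            if other != label:
                complement.update(cnt)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[label] = {w: -math.log((complement[w] + alpha) / total) for w in vocab}
    return weights


def classify(weights, tokens):
    """Pick the class with the highest summed weight; unseen tokens are ignored."""
    scores = {label: sum(w.get(t, 0.0) for t in tokens) for label, w in weights.items()}
    return max(scores, key=scores.get)
```

Both weight tables plug into the same classify function, so the two estimators can be compared on the same unbalanced split by swapping only the training call.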


On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <[email protected]> wrote:

> Hello,
>
> I'm trying out different classifiers on a classic text classification
> problem, very close to the 20newsgroup one.
> I end up with much better results with Weka NaiveBayesMultinomial than with
> Mahout bayes.
> The main problem comes from the fact that my data is unbalanced. I know
> Naive Bayes has difficulties with that, yet I'm surprised by the difference
> between Weka and Mahout.
>
> I went back to the 20newsgroup example, picked only 5 classes, and
> subsampled them to get 5 classes with 400, 200, 100, 100, and 30 examples,
> and pretty much the same for the test set.
> On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see
> below for the confusion matrix).
> On Weka, on exactly the same data, without any tuning, I'm getting 92%
> correctly classified.
>
> Would anyone know where the difference comes from, and whether there are
> ways I could tune Mahout to get better results? My data is small enough for
> Weka for now, but this won't last.
>
> Many thanks
>
> Benjamin.
>
>
>
> MAHOUT:
> -------------------------------------------------------
> Correctly Classified Instances          :        491       65.5541%
> Incorrectly Classified Instances        :        258       34.4459%
> Total Classified Instances              :        749
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a        b        c        d        e        f        <--Classified as
> 14       82       0        4        0        0         |  100       a     = rec.sport.hockey
> 0        319      0        0        0        0         |  319       b     = alt.atheism
> 0        88       3        9        0        0         |  100       c     = rec.autos
> 0        45       0        155      0        0         |  200       d     = comp.graphics
> 0        25       0        5        0        0         |  30        e     = sci.med
> 0        0        0        0        0        0         |  0         f     = unknown
> Default Category: unknown: 5
>
>
> WEKA:
> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T
> 20news_ss_test.arff -F
> "weka.filters.unsupervised.attribute.StringToWordVector -S" -W
> weka.classifiers.bayes.NaiveBayesMultinomial
>
> === Error on test data ===
>
> Correctly Classified Instances         688               91.8558 %
> Incorrectly Classified Instances        61                8.1442 %
> Kappa statistic                          0.8836
> Mean absolute error                      0.0334
> Root mean squared error                  0.1706
> Relative absolute error                 11.9863 %
> Root relative squared error             45.151  %
> Total Number of Instances              749
>
>
> === Confusion Matrix ===
>
>   a   b   c   d   e   <-- classified as
>  308   9   2   0   0 |   a = alt.atheism
>   5 195   0   0   0 |   b = comp.graphics
>   3  11  84   2   0 |   c = rec.autos
>   3   3   0  94   0 |   d = rec.sport.hockey
>   6  11   6   0   7 |   e = sci.med
>
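
The overall accuracy hides where the errors are, so per-class recall is worth computing from the two confusion matrices quoted above. The sketch below does that; the helper names are hypothetical, the numbers are copied from the quoted output, and the all-zero "unknown" row from the Mahout matrix is left out. In the Mahout run, 559 of the 749 test documents are assigned to alt.atheism (the largest class), and none of the 30 sci.med documents is recovered. The Weka run also does worst on sci.med, with 7 of 30 correct.

```python
def per_class_recall(labels, matrix):
    """Recall per class: diagonal cell divided by the row total (rows are true classes)."""
    recall = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        total = sum(row)
        recall[label] = row[i] / total if total else 0.0
    return recall


def accuracy(matrix):
    """Overall accuracy: sum of diagonal cells divided by the number of instances."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    return correct / sum(sum(row) for row in matrix)


# Rows and columns in the order printed by Mahout; the sixth column is "unknown".
mahout_labels = ["rec.sport.hockey", "alt.atheism", "rec.autos", "comp.graphics", "sci.med"]
mahout = [
    [14, 82, 0, 4, 0, 0],
    [0, 319, 0, 0, 0, 0],
    [0, 88, 3, 9, 0, 0],
    [0, 45, 0, 155, 0, 0],
    [0, 25, 0, 5, 0, 0],
]

# Rows and columns in the order printed by Weka.
weka_labels = ["alt.atheism", "comp.graphics", "rec.autos", "rec.sport.hockey", "sci.med"]
weka = [
    [308, 9, 2, 0, 0],
    [5, 195, 0, 0, 0],
    [3, 11, 84, 2, 0],
    [3, 3, 0, 94, 0],
    [6, 11, 6, 0, 7],
]

mahout_recall = per_class_recall(mahout_labels, mahout)
weka_recall = per_class_recall(weka_labels, weka)
```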
