Did you try Complementary Naive Bayes (CNB)? I am guessing the multinomial Naive Bayes mentioned here is a CNB-like implementation and not plain NB.
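For reference, here is a minimal sketch of the complement idea in plain Python. It is not Mahout's or Weka's code: the names `train_cnb` and `predict_cnb` and the smoothing value `alpha` are made up for illustration, and it leaves out the term-frequency transforms and weight normalization that fuller variants add. Each class's weights are estimated from the documents of all the *other* classes, which is why it tends to be less sensitive to unbalanced class sizes.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, labels, alpha=1.0):
    """Return complement log-weights per class for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    per_class = defaultdict(Counter)  # token counts inside each class
    total = Counter()                 # token counts over all classes
    for doc, label in zip(docs, labels):
        per_class[label].update(doc)
        total.update(doc)

    weights = {}
    for label in per_class:
        # Counts taken from every class EXCEPT `label` (the complement).
        comp = {tok: total[tok] - per_class[label][tok] for tok in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[label] = {
            tok: math.log((comp[tok] + alpha) / denom) for tok in vocab
        }
    return weights


def predict_cnb(weights, doc):
    """Pick the class whose complement explains the document worst."""
    counts = Counter(doc)

    def score(label):
        w = weights[label]
        # Tokens unseen in training are ignored.
        return sum(n * w[tok] for tok, n in counts.items() if tok in w)

    return min(weights, key=score)


# Tiny hypothetical training set, only to show the calls.
docs = [["puck", "ice", "goal"], ["puck", "goal"],
        ["engine", "car"], ["car", "wheel", "engine"]]
labels = ["hockey", "hockey", "autos", "autos"]
model = train_cnb(docs, labels)
```

With this toy model, `predict_cnb(model, ["puck", "ice"])` picks `hockey` and `predict_cnb(model, ["car"])` picks `autos`.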
On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <[email protected]> wrote:

> Hello,
>
> I'm giving a try to different classifiers for a classical problem of text
> classification very close to the 20newsgroup one.
> I end up with much better results with Weka NaiveBayesMultinomial than with
> Mahout bayes.
> The main problem comes from the fact that my data is unbalanced. I know
> bayes has difficulties with that yet I'm surprised by the difference between
> weka and mahout.
>
> I went back to the 20newsgroup example, picked 5 classes only and subsampled
> those to get 5 classes with 400 200 100 100 and 30 examples and pretty much
> the same for test set.
> On mahout with bayes 1-gram, I'm getting 66% correctly classified (see below
> for confusion matrix)
> On weka, on the same exact data, without any tuning, I'm getting 92%
> correctly classified.
>
> Would anyone know where the difference comes from and if there are ways I
> could tune Mahout to get better results? my data is small enough for now
> for weka but this won't last.
>
> Many thanks
>
> Benjamin.
>
> MAHOUT:
> -------------------------------------------------------
> Correctly Classified Instances          :        491       65.5541%
> Incorrectly Classified Instances        :        258       34.4459%
> Total Classified Instances              :        749
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a     b     c     d     e     f     <--Classified as
> 14    82    0     4     0     0      |  100   a = rec.sport.hockey
> 0     319   0     0     0     0      |  319   b = alt.atheism
> 0     88    3     9     0     0      |  100   c = rec.autos
> 0     45    0     155   0     0      |  200   d = comp.graphics
> 0     25    0     5     0     0      |  30    e = sci.med
> 0     0     0     0     0     0      |  0     f = unknown
> Default Category: unknown: 5
>
>
> WEKA:
> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T
> 20news_ss_test.arff -F
> "weka.filters.unsupervised.attribute.StringToWordVector -S" -W
> weka.classifiers.bayes.NaiveBayesMultinomial
>
> === Error on test data ===
>
> Correctly Classified Instances         688               91.8558 %
> Incorrectly Classified Instances        61                8.1442 %
> Kappa statistic                          0.8836
> Mean absolute error                      0.0334
> Root mean squared error                  0.1706
> Relative absolute error                 11.9863 %
> Root relative squared error             45.151  %
> Total Number of Instances              749
>
>
> === Confusion Matrix ===
>
>   a   b   c   d   e   <-- classified as
> 308   9   2   0   0 |   a = alt.atheism
>   5 195   0   0   0 |   b = comp.graphics
>   3  11  84   2   0 |   c = rec.autos
>   3   3   0  94   0 |   d = rec.sport.hockey
>   6  11   6   0   7 |   e = sci.med
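To see where the gap comes from, it may help to look at per-class recall rather than overall accuracy. The sketch below just redoes the arithmetic on the two confusion matrices quoted above; the helper names `accuracy` and `per_class_recall` are made up for illustration, and the numbers are copied from the quoted output (rows are true classes, columns are predictions).

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(labels, matrix):
    """Recall per true class: diagonal count divided by the row sum."""
    recall = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        row_total = sum(row)
        recall[label] = row[i] / row_total if row_total else 0.0
    return recall


# Numbers copied from the quoted Mahout output.
MAHOUT_LABELS = ["rec.sport.hockey", "alt.atheism", "rec.autos",
                 "comp.graphics", "sci.med", "unknown"]
MAHOUT = [
    [14, 82, 0, 4, 0, 0],
    [0, 319, 0, 0, 0, 0],
    [0, 88, 3, 9, 0, 0],
    [0, 45, 0, 155, 0, 0],
    [0, 25, 0, 5, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Numbers copied from the quoted Weka output.
WEKA_LABELS = ["alt.atheism", "comp.graphics", "rec.autos",
               "rec.sport.hockey", "sci.med"]
WEKA = [
    [308, 9, 2, 0, 0],
    [5, 195, 0, 0, 0],
    [3, 11, 84, 2, 0],
    [3, 3, 0, 94, 0],
    [6, 11, 6, 0, 7],
]

for name, labels, matrix in [("mahout", MAHOUT_LABELS, MAHOUT),
                             ("weka", WEKA_LABELS, WEKA)]:
    print(name, round(100 * accuracy(matrix), 4))
    for label, value in per_class_recall(labels, matrix).items():
        print("   ", label, round(value, 3))
```

On these numbers the Mahout run assigns 559 of the 749 test documents to alt.atheism, the largest class, so it gets 3 of 100 rec.autos and 0 of 30 sci.med right, whereas the Weka run gets 84 of 100 and 7 of 30. That is the pattern you would expect when the larger class dominates on unbalanced data.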
