Hello,

I'm trying out different classifiers on a classic text classification problem very close to the 20newsgroup one. I end up with much better results with Weka NaiveBayesMultinomial than with Mahout bayes. The main problem comes from the fact that my data is unbalanced. I know Bayes has difficulties with that, yet I'm surprised by the difference between Weka and Mahout.
I went back to the 20newsgroup example, picked 5 classes only and subsampled those to get 5 classes with 400, 200, 100, 100 and 30 examples, and pretty much the same for the test set.

On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see below for the confusion matrix).
On Weka, on the same exact data, without any tuning, I'm getting 92% correctly classified.

Would anyone know where the difference comes from, and if there are ways I could tune Mahout to get better results? My data is small enough for now for Weka, but this won't last.

Many thanks,
Benjamin

MAHOUT:

-------------------------------------------------------
Correctly Classified Instances          :        491       65.5541%
Incorrectly Classified Instances        :        258       34.4459%
Total Classified Instances              :        749
=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       c       d       e       f       <--Classified as
14      82      0       4       0       0        |  100   a = rec.sport.hockey
0       319     0       0       0       0        |  319   b = alt.atheism
0       88      3       9       0       0        |  100   c = rec.autos
0       45      0       155     0       0        |  200   d = comp.graphics
0       25      0       5       0       0        |  30    e = sci.med
0       0       0       0       0       0        |  0     f = unknown
Default Category: unknown: 5

WEKA:

java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T 20news_ss_test.arff -F "weka.filters.unsupervised.attribute.StringToWordVector -S" -W weka.classifiers.bayes.NaiveBayesMultinomial

=== Error on test data ===

Correctly Classified Instances         688               91.8558 %
Incorrectly Classified Instances        61                8.1442 %
Kappa statistic                          0.8836
Mean absolute error                      0.0334
Root mean squared error                  0.1706
Relative absolute error                 11.9863 %
Root relative squared error             45.151  %
Total Number of Instances              749

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 308   9   2   0   0 |   a = alt.atheism
   5 195   0   0   0 |   b = comp.graphics
   3  11  84   2   0 |   c = rec.autos
   3   3   0  94   0 |   d = rec.sport.hockey
   6  11   6   0   7 |   e = sci.med
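To make the two confusion matrices above easier to compare class by class, here is a minimal Python sketch. It is not part of Mahout or Weka; the helper names (`accuracy`, `per_class_recall`, `predicted_counts`) are made up for illustration, and the numbers are simply copied from the two outputs, with Mahout's all-zero "unknown" row and column left out.

```python
# Confusion matrices copied from the outputs above.
# Rows are the true class, columns are the predicted class, and the
# column order is assumed to match the row order within each matrix.
# Mahout's "unknown" row/column is omitted because it is all zeros.
mahout = {
    "rec.sport.hockey": [14, 82, 0, 4, 0],
    "alt.atheism":      [0, 319, 0, 0, 0],
    "rec.autos":        [0, 88, 3, 9, 0],
    "comp.graphics":    [0, 45, 0, 155, 0],
    "sci.med":          [0, 25, 0, 5, 0],
}
weka = {
    "alt.atheism":      [308, 9, 2, 0, 0],
    "comp.graphics":    [5, 195, 0, 0, 0],
    "rec.autos":        [3, 11, 84, 2, 0],
    "rec.sport.hockey": [3, 3, 0, 94, 0],
    "sci.med":          [6, 11, 6, 0, 7],
}


def accuracy(matrix):
    """Fraction of all instances that sit on the diagonal."""
    rows = list(matrix.values())
    correct = sum(row[i] for i, row in enumerate(rows))
    total = sum(sum(row) for row in rows)
    return correct / total


def per_class_recall(matrix):
    """For each true class, the fraction of its instances predicted correctly."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(matrix.items())
    }


def predicted_counts(matrix):
    """For each class, how many test instances were assigned to it (column sums)."""
    rows = list(matrix.values())
    return {
        label: sum(row[i] for row in rows)
        for i, label in enumerate(matrix)
    }


if __name__ == "__main__":
    for name, matrix in (("Mahout", mahout), ("Weka", weka)):
        print(f"{name}: accuracy = {accuracy(matrix):.4f}")
        recall = per_class_recall(matrix)
        counts = predicted_counts(matrix)
        for label in matrix:
            print(f"  {label:<18} recall = {recall[label]:.3f}  predicted = {counts[label]}")
```

Looking at the column sums this way shows how many test documents each tool assigns to the largest class, alt.atheism, which is a quick check of how much the class imbalance is driving the gap.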
