Funny, I was just doing a similar thing using the ASF email archives.  My
initial run, which tried to classify per mailing list, gave pretty low
performance (61%), but when I classified per project instead, I got quite
high performance.  In my case, I think there was too much overlap between
mailing lists (user vs. dev).  I was using 10K examples per project.

That being said, I can't speak to the differences you are seeing, as I haven't
looked at the Weka code (and likely won't).  Can you put up the code/commands
you used to generate the Mahout test?


On Sep 16, 2011, at 8:00 AM, Benjamin Rey wrote:

> Hello,
> 
> I'm trying out different classifiers on a classic text classification
> problem very close to the 20newsgroups one.
> I end up with much better results with Weka NaiveBayesMultinomial than with
> Mahout bayes.
> The main problem comes from the fact that my data is unbalanced. I know
> bayes has difficulties with that, yet I'm surprised by the size of the
> difference between Weka and Mahout.
> 
> I went back to the 20newsgroups example, picked only 5 classes, and
> subsampled them to 400, 200, 100, 100 and 30 examples, with pretty much
> the same distribution for the test set.
> With Mahout bayes (1-gram), I'm getting 66% correctly classified (see the
> confusion matrix below).
> With Weka, on the exact same data and without any tuning, I'm getting 92%
> correctly classified.
> 
> Would anyone know where the difference comes from, and whether there are
> ways I could tune Mahout to get better results? My data is small enough
> for Weka for now, but that won't last.
> 
> Many thanks
> 
> Benjamin.
> 
> 
> 
> MAHOUT:
> -------------------------------------------------------
> Correctly Classified Instances          :        491       65.5541%
> Incorrectly Classified Instances        :        258       34.4459%
> Total Classified Instances              :        749
> 
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a        b        c        d        e        f        <--Classified as
> 14       82       0        4        0        0        |  100  a = rec.sport.hockey
> 0        319      0        0        0        0        |  319  b = alt.atheism
> 0        88       3        9        0        0        |  100  c = rec.autos
> 0        45       0        155      0        0        |  200  d = comp.graphics
> 0        25       0        5        0        0        |  30   e = sci.med
> 0        0        0        0        0        0        |  0    f = unknown
> Default Category: unknown: 5
> 
> 
> WEKA:
> java weka.classifiers.meta.FilteredClassifier \
>     -t 20news_ss_train.arff -T 20news_ss_test.arff \
>     -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>     -W weka.classifiers.bayes.NaiveBayesMultinomial
> 
> === Error on test data ===
> 
> Correctly Classified Instances         688               91.8558 %
> Incorrectly Classified Instances        61                8.1442 %
> Kappa statistic                          0.8836
> Mean absolute error                      0.0334
> Root mean squared error                  0.1706
> Relative absolute error                 11.9863 %
> Root relative squared error             45.151  %
> Total Number of Instances              749
> 
> 
> === Confusion Matrix ===
> 
>   a   b   c   d   e   <-- classified as
> 308   9   2   0   0 |   a = alt.atheism
>   5 195   0   0   0 |   b = comp.graphics
>   3  11  84   2   0 |   c = rec.autos
>   3   3   0  94   0 |   d = rec.sport.hockey
>   6  11   6   0   7 |   e = sci.med

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
