Hi Isabel,

First of all, thanks for your reply.

On 03/28/2012 09:10 AM, Isabel Drost wrote:
> On 27.03.2012 Dimitri Goldin wrote:
>> Having tried Mallet's naive Bayes implementation, we achieved ~95%
>> accuracy without having to balance the training data. Does anybody know
>> which implementation detail might cause this, or why balance seems to
>> influence Mahout's implementation much more?

> Without knowing the Mallet implementation: You describe that you tried using two
> tokenizations for your Mahout runs - what are you using when running Mallet?

No "special" tokenization or stemming was set for Mallet. The default
tokenizer matches tokens using a regular expression (see --token-regex at
http://mallet.cs.umass.edu/import.php); we used "\p{L}+".
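For reference, "\p{L}+" (Java regex syntax) just matches maximal runs of Unicode letters, so digits and punctuation act as token separators. A rough equivalent in stdlib Python (which lacks `\p{...}` classes, so `[^\W\d_]+` is used as an approximation) would be:

```python
import re

# Approximation of Mallet's default --token-regex "\p{L}+":
# match maximal runs of Unicode letters. Python's stdlib re has no
# \p{L} class, so [^\W\d_]+ (word chars minus digits and underscore)
# serves as a stand-in.
LETTERS = re.compile(r"[^\W\d_]+", re.UNICODE)

def tokenize(text):
    return LETTERS.findall(text.lower())

print(tokenize("Müller's 2 cats, naïve?"))
# → ['müller', 's', 'cats', 'naïve']
```

Note the apostrophe splits "Müller's" into two tokens and the digit is dropped entirely, which may already explain some vocabulary differences between the Mallet and Mahout runs.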

> Which Naive Bayes implementation in Mahout did you use?

So far we used the regular Naive Bayes.

> Did you also try running with the complementary naive bayes implementation or
> the logistic regression instead?

I ran Complementary Naive Bayes on the same training sets (unbalanced and
more balanced, as in the previous tests) and achieved roughly the same
results as with regular Naive Bayes; the worst run was also around ~30%
accuracy, on the pretty unbalanced data (listing below).
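That result is a bit surprising, since complementary NB (Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text Classifiers") was designed precisely to soften class-imbalance effects: each class's word weights are estimated from the documents of all *other* classes. A toy sketch of that core idea (hypothetical names, Laplace smoothing, not Mahout's actual code):

```python
import math
from collections import Counter

def cnb_weights(docs_by_class):
    """Toy complementary NB: estimate each class's word weights from the
    documents of every *other* class, which makes the estimates less
    sensitive to skewed class sizes."""
    vocab = set(w for docs in docs_by_class.values() for d in docs for w in d)
    weights = {}
    for c in docs_by_class:
        complement = Counter()
        for other, docs in docs_by_class.items():
            if other != c:
                for d in docs:
                    complement.update(d)
        total = sum(complement.values())
        # Laplace-smoothed log-probability of each word in the complement
        weights[c] = {w: math.log((complement[w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return weights

def classify(doc, weights):
    # Pick the class whose *complement* explains the document worst
    return min(weights, key=lambda c: sum(weights[c].get(w, 0) for w in doc))
```

If even this scheme tracks the majority class, the problem may lie upstream of the classifier (feature extraction, vectorization) rather than in the prior/likelihood estimation itself.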

For completeness' sake, the class sizes in the "worst", unbalanced
training set:

1431 a
4117 b
5348 c
15967 d
2940 e
9095 f
15925 g
10736 h
4441 i

The assigned class still seems to gravitate toward the largest class in
the training set, which would be 'd' in the list above.
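For reference, the class priors implied by those counts show how strongly 'd' (and, almost equally, 'g') dominates; if the per-word evidence is weak, the prior alone can pull most predictions that way. A quick sketch:

```python
# Class sizes from the unbalanced training set quoted above
sizes = {"a": 1431, "b": 4117, "c": 5348, "d": 15967, "e": 2940,
         "f": 9095, "g": 15925, "h": 10736, "i": 4441}
total = sum(sizes.values())  # 70000 documents in all
for label, n in sorted(sizes.items(), key=lambda kv: -kv[1]):
    # Maximum-likelihood class prior P(c) = n_c / N
    print(f"{label}: P = {n / total:.3f}")
```

'd' alone accounts for ~23% of the training documents, so a classifier that leans on the prior already beats the ~11% a uniform guess over nine classes would get.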

Yes, we evaluated the logistic regression (not the adaptive variant),
encoding the features with LuceneTextValueEncoder, the GermanAnalyzer and
a list of stopwords. The accuracy was ~82%, though we did not compare it
to any other implementation.

Thanks,
        Dimitry


