Hi Isabel,

First of all, thanks for your reply.

On 03/28/2012 09:10 AM, Isabel Drost wrote:
> On 27.03.2012 Dimitri Goldin wrote:
>> Having tried Mallet's naive Bayes implementation, we achieved ~95%
>> accuracy without having to balance the training data. Does anybody know
>> which implementation detail might cause this, or why balance seems to
>> influence Mahout's implementation much more?

> Without knowing the Mallet implementation: You describe that you tried using two
> tokenizations for your Mahout runs - what are you using when running Mallet?

No "special" tokenization or stemming was set for Mallet. The default
tokenizer matches tokens using a regular expression (see --token-regex at
http://mallet.cs.umass.edu/import.php); we used "\p{L}+".
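For reference, "\p{L}+" (Java regex syntax) just matches maximal runs of Unicode letters, so digits and punctuation act as token separators. A rough equivalent in stdlib Python (which lacks `\p{...}` classes, so `[^\W\d_]+` is used as an approximation) would be:

```python
import re

# Approximation of Mallet's default --token-regex "\p{L}+":
# match maximal runs of Unicode letters. Python's stdlib re has no
# \p{L} class, so [^\W\d_]+ (word chars minus digits and underscore)
# serves as a stand-in.
LETTERS = re.compile(r"[^\W\d_]+", re.UNICODE)

def tokenize(text):
    return LETTERS.findall(text.lower())

print(tokenize("Müller's 2 cats, naïve?"))
# → ['müller', 's', 'cats', 'naïve']
```

Note the apostrophe splits "Müller's" into two tokens and the digit is dropped entirely, which may already explain some vocabulary differences between the Mallet and Mahout runs.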

> Which Naive Bayes implementation in Mahout did you use?

So far we used the regular Naive Bayes.

> Did you also try running with the complementary naive bayes implementation or
> the logistic regression instead?

I ran Complementary Naive Bayes on the same training sets (unbalanced and
more balanced, as in the previous tests) and achieved roughly the same
results as with regular Naive Bayes; the worst run was also around ~30%
accuracy, on the pretty unbalanced data (listing below).
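That result is a bit surprising, since complementary NB (Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text Classifiers") was designed precisely to soften class-imbalance effects: each class's word weights are estimated from the documents of all *other* classes. A toy sketch of that core idea (hypothetical names, Laplace smoothing, not Mahout's actual code):

```python
import math
from collections import Counter

def cnb_weights(docs_by_class):
    """Toy complementary NB: estimate each class's word weights from the
    documents of every *other* class, which makes the estimates less
    sensitive to skewed class sizes."""
    vocab = set(w for docs in docs_by_class.values() for d in docs for w in d)
    weights = {}
    for c in docs_by_class:
        complement = Counter()
        for other, docs in docs_by_class.items():
            if other != c:
                for d in docs:
                    complement.update(d)
        total = sum(complement.values())
        # Laplace-smoothed log-probability of each word in the complement
        weights[c] = {w: math.log((complement[w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return weights

def classify(doc, weights):
    # Pick the class whose *complement* explains the document worst
    return min(weights, key=lambda c: sum(weights[c].get(w, 0) for w in doc))
```

If even this scheme tracks the majority class, the problem may lie upstream of the classifier (feature extraction, vectorization) rather than in the prior/likelihood estimation itself.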

For completeness' sake, the class sizes in the "worst", unbalanced
training set:

1431 a
4117 b
5348 c
15967 d
2940 e
9095 f
15925 g
10736 h
4441 i

The assigned class still seems to gravitate toward the largest class in
the training set, which would be 'd' in the list above.
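For reference, the class priors implied by those counts show how strongly 'd' (and, almost equally, 'g') dominates; if the per-word evidence is weak, the prior alone can pull most predictions that way. A quick sketch:

```python
# Class sizes from the unbalanced training set quoted above
sizes = {"a": 1431, "b": 4117, "c": 5348, "d": 15967, "e": 2940,
         "f": 9095, "g": 15925, "h": 10736, "i": 4441}
total = sum(sizes.values())  # 70000 documents in all
for label, n in sorted(sizes.items(), key=lambda kv: -kv[1]):
    # Maximum-likelihood class prior P(c) = n_c / N
    print(f"{label}: P = {n / total:.3f}")
```

'd' alone accounts for ~23% of the training documents, so a classifier that leans on the prior already beats the ~11% a uniform guess over nine classes would get.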

Yes, we evaluated the logistic regression (not the adaptive variant),
encoding the features with LuceneTextValueEncoder, the GermanAnalyzer and
a list of stopwords. The accuracy was ~82%, though we did not compare it
to any other implementation.

Thanks,
        Dimitry


