Hi,

We were evaluating Mahout 0.6's Naive Bayes implementation using a
training set of 70000 documents (we know that with this amount of
documents distributed training does not yet make much sense).

During the tests we noticed that accuracy is around 80% on the
20newsgroups data, which is quite balanced (in the sense that there are
approximately the same number of documents per class). Most documents
tended to be classified as the class with the largest number of
training documents.
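To illustrate why imbalance can pull predictions toward the largest class: in multinomial Naive Bayes the score is log P(c) plus the summed log word likelihoods, so when the likelihoods are similar across classes, the class prior decides. This is a minimal toy sketch (not Mahout's actual code; the document counts and likelihood value are made up for illustration):

```java
// Toy illustration of prior-driven bias in Naive Bayes (hypothetical numbers).
public class PriorBias {
    public static void main(String[] args) {
        int docsA = 9000, docsB = 1000;        // assumed imbalanced training set
        int total = docsA + docsB;
        double logPriorA = Math.log((double) docsA / total);
        double logPriorB = Math.log((double) docsB / total);
        // Assume a test document whose words are equally likely under both
        // classes, so the likelihood term is identical:
        double logLikelihood = -42.0;
        double scoreA = logPriorA + logLikelihood;
        double scoreB = logPriorB + logLikelihood;
        // The prior alone tips the decision toward the majority class A.
        System.out.println(scoreA > scoreB ? "A" : "B");  // prints "A"
    }
}
```

Whether this is the actual cause in Mahout 0.6 would depend on how (or whether) its implementation applies and smooths the class priors.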

Using our own data we only achieved an accuracy between ~35% and ~55%,
depending on the class sizes within the test sets.
We also tried replacing the tokenization, which is currently performed
on tabs and spaces using Guava's Splitter class, with Lucene's
GermanAnalyzer. This gave us around 10% more accuracy with balanced
training data, resulting in ~89% accuracy.
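The difference between the two tokenizations can be sketched with stdlib Java only. A plain split on tabs and spaces keeps case and punctuation attached to tokens, fragmenting the vocabulary; an analyzer normalizes them (Lucene's GermanAnalyzer additionally removes German stopwords and stems, which this simplified stand-in does not do):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenizeDemo {
    public static void main(String[] args) {
        String text = "Die Häuser, die Häusern ähneln.";

        // Plain split on tabs and spaces, as with a whitespace Splitter:
        List<String> raw = Arrays.asList(text.split("[\t ]+"));

        // Simplified stand-in for analysis: lowercase and strip punctuation.
        List<String> analyzed = raw.stream()
                .map(t -> t.toLowerCase().replaceAll("\\p{Punct}", ""))
                .collect(Collectors.toList());

        System.out.println(raw);      // "Die" and "die" stay distinct features
        System.out.println(analyzed); // both map to the same feature "die"
    }
}
```

With the raw split, "Die", "die", "Häuser," and "Häusern" are four unrelated features; after normalization (and, in the real GermanAnalyzer, stemming) they collapse, which plausibly explains the accuracy gain you saw.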

Having tried Mallet's Naive Bayes implementation, we achieved ~95%
accuracy without having to balance the training data. Does anybody know
which implementation detail might cause this, or why balance seems to
influence Mahout's implementation so much more?

I also found the following thread from fall 2011, which seems to
describe a similar problem:
http://search-lucene.com/m/DLzRcMLnWM

Unfortunately there was no follow-up to it, but maybe someone has
already solved the problem.

Thanks in advance,
    Dimitry
