Hi,

We were evaluating Mahout 0.6's Naive Bayes implementation using a training set of 70,000 documents (we know that with this number of documents, distributed training does not make much sense yet).
During the tests we noticed that accuracy is around 80% on the 20newsgroups data, which is quite balanced (in the sense that there are approximately the same number of documents per class). Most documents tended to be classified as the class with the largest number of training documents.
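To illustrate what we suspect is happening: in a standard multinomial naive Bayes, the log class prior log(N_c / N) is added to every score, so a class with many more training documents gets a head start that sparse word evidence may not overcome. This is not Mahout's actual code, just a minimal toy sketch (our own function and variable names) showing the effect:

```python
import math
from collections import Counter

def train(docs):
    """Train a toy multinomial naive Bayes.
    docs: list of (token_list, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        wc = word_counts.setdefault(label, Counter())
        for t in tokens:
            wc[t] += 1
            vocab.add(t)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior,
    using Laplace smoothing for unseen words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n_docs in class_counts.items():
        lp = math.log(n_docs / total_docs)  # class prior term
        wc = word_counts[label]
        total_words = sum(wc.values())
        for t in tokens:
            lp += math.log((wc[t] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Imbalanced training data: 9 docs of class "A" (word "x"),
# 1 doc of class "B" (word "y").
imbalanced = train([(["x"], "A")] * 9 + [(["y"], "B")])
print(predict(imbalanced, ["y"]))  # prints "A" - the prior wins,
                                   # even though "y" only occurs in "B"

# With balanced training data the same test document goes to "B".
balanced = train([(["x"], "A")] * 9 + [(["y"], "B")] * 9)
print(predict(balanced, ["y"]))  # prints "B"
```

Implementations that dampen or drop the prior (or use complement counts) are much less sensitive to this, which might explain the difference we see against Mallet.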
Using our own data we only achieved an accuracy between ~35% and ~55%, depending on the class sizes within the test sets. We also tried replacing the tokenization, which is currently performed on tabs and spaces using Guava's Splitter class, with Lucene's GermanAnalyzer. This gave us around 10% more accuracy with balanced training data, resulting in ~89% accuracy. Having tried Mallet's naive Bayes implementation, we achieved ~95% accuracy without having to balance the training data.

Does anybody know which implementation detail might cause this, or why class balance seems to influence Mahout's implementation so much more? I also found the following thread from fall 2011, which seems to describe a similar problem: http://search-lucene.com/m/DLzRcMLnWM Unfortunately there was no follow-up to it, but maybe someone has already solved it.

Thanks in advance,
Dimitry
