This is a really tiny training set.  NB works much better with larger data
sets.  The pattern of performing much better on training data than on test
data indicates that the small data set is giving you problems.  This could
be overfitting, but it is likely also exacerbated by the number of unknown
words encountered at test time.
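To illustrate the unknown-word problem concretely, here is a toy sketch (not Mahout code; a from-scratch multinomial NB with Laplace smoothing on invented two-document data). On the training documents themselves the classifier is perfect, but a held-out document made entirely of unseen words scores identically under every class, so the prediction carries no signal:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class counts, per-class word counts, vocab."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def log_prob(tokens, label, class_counts, word_counts, vocab):
    """Log P(label) + sum of log P(token|label) with Laplace (add-one) smoothing."""
    lp = math.log(class_counts[label] / sum(class_counts.values()))
    n = sum(word_counts[label].values())
    v = len(vocab)
    for t in tokens:
        # An unseen token contributes log(1 / (n + v)) -- the same tiny
        # constant for every class, so it adds no discriminative information.
        lp += math.log((word_counts[label][t] + 1) / (n + v))
    return lp

def classify(tokens, model):
    cc, wc, vocab = model
    return max(cc, key=lambda lbl: log_prob(tokens, lbl, cc, wc, vocab))

# Deliberately tiny training set, mimicking the situation in the question.
train = [
    (["cheap", "pills", "buy"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
]
model = train_nb(train)
```

With only a couple of documents per class, almost every held-out document contains mostly unseen words, which is one reason training accuracy looks so much better than hold-out accuracy.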

My own tendency would be to use L1-regularized logistic regression on this.
In R, glmnet is an excellent choice because it lets you use cross-validation
to choose the regularization strength and estimate expected performance.
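If R is not an option, a rough Python analogue of the glmnet workflow (an assumption on my part, not something from this thread) is scikit-learn's LogisticRegressionCV, which likewise selects the L1 penalty strength by cross-validation. A minimal sketch on invented toy data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy stand-in for the real corpus.
docs = [
    "cheap pills buy now",
    "meeting agenda notes",
    "buy cheap meds",
    "project meeting notes",
]
labels = np.array([1, 0, 1, 0])  # 1 = spam, 0 = ham (hypothetical labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)

# penalty="l1" with the liblinear solver gives the sparse, L1-regularized
# fit; Cs and cv control the grid and folds of the cross-validation that
# picks the regularization strength (small values here only because the
# toy set is tiny).
clf = LogisticRegressionCV(Cs=5, cv=2, penalty="l1", solver="liblinear")
clf.fit(X, labels)

prediction = clf.predict(vec.transform(["buy cheap pills"]))
```

The L1 penalty drives most word weights to exactly zero, which is what makes this approach more robust than NB on a small vocabulary-heavy problem.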

On Sat, Jul 7, 2012 at 1:48 PM, Alexander Aristov <[email protected]> wrote:

> People,
>
> I am implementing Naive Bayes classifier on my text data and get poor
> results.
>
> Self-Testing on trained data gives 95% pos and 5% neg results (not bad).
> But testing on hold out set gives 60-40% that is not good for me.
>
> I tried to play with vectorizer arguments but setting them randomly makes
> results only worse. I have 7 categories and about 20-90 docs per category.
>
> What can you suggest me to do to improve results? Tried complementary NB
> alg but it gives approximately the same results.
>
> I use mahout trunk version 0.8.
>
> Best Regards
> Alexander Aristov
>
