Hi,

I have a couple of questions regarding Naive Bayes classification in
Mahout 0.7.

Is there a preferred way to determine when a document doesn't belong
to any of the given categories?  Currently, I'm trying to do this by
explicitly having an "Other" category and including large numbers of
documents in the training and testing that don't match any of the
categories of interest.  I'm getting pretty good results testing one
category at a time against "Other".  The quality drops fairly quickly,
though, as I add more categories to the mix.  As fast as the Naive
Bayes algorithm is in Mahout 0.7, testing one category at a time might
be feasible, but I'm hoping there might be a better and faster way.
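
To make the one-category-at-a-time idea concrete, here is a rough sketch
of how the per-category results could be combined. It is plain Python,
not Mahout code. The function name, the "margin" values (each binary
model's score for its category minus its score for "Other"), and the
threshold are all hypothetical, chosen only for illustration.

```python
def pick_category(margins, threshold=0.0):
    """Combine one-vs-"Other" results into a single label.

    margins: dict mapping category name -> (score for that category minus
             score for "Other"), one entry per binary model. Hypothetical
             input format, not something Mahout produces directly.
    threshold: how far a category must beat "Other" to be accepted.
    """
    if not margins:
        return "Other"
    # Take the category whose binary model is most confident.
    best = max(margins, key=margins.get)
    # Reject the document if even the best category does not clear the bar.
    return best if margins[best] > threshold else "Other"


print(pick_category({"sports": 2.5, "politics": -1.0}))
print(pick_category({"sports": -0.3, "politics": -1.0}))
```

Raising the threshold would send more borderline documents to "Other",
at the cost of missing some true members of each category.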

As I understand it, a trained model, especially when TF-IDF vectors are
used, is specific to the corpus of documents, and the resulting
dictionary, that were vectorized together.  As new, unclassified
documents are added, how should they be handled?  Does the entire
corpus need to be re-vectorized and new models trained, or is there a
more efficient way to incorporate and classify just the new documents?
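
To illustrate the kind of shortcut I have in mind: one possibility,
assuming the dictionary and document-frequency counts from the original
vectorization are kept around, would be to weight each new document
against them and drop any terms the dictionary has never seen. The
sketch below is plain Python, not Mahout's API. The function name, the
input structures, and the particular idf formula are assumptions made
for illustration only.

```python
import math
from collections import Counter


def tfidf_with_fixed_dictionary(tokens, dictionary, doc_freq, num_docs):
    """Vectorize one new document against an existing dictionary.

    tokens:     list of terms from the new document
    dictionary: dict term -> feature index, from the training corpus
    doc_freq:   dict term -> number of training documents containing it
    num_docs:   number of documents in the training corpus
    Returns a sparse vector as dict feature index -> weight.
    """
    # Terms missing from the training dictionary are dropped.
    counts = Counter(t for t in tokens if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        # One common smoothed idf variant (an assumption, not necessarily
        # what any given library uses).
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0))) + 1.0
        vector[dictionary[term]] = tf * idf
    return vector


dictionary = {"bayes": 0, "mahout": 1}
doc_freq = {"bayes": 9, "mahout": 4}
vec = tfidf_with_fixed_dictionary(
    ["mahout", "bayes", "mahout", "hadoop"], dictionary, doc_freq, 10)
print(vec)
```

The obvious drawback is that new vocabulary ("hadoop" above) is ignored
until the corpus is re-vectorized and the model retrained.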

Thanks,
David
-- 
David Engel
[email protected]
