------
Robin Anil

On Fri, Jul 20, 2012 at 3:27 PM, David Engel <[email protected]> wrote:

> Hi,
>
> I have a couple of questions regarding Naive Bayes classification in
> Mahout 0.7.
>
> Is there a preferred way to determine when a document doesn't belong
> to any of the given categories?  Currently, I'm trying to do this by
> explicitly having an "Other" category and including large numbers of
> documents in the training and testing that don't match any of the
> categories of interest.  I'm getting pretty good results testing one
> category at a time against "Other".  The quality drops fairly quickly,
> though, as I add more categories to the mix.  As fast as the Naive
> Bayes algorithm is in Mahout 0.7, testing one category at a time might
> be feasible, but I'm hoping there might be a better and faster way.
>
Try the CNB (complementary Naive Bayes) version; it tends to handle
multiple classes plus an "Other" category better.
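
For reference, here is a rough Python sketch of the complement Naive Bayes idea. It is a toy illustration, not Mahout's implementation: the function names, the smoothing parameter alpha and the three tiny training documents are all made up for the example. Each class is scored using term counts from every other class, and the class with the lowest complement score wins.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, alpha=1.0):
    """docs: list of (label, tokens). Returns complement log-weights per class.

    alpha is an assumed additive-smoothing constant, not a Mahout setting.
    """
    per_class = defaultdict(Counter)
    for label, tokens in docs:
        per_class[label].update(tokens)

    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    vocab = set(total)

    weights = {}
    for label, counts in per_class.items():
        # Term counts over all classes except this one (the "complement").
        comp = {t: total[t] - counts[t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[label] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights


def classify_cnb(weights, tokens):
    """Pick the class whose complement explains the document worst."""
    tf = Counter(tokens)

    def score(label):
        w = weights[label]
        # Terms never seen in training are simply skipped.
        return sum(n * w[t] for t, n in tf.items() if t in w)

    return min(weights, key=score)


# Hypothetical toy corpus with an explicit "other" class.
training = [
    ("sports", "goal team match goal".split()),
    ("tech", "code compiler bug code".split()),
    ("other", "weather recipe travel garden".split()),
]
model = train_cnb(training)
sports_guess = classify_cnb(model, "team goal".split())
tech_guess = classify_cnb(model, "compiler bug".split())
```

Because each class's weights come from the pooled counts of all other classes, a small class is less likely to be swamped by a large "Other" class than with plain Naive Bayes.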

>
> As I understand it, a trained model, especially when tfidf vectors are
> used, is specific to the corpus of documents and resulting dictionary
> that were vectorized together.  As new, unclassified documents are
> added, how should they be handled?  Does the entire corpus need to
> be re-vectorized and new models trained or is there a more efficient
> way to incorporate and classify just the new documents?
>
Yes. Do not use the tfidf-based encoder; use the randomized encoder
(EncodedVectorsFromSequenceFiles.java or "bin/mahout seqencoded").

See the 20 newsgroups example shell script in <mahout>/examples/bin
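
To illustrate why a hashed (randomized) encoding avoids re-vectorizing the corpus, here is a minimal Python sketch of the general hashing trick. It is not the Mahout encoder itself: the function name, the vector size of 1024 and the probes parameter are assumptions chosen for the example. Each token is hashed straight to a vector index, so no corpus-wide dictionary is needed and a new document can be encoded on its own.

```python
import hashlib


def encode(tokens, num_features=1024, probes=2):
    """Hash tokens into a fixed-size count vector (hashing trick).

    num_features and probes are illustrative values, not Mahout defaults.
    """
    vec = [0.0] * num_features
    for tok in tokens:
        for p in range(probes):
            # A stable hash, so the same token always maps to the same slots,
            # regardless of which other documents have been seen.
            digest = hashlib.md5(f"{p}:{tok}".encode("utf-8")).digest()
            idx = int.from_bytes(digest[:8], "big") % num_features
            vec[idx] += 1.0
    return vec


# A new, previously unseen document is encoded without touching the corpus.
new_doc = "the quick brown fox".split()
v1 = encode(new_doc)
v2 = encode(new_doc)
```

The trade-off is that different tokens can collide in the same slot; using more than one probe per token and a reasonably large vector keeps the effect of collisions small.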


>
> Thanks,
> David
> --
> David Engel
> [email protected]
>
