On Sat, Jul 21, 2012 at 11:51:02AM -0500, Robin Anil wrote:
> ------
> Robin Anil
> 
> 
> On Fri, Jul 20, 2012 at 3:27 PM, David Engel <[email protected]> wrote:
> 
> > Hi,
> >
> > I have a couple of questions regarding Naive Bayes classification in
> > Mahout 0.7.
> >
> > Is there a preferred way to determine when a document doesn't belong
> > to any of the given categories?  Currently, I'm trying to do this by
> > explicitly having an "Other" category and including large numbers of
> > documents in the training and testing that don't match any of the
> > categories of interest.  I'm getting pretty good results testing one
> > category at a time against "Other".  The quality drops fairly quickly,
> > though, as I add more categories to the mix.  As fast as the Naive
> > Bayes algorithm is in Mahout 0.7, testing one category at a time might
> > be feasible, but I'm hoping there might be a better and faster way.
> >
> Try using the CNB version for better multiclass results with "Other".

I've actually been using CNB mainly instead of plain NB.  Sorry for
not stating that earlier.

How do I determine when an item matches no class, or more than one class?
TestNaiveBayesDriver picks at most one class -- the one with the
highest score.  I guess TestNaiveBayesDriver could, in theory, pick no
class if none of the scores are above Long.MIN_VALUE, but I don't
believe I've ever seen that.  Does it happen in practice?
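
For illustration, here is a minimal sketch of one way "no class" and
"more than one class" could be decided from per-label scores.  It is
not part of Mahout or TestNaiveBayesDriver: the class ScoreThreshold,
the method select, and the minScore and margin parameters are all
hypothetical, and it assumes a higher score is better.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Mahout class. Assumes higher scores are better.
class ScoreThreshold {

    /**
     * Returns the indices of all labels whose score is at least minScore
     * and within margin of the best score. An empty list means "no class".
     */
    static List<Integer> select(double[] scores, double minScore, double margin) {
        double best = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            best = Math.max(best, s);
        }

        List<Integer> picked = new ArrayList<>();
        if (scores.length == 0 || best < minScore) {
            return picked; // nothing scored well enough: treat as "Other"
        }
        for (int i = 0; i < scores.length; i++) {
            // Keep every label that clears the floor and is close to the best.
            if (scores[i] >= minScore && best - scores[i] <= margin) {
                picked.add(i);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        // Made-up scores for three labels, for demonstration only.
        double[] scores = {-120.5, -98.2, -99.0};
        System.out.println(select(scores, -110.0, 1.0));
    }
}
```

Both thresholds would have to be tuned on held-out data.  The raw
scores are not calibrated probabilities, so a fixed cutoff may not
transfer well between documents of very different lengths.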

> > As I understand it, a trained model, especially when tfidf vectors are
> > used, is specific to the corpus of documents and resulting dictionary
> > that were vectorized together.  As new, unclassified documents are
> > added, how should they be handled?  Does the entire corpus need to
> > be re-vectorized and new models trained or is there a more efficient
> > way to incorporate and classify just the new documents?
> >
> Yes, do not use the tfidf-based encoder; use the randomized encoder
> (EncodedVectorsFromSequenceFiles.java or bin/mahout seq2encoded).
> 
> See the 20 newsgroups example shell script in <mahout>/examples/bin

Ah, I didn't think NB could use vectors made from seq2encoded.  In my
testing, I focused mainly on the 20newsgroups example which only used
seq2sparse and tfidf vectors for NB/CNB and feature encoded vectors
for SGD.  I should have realized a feature vector is a feature vector
is a feature vector.
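
To illustrate why hashed encoding sidesteps the re-vectorizing
problem, here is a minimal sketch of the general hashing trick.  It is
not Mahout's encoder and makes no claim about what seq2encoded does
internally; the class HashedEncoder, the method encode, and the
tokenization are assumptions made for the example.  Each token's
index comes from a hash rather than from a corpus-wide dictionary, so
a new document can be vectorized on its own.

```java
// Hypothetical illustration of the hashing trick; not a Mahout class.
class HashedEncoder {

    /**
     * Maps text to a fixed-size term-count vector. The index of each token
     * is derived from its hash, so no shared dictionary is needed.
     */
    static double[] encode(String text, int dims) {
        double[] vector = new double[dims];
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            // floorMod keeps the index non-negative for negative hash codes.
            int index = Math.floorMod(token.hashCode(), dims);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // A document seen after training encodes into the same feature space.
        double[] v = encode("Naive Bayes in Mahout", 32);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

The trade-off is that distinct tokens can collide in the same slot,
and corpus-level weighting such as IDF is not available without a
pass over the whole corpus.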

David
-- 
David Engel
[email protected]
