On Sat, Jul 21, 2012 at 11:51:02AM -0500, Robin Anil wrote:
> On Fri, Jul 20, 2012 at 3:27 PM, David Engel <[email protected]> wrote:
>
> > Hi,
> >
> > I have a couple of questions regarding Naive Bayes classification in
> > Mahout 0.7.
> >
> > Is there a preferred way to determine when a document doesn't belong
> > to any of the given categories?  Currently, I'm trying to do this by
> > explicitly having an "Other" category and including large numbers of
> > documents in the training and testing that don't match any of the
> > categories of interest.  I'm getting pretty good results testing one
> > category at a time against "Other".  The quality drops fairly quickly,
> > though, as I add more categories to the mix.  As fast as the Naive
> > Bayes algorithm is in Mahout 0.7, testing one category at a time might
> > be feasible, but I'm hoping there might be a better and faster way.
>
> Try using the CNB version for better multiclass results with "Other".
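For reference, the CNB path in Mahout 0.7 is selected with the -c flag on the trainnb/testnb drivers.  A rough sketch, modeled on examples/bin/classify-20newsgroups.sh (paths here are placeholders, not exact commands from the script):

```shell
# Sketch of the tf-idf + CNB steps, modeled on Mahout 0.7's
# classify-20newsgroups.sh.  Input/output paths are placeholders.

# Vectorize labeled documents (tf-idf weighting, as in the example script).
bin/mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

# Train: the -c flag switches from standard NB to Complementary NB (CNB).
bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c

# Test: pass -c again so scoring uses the complementary model.
bin/mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing -c
```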
I've actually been using CNB mainly instead of plain NB.  Sorry for not
stating that earlier.  How do I determine when an item matches no class,
or more than one class?  TestNaiveBayesDriver picks at most one class --
the one with the highest score.  I guess TestNaiveBayesDriver could, in
theory, pick no class if none of the scores are above Long.MIN_VALUE,
but I don't believe I've ever seen that.  Does it happen in practice?

> > As I understand it, a trained model, especially when tfidf vectors are
> > used, is specific to the corpus of documents and resulting dictionary
> > that were vectorized together.  As new, unclassified documents are
> > added, how should they be handled?  Does the entire corpus need to
> > be re-vectorized and new models trained, or is there a more efficient
> > way to incorporate and classify just the new documents?
>
> Yes, do not use the tfidf-based encoder; use the randomized encoder
> (EncodedVectorsFromSequenceFiles.java or "bin/mahout seq2encoded").
>
> See the 20 newsgroups example shell script in <mahout>/examples/bin

Ah, I didn't think NB could use vectors made from seq2encoded.  In my
testing, I focused mainly on the 20newsgroups example, which only used
seq2sparse and tfidf vectors for NB/CNB, and feature-encoded vectors for
SGD.  I should have realized a feature vector is a feature vector is a
feature vector.

David
-- 
David Engel
[email protected]
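On the "matches no class, or more than one class" question above: TestNaiveBayesDriver itself only takes the arg-max, but one common workaround is to post-process the per-class scores with thresholds of your own.  A generic sketch of that idea -- this is not Mahout API, and the threshold values are purely illustrative:

```python
# Generic rejection/ambiguity thresholding over per-class scores.
# NOT Mahout API -- just a sketch: take the arg-max, but report "no
# class" when the best score is too low, and report both top classes
# when the runner-up is too close to the winner.

def decide(scores, reject_below=-10.0, ambiguity_margin=0.5):
    """scores: dict mapping class label -> log-likelihood-style score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    if best_score < reject_below:
        return []                           # matches no class
    if len(ranked) > 1 and best_score - ranked[1][1] < ambiguity_margin:
        return [best_label, ranked[1][0]]   # too close to call: report both
    return [best_label]                     # confident single-class match

print(decide({"sports": -2.0, "politics": -8.0}))    # clear winner
print(decide({"sports": -2.0, "politics": -2.1}))    # ambiguous
print(decide({"sports": -20.0, "politics": -25.0}))  # no class
```

The right threshold values depend on how the scores are scaled, so they would need to be tuned on held-out data rather than hard-coded as above.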
