------ Robin Anil
On Fri, Jul 20, 2012 at 3:27 PM, David Engel <[email protected]> wrote:

> Hi,
>
> I have a couple of questions regarding Naive Bayes classification in
> Mahout 0.7.
>
> Is there a preferred way to determine when a document doesn't belong
> to any of the given categories? Currently, I'm trying to do this by
> explicitly having an "Other" category and including large numbers of
> documents in the training and testing that don't match any of the
> categories of interest. I'm getting pretty good results testing one
> category at a time against "Other". The quality drops fairly quickly,
> though, as I add more categories to the mix. As fast as the Naive
> Bayes algorithm is in Mahout 0.7, testing one category at a time might
> be feasible, but I'm hoping there might be a better and faster way.

Try the CNB (Complementary Naive Bayes) version for better multiclass results with an "Other" category.

> As I understand it, a trained model, especially when tfidf vectors are
> used, is specific to the corpus of documents and resulting dictionary
> that were vectorized together. As new, unclassified documents are
> added, how should they be handled? Does the entire corpus need to
> be re-vectorized and new models trained or is there a more efficient
> way to incorporate and classify just the new documents?

Yes, do not use the tfidf-based encoder; use the randomized encoder instead (EncodedVectorsFromSequenceFiles.java, or bin/mahout seq2encoded). See the 20 newsgroups example shell script in <mahout>/examples/bin.

> Thanks,
> David
> --
> David Engel
> [email protected]
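To illustrate why a hashed (randomized) encoder avoids re-vectorizing the corpus, here is a minimal sketch of the general hashing trick. It is not Mahout's implementation: the class name `HashedEncoderSketch`, the `encode` method, the tokenization rule, and the vector size are all assumptions made for the example. The point is only that a token's slot is computed from the token itself, so no corpus-wide dictionary is needed and a new document can be encoded on its own.

```java
import java.util.Arrays;
import java.util.Locale;

/** Illustrative only: a hashing-trick encoder, not Mahout's actual code. */
final class HashedEncoderSketch {

    /** Maps text to a fixed-size term-count vector without any dictionary. */
    static double[] encode(String text, int dimensions) {
        double[] vector = new double[dimensions];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            // The index depends only on the token, so a word never seen
            // during training still lands in a well-defined slot.
            int index = Math.floorMod(token.hashCode(), dimensions);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        int dimensions = 32; // hypothetical size; real setups use far more
        double[] trainingDoc = encode("naive bayes classification in mahout", dimensions);
        double[] newDoc = encode("a brand new document about mahout", dimensions);
        System.out.println(Arrays.toString(trainingDoc));
        System.out.println(Arrays.toString(newDoc));
    }
}
```

Both documents place "mahout" in the same slot even though they were encoded independently. The trade-off is that unrelated tokens can collide in one slot, and corpus-wide weights such as IDF are not available in this form.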
