Hi, I have a couple of questions regarding Naive Bayes classification in Mahout 0.7.

Is there a preferred way to determine when a document doesn't belong to any of the given categories? Currently, I'm doing this by explicitly having an "Other" category and including large numbers of documents in training and testing that don't match any of the categories of interest. I'm getting pretty good results testing one category at a time against "Other". The quality drops fairly quickly, though, as I add more categories to the mix. As fast as the Naive Bayes algorithm is in Mahout 0.7, testing one category at a time might be feasible, but I'm hoping there is a better and faster way.

As I understand it, a trained model, especially when tfidf vectors are used, is specific to the corpus of documents and the resulting dictionary that were vectorized together. As new, unclassified documents are added, how should they be handled? Does the entire corpus need to be re-vectorized and new models trained, or is there a more efficient way to incorporate and classify just the new documents?

Thanks,
David

--
David Engel
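
To make the second question concrete, here is a minimal sketch of the kind of incremental vectorization being asked about. It is plain Python, not Mahout code: the function names (fit_tfidf, vectorize_new) and the smoothed IDF formula are assumptions chosen for illustration only. The idea is to freeze the dictionary and document frequencies taken from the training corpus, then map each new document into that same feature space, dropping any term the dictionary has never seen.

```python
import math
from collections import Counter


def fit_tfidf(train_docs):
    """Build a dictionary and IDF weights from a tokenized training corpus."""
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(train_docs)
    # Stable term -> column index mapping (the "dictionary").
    vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
    # Smoothed IDF; this exact formula is an assumption, not Mahout's.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return vocab, idf


def vectorize_new(tokens, vocab, idf):
    """Map a new document into the frozen feature space as a sparse dict."""
    # Terms missing from the training dictionary are silently dropped.
    counts = Counter(t for t in tokens if t in vocab)
    return {vocab[t]: tf * idf[t] for t, tf in counts.items()}


if __name__ == "__main__":
    train = [["naive", "bayes"], ["bayes", "mahout"]]
    vocab, idf = fit_tfidf(train)
    print(vectorize_new(["bayes", "bayes", "hadoop"], vocab, idf))
```

With this approach the trained model would not need to change, because new vectors use the same column indices it was trained on. The likely cost is that the frozen IDF weights and dictionary drift out of date as new vocabulary accumulates, so periodic re-vectorization and retraining would probably still be needed.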
