Hi Toyoharu,

Mahout Naive Bayes uses Laplace smoothing (the alpha_I parameter, default 1) to handle terms that do not appear in the training set. See Rennie et al., sec. 2.3 [1].
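
To illustrate why an unseen term still contributes a finite score, here is a minimal sketch of Laplace smoothing for a plain multinomial Naive Bayes model. This is not Mahout's actual weight computation; the class name, method name, and the counts used are hypothetical and chosen only for illustration.

```java
public class LaplaceSmoothingSketch {

    // Smoothed log-probability of a term given a label (illustrative formula):
    //   log((termCount + alpha) / (labelTotal + alpha * vocabSize))
    // With alpha > 0 a term with termCount == 0 still gets a finite value.
    static double smoothedLogProb(double termCount, double labelTotal,
                                  double alpha, int vocabSize) {
        return Math.log((termCount + alpha) / (labelTotal + alpha * vocabSize));
    }

    public static void main(String[] args) {
        double alpha = 1.0;      // assumed smoothing value, matching the default mentioned above
        int vocabSize = 4;       // hypothetical vocabulary size
        double labelTotal = 6.0; // hypothetical total term count for one label

        double seen = smoothedLogProb(3.0, labelTotal, alpha, vocabSize);   // log(4/10)
        double unseen = smoothedLogProb(0.0, labelTotal, alpha, vocabSize); // log(1/10)

        System.out.println("seen term:   " + seen);
        System.out.println("unseen term: " + unseen);
    }
}
```

With alpha set to 0 the unseen term would evaluate to log(0), which is negative infinity, so the smoothing term is what keeps the score finite.
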
Your modification will certainly work, and may in fact give better results for the problem that you're working on. You could also look at optimizing the Laplacian [2].

[1] http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
[2] http://www.stat.yale.edu/~lc436/papers/temp/Zhang_Oles_2001.pdf

Andy

> Date: Sun, 22 Jun 2014 00:41:51 +0900
> Subject: Naive Bayes Classifier Bug ?
> From: [email protected]
> To: [email protected]
>
> Hi Mahout,
>
> In Naive Bayes, I think that a term that does not exist in the training data
> should not affect the score.
> What do you think?
>
> org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier
>
> Before:
> protected double getScoreForLabelInstance(int label, Vector instance) {
>   double result = 0.0;
>   for (Element e : instance.nonZeroes()) {
>     result += e.get() * getScoreForLabelFeature(label, e.index());
>   }
>   return result;
> }
>
> After:
> protected double getScoreForLabelInstance(int label, Vector instance) {
>   double result = 0.0;
>   for (Element e : instance.nonZeroes()) {
>     int index = e.index();
>     double featureWeight = model.featureWeight(index);
>     // Skip terms that never occurred in the training data.
>     if (featureWeight != 0) {
>       result += e.get() * getScoreForLabelFeature(label, index);
>     }
>   }
>   return result;
> }
>
> Thanks,
> Toyoharu
