Once you learn the model, the scores should be roughly comparable for documents of the same length. If all of your documents are short, like the examples here, you can probably use the percentile rank of the score within a given category and document length as your measure of quality. Conditioning on document length may not even be necessary, but you should experiment with that; the rationale is that long documents really are less ambiguous, so normalizing that away may be unnecessary.
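As a minimal sketch of the percentile-rank idea: keep a history of scores previously assigned for each (category, length-bucket) pair, and reject a new match whose score falls in the low tail of that distribution. The score histories, bucket names, and cutoff below are all hypothetical, just to illustrate the mechanics.

```python
import bisect

def percentile_rank(history, score):
    """Fraction of historical scores less than or equal to `score`."""
    ranked = sorted(history)
    return bisect.bisect_right(ranked, score) / len(ranked)

# Hypothetical score histories, collected from earlier classifications,
# keyed by (category, document-length bucket).
history = {
    ("Fruit", "short"): [2.1, 3.0, 8.5, 9.7, 10.39, 11.2],
    ("Laptop", "short"): [5.0, 7.3, 12.33, 14.0],
}

def is_confident(category, length_bucket, score, cutoff=0.5):
    """Keep a match only if its score reaches the given percentile
    within its own category/length distribution."""
    return percentile_rank(history[(category, length_bucket)], score) >= cutoff

# "apple, red" -> Fruit at 10.39 ranks high within Fruit/short and is kept;
# "red" -> Fruit at 3.444 ranks low within the same distribution and is weeded out.
```

The point is that the raw cbayes scores are only compared against other scores for the same category and similar document length, never across categories.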
On Thu, Dec 16, 2010 at 12:36 PM, Sumedh Mungee <[email protected]> wrote:
> Hi,
>
> I read that the score reported by the cbayes classifier is not a probability
> and is only useful for relative ranking, but is there a way to compare or
> normalize scores across classifications?
>
> Basically I'm looking for a way to weed out the low-probability matches..
>
> For instance, if I get the following classifications:
> "apple, red" --> Fruit, Score == 10.39
> "apple, white" --> Laptop, Score == 12.33
> "red" --> Fruit, Score == 3.444
>
> I want to be able to weed out the last "red" --> Fruit classification,
> because the score is "too low".
>
> Hope my question makes sense.
>
> (First post here. Wonderful work by the Mahout team!)
>
> Thanks!
>
> ~sumedh
> (Mahout 0.4; 4.5 million documents; 200+ labels)
