Hi Ted,

Thanks! I will experiment with the percentile rank idea and post the results here if I find anything interesting.
~sumedh

On Thu, Dec 16, 2010 at 4:46 PM, Ted Dunning <[email protected]> wrote:
> Once you learn the model, the scores should be roughly comparable for
> documents of the same length. If you have all short docs like your examples
> here, you can probably use percentile rank for the score for a particular
> category and document length as your measure of quality. The conditioning
> on document length may also not be necessary, but you should experiment
> with that. The rationale for that last is that long documents really are
> less ambiguous, so normalizing that away may be unnecessary.
>
> On Thu, Dec 16, 2010 at 12:36 PM, Sumedh Mungee <[email protected]> wrote:
>
> > Hi,
> >
> > I read that the score reported by the cbayes classifier is not a
> > probability and is only useful for relative ranking, but is there a way
> > to compare or normalize scores across classifications?
> >
> > Basically I'm looking for a way to weed out the low-probability matches.
> >
> > For instance, if I get the following classifications:
> > "apple, red" --> Fruit, Score == 10.39
> > "apple, white" --> Laptop, Score == 12.33
> > "red" --> Fruit, Score == 3.444
> >
> > I want to be able to weed out the last "red" --> Fruit classification,
> > because the score is "too low".
> >
> > Hope my question makes sense.
> >
> > (First post here. Wonderful work by the Mahout team!)
> >
> > Thanks!
> >
> > ~sumedh
> > (Mahout 0.4; 4.5 million documents; 200+ labels)
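[For later readers of the archive: a minimal sketch of the percentile-rank idea Ted describes, in Python rather than Mahout's Java API. The PercentileNormalizer class, the reference scores, and all names here are hypothetical illustrations, not anything from Mahout -- the idea is just to collect the classifier's scores per category on a reference set, then map a new score to its rank within that category's distribution.]

```python
# Sketch of per-category percentile-rank normalization (hypothetical,
# not a Mahout API). Collect reference scores per category, sort them,
# then rank a new score against that category's distribution.
from bisect import bisect_right
from collections import defaultdict

class PercentileNormalizer:
    def __init__(self):
        self._scores = defaultdict(list)  # category -> list of reference scores

    def add(self, category, score):
        """Record one reference score produced for this category."""
        self._scores[category].append(score)

    def freeze(self):
        """Sort each category's scores so ranking is O(log n)."""
        for scores in self._scores.values():
            scores.sort()

    def percentile_rank(self, category, score):
        """Fraction of reference scores for `category` that are <= score."""
        scores = self._scores[category]
        if not scores:
            return 0.0
        return bisect_right(scores, score) / len(scores)

# Usage with made-up "Fruit" reference scores:
norm = PercentileNormalizer()
for s in [3.0, 5.5, 8.2, 10.39, 11.0]:
    norm.add("Fruit", s)
norm.freeze()

print(norm.percentile_rank("Fruit", 10.39))  # 0.8 -> keep
print(norm.percentile_rank("Fruit", 3.444))  # 0.2 -> weed out as "too low"
```

As Ted notes, you could additionally bucket the reference scores by document length (e.g. keys of (category, length_bucket)) and rank within the matching bucket; whether that conditioning helps is an empirical question.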
