Once you learn the model, the scores should be roughly comparable for
documents of the same length.  If all of your documents are short, like the
examples here, you can probably use the percentile rank of the score within
a particular category and document length as your measure of quality.
Conditioning on document length may turn out to be unnecessary, but you
should experiment with that.  The rationale for that last point is that long
documents really are less ambiguous, so their higher scores reflect real
confidence, and normalizing that effect away may be unnecessary.
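To make the percentile-rank idea concrete, here is a minimal sketch in
Python.  The score histories and length buckets are hypothetical: in
practice you would collect them by running the classifier over a held-out
sample of your corpus, one distribution per (category, length-bucket) pair.

```python
from bisect import bisect_left

# Hypothetical historical scores per (category, length bucket),
# kept sorted so percentile lookup is a binary search.
history = {
    ("Fruit", "short"): sorted([2.1, 3.4, 5.0, 8.2, 10.4, 11.7]),
    ("Laptop", "short"): sorted([4.0, 6.5, 9.1, 12.3, 14.8]),
}

def percentile_rank(category, length_bucket, score):
    """Fraction of historical scores in this group that fall below `score`."""
    scores = history[(category, length_bucket)]
    return bisect_left(scores, score) / len(scores)

# A score like 3.444 for "red" -> Fruit lands in a low percentile of the
# Fruit/short distribution, so it is a candidate to weed out; 10.39 for
# "apple, red" lands much higher and would be kept.
low = percentile_rank("Fruit", "short", 3.444)
high = percentile_rank("Fruit", "short", 10.39)
print(low, high)
```

Thresholding on the percentile (e.g. drop anything below the 25th
percentile of its own group) then gives a cutoff that is comparable across
categories even though the raw scores are not.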

On Thu, Dec 16, 2010 at 12:36 PM, Sumedh Mungee <[email protected]> wrote:

> Hi,
>
> I read that the score reported by the cbayes classifier is not a
> probability
> and is only useful for relative ranking, but is there a way to compare or
> normalize scores across classifications?
>
> Basically I'm looking for a way to weed out the low-probability matches.
>
> For instance, if I get the following classifications:
> "apple, red" --> Fruit, Score == 10.39
> "apple, white" --> Laptop, Score == 12.33
> "red" --> Fruit, Score == 3.444
>
> I want to be able to weed out the last "red" --> Fruit classification,
> because the score is "too low".
>
> Hope my question makes sense.
>
> (First post here. Wonderful work by the Mahout team!)
>
> Thanks!
>
> ~sumedh
> (Mahout 0.4; 4.5 million documents; 200+ labels)
>