Peter (/Ted),

Yes, this is all answered in the framework already. You would never directly use the recommenders intended for data sets with ratings, as most don't make sense when all ratings are 1.0. You would use, for example, GenericBooleanPrefItemBasedRecommender, a variant on GenericItemBasedRecommender, which overloads the notion of "estimatePreference()" to still return a useful value.
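For example, something like this (an untested sketch; "links.csv" is a made-up file name, and FileDataModel reads "userID,itemID" lines with no rating column as boolean preferences):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class BooleanPrefExample {
  public static void main(String[] args) throws Exception {
    // "userID,itemID" lines, no rating column -> boolean preferences
    DataModel model = new FileDataModel(new File("links.csv"));
    // LogLikelihoodSimilarity ignores preference values, so it suits boolean data
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    Recommender recommender = new GenericBooleanPrefItemBasedRecommender(model, similarity);
    // estimatePreference() here is a sum of similarities to the user's items,
    // not a rating-weighted average, so ranking still works without ratings
    List<RecommendedItem> recs = recommender.recommend(123L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}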
There is already GenericRecommenderIRStatsEvaluator, which runs precision, recall, F-score and NDCG stats on a recommender. These are meaningful even without ratings, though of course things like RMSE aren't anymore. (This is all in Mahout in Action too, yes. A quick sketch of wiring it up is below the quoted thread.)

The output of a recommender or similarity metric isn't a probability in general, so you can't apply AUC in all cases; that's why it isn't implemented generally. However, yes, for the case of LogLikelihoodSimilarity you could manage to put that together.

On Tue, Apr 26, 2011 at 1:50 AM, Ted Dunning <[email protected]> wrote:
> If the recommendation will only produce binary output scores and you have
> actual held out user data, then you can still compute AUC. If you want to
> compute log-likelihood, you need to compute probabilities p_1 and p_2 that
> represent what the recommender *should* have said when it actually said 0
> or 1. You can adapt these to give optimum log-likelihood on one held out
> set and then get a real value for log-likelihood on another held out set.
>
> Precision, recall, and false positive rate are also possibly useful.
>
> If the engine has an internal threshold knob, you can build ROC curves and
> estimate AUC using averaging.
>
> But the question remains, why would you use such a recommendation engine?
>
> On Mon, Apr 25, 2011 at 5:28 PM, Peter Harrington <
> [email protected]> wrote:
>
> > Does anyone have a suggestion for how to evaluate a recommendation
> > engine that uses a binary rating system?
> > Usually the R scores (similarity score * rating of other items) are
> > normalized by dividing by the sum of all rated similarity scores. If I
> > do this for a binary scoring system I would get 1.0 for every item.
> >
> > Is there another normalization I can do to get a number between 0 and
> > 1.0? Should I just use precision and recall?
> >
> > Thanks for the help,
> > Peter Harrington
> >
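For reference, wiring up the IR stats evaluation looks roughly like this (untested sketch; evaluating at 10, letting the framework choose the relevance threshold, over 100% of the users):

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class IRStatsExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("links.csv"));
    RecommenderBuilder builder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        return new GenericBooleanPrefItemBasedRecommender(
            dataModel, new LogLikelihoodSimilarity(dataModel));
      }
    };
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    // at = 10 recommendations per user; CHOOSE_THRESHOLD lets the framework
    // pick the relevance cutoff per user; 1.0 = use all of the users
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println("Precision@10: " + stats.getPrecision());
    System.out.println("Recall@10:    " + stats.getRecall());
    System.out.println("F1@10:        " + stats.getF1Measure());
    System.out.println("NDCG@10:      " + stats.getNormalizedDiscountedCumulativeGain());
  }
}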
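And on Ted's AUC point: if you score a held-out set of items with estimatePreference() and know which of them the user actually interacted with, AUC is just the Mann-Whitney statistic over those scores. A toy, framework-independent illustration (all names and data here are made up):

public class AucSketch {
  // AUC as the Mann-Whitney statistic: the probability that a randomly
  // chosen relevant item is scored above a randomly chosen non-relevant one.
  // O(n^2) for clarity; a sort-based version is O(n log n).
  static double auc(double[] scores, boolean[] relevant) {
    long better = 0, ties = 0, pairs = 0;
    for (int i = 0; i < scores.length; i++) {
      if (!relevant[i]) continue;
      for (int j = 0; j < scores.length; j++) {
        if (relevant[j]) continue;
        pairs++;
        if (scores[i] > scores[j]) better++;
        else if (scores[i] == scores[j]) ties++;
      }
    }
    return (better + 0.5 * ties) / pairs;
  }

  public static void main(String[] args) {
    // Scores from estimatePreference() on held-out items; true means the
    // user actually interacted with that item in the held-out data
    double[] scores = {3.2, 1.1, 2.7, 0.4};
    boolean[] relevant = {true, false, true, false};
    System.out.println(auc(scores, relevant)); // 1.0: relevant items ranked first
  }
}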
