If the recommendation engine will only produce binary output scores and you
have actual held-out user data, then you can still compute AUC.  If you want
to compute log-likelihood, you need probabilities p_1 and p_2 that represent
what the recommender *should* have said when it actually said 0 or 1.  You
can calibrate these to give optimum log-likelihood on one held-out set and
then get a real value for log-likelihood on another held-out set.
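
To make the calibration concrete, here is a rough sketch in Python (the
variable names are made up; output 0 is mapped to probability p0 and output 1
to p1, the p_1 and p_2 above).  The empirical positive rate inside each
output group is the choice that maximizes log-likelihood on the tuning set:

import numpy as np

def calibrate(scores, labels):
    # Probability that the user actually liked the item, given that the
    # recommender said 0 or said 1; the per-group positive rate is the
    # maximum-likelihood estimate on the tuning set.
    scores, labels = np.asarray(scores), np.asarray(labels)
    p0 = labels[scores == 0].mean()
    p1 = labels[scores == 1].mean()
    return p0, p1

def avg_log_likelihood(scores, labels, p0, p1):
    # Average held-out log-likelihood under the calibrated probabilities.
    probs = np.where(np.asarray(scores) == 1, p1, p0)
    labels = np.asarray(labels)
    eps = 1e-12  # avoid log(0) when a group is perfectly pure
    return np.mean(labels * np.log(probs + eps) +
                   (1 - labels) * np.log(1 - probs + eps))

# p0, p1 = calibrate(tune_scores, tune_labels)                # first held-out set
# ll = avg_log_likelihood(test_scores, test_labels, p0, p1)   # second held-out set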

Precision, recall, and false positive rate are also potentially useful.
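
Those read straight off the confusion matrix; a small sketch, assuming 0/1
predictions against 0/1 held-out labels:

import numpy as np

def binary_metrics(scores, labels):
    s, y = np.asarray(scores), np.asarray(labels)
    tp = np.sum((s == 1) & (y == 1))
    fp = np.sum((s == 1) & (y == 0))
    fn = np.sum((s == 0) & (y == 1))
    tn = np.sum((s == 0) & (y == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)      # true positive rate
    fpr = fp / (fp + tn)         # false positive rate
    return precision, recall, fpr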

If the engine has an internal threshold knob, you can build ROC curves and
estimate AUC using averaging.
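
For instance, something along these lines would sweep a hypothetical
threshold parameter, collect (FPR, TPR) points, and integrate them with the
trapezoidal rule; averaging the resulting AUC over users or folds then gives
the overall estimate:

import numpy as np

def roc_auc(score_at, thresholds, labels):
    # score_at(t) is assumed to return the engine's 0/1 output at
    # threshold t; labels are the held-out 0/1 ground truth.
    y = np.asarray(labels)
    pts = [(0.0, 0.0), (1.0, 1.0)]     # trivial corners of the ROC curve
    for t in thresholds:
        s = np.asarray(score_at(t))
        tpr = np.sum((s == 1) & (y == 1)) / max(np.sum(y == 1), 1)
        fpr = np.sum((s == 1) & (y == 0)) / max(np.sum(y == 0), 1)
        pts.append((fpr, tpr))
    pts.sort()                         # order by FPR before integrating
    fprs, tprs = zip(*pts)
    return np.trapz(tprs, fprs), pts   # AUC estimate and ROC points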

But the question remains: why would you use such a recommendation engine?

On Mon, Apr 25, 2011 at 5:28 PM, Peter Harrington <
[email protected]> wrote:

> Does anyone have a suggestion for how to evaluate a recommendation engine
> that uses a binary rating system?
> Usually the R scores (similarity score * rating of other items) are
> normalized by dividing by the sum of all rated similarity scores.  If I do
> this for a binary scoring system I would get 1.0 for every item.
>
> Is there another normalization I can do to get a number between 0 and 1.0?
> Should I just use precision and recall?
>
> Thanks for the help,
> Peter Harrington
>
