The best tests really come from real users: A/B test different
recommenders and see which performs better. That's not always
practical, though.

The problem is that you don't even know what the best recommendations
are. Splitting the data by date is reasonable, but the most recent
items aren't necessarily the best-liked ones. Splitting by rating is
more reasonable on this point, but you still can't conclude that there
aren't better recommendations among the un-rated items.
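As a concrete illustration of the two hold-out strategies above, here is a minimal sketch in plain Python. The tuple layout `(user, item, rating, timestamp)` and the threshold/fraction parameters are my own assumptions, not anything prescribed by a particular library:

```python
import random
from collections import defaultdict

def split_by_date(ratings, cutoff):
    """Train on interactions before `cutoff`, test on the rest.

    `ratings` is a list of (user, item, rating, timestamp) tuples
    (an assumed layout for this sketch).
    """
    train = [r for r in ratings if r[3] < cutoff]
    test = [r for r in ratings if r[3] >= cutoff]
    return train, test

def split_by_rating(ratings, threshold=4.0, holdout_frac=0.5):
    """Hold out a fraction of each user's highest-rated items as the
    'best recommendations' the recommender is asked to recover."""
    by_user = defaultdict(list)
    for r in ratings:
        by_user[r[0]].append(r)
    train, test = [], []
    for user, rs in by_user.items():
        liked = [r for r in rs if r[2] >= threshold]
        random.shuffle(liked)
        k = int(len(liked) * holdout_frac)
        held = {id(r) for r in liked[:k]}
        for r in rs:
            (test if id(r) in held else train).append(r)
    return train, test
```

Either way, the held-out set is only a proxy for "best": it contains items the user happened to rate, not the full set of items they would have liked.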

Still, it ought to correlate. I think you will find precision/recall
are very low in most cases, often a few percent, and the result is
noisy. AUC tells you where all of those "best recommendations" in the
test set fell in the ranked list, rather than measuring only the top
N's performance. That tells you more, which I think is generally good.
However, it measures performance over the entire list of recs, when in
practice you're unlikely to use more than the top N.

Go ahead and use it, since there's not a lot better you can do in the
lab, but be aware of the issues.
