Now that I've actually read the contest page (useful, that is), I see that Track 2 does something like what we are talking about. It holds out some highly rated items as "good" recommendations and asks the system to distinguish them from random other unrated items.
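That formulation can be sketched in a few lines. This is only an illustration of the idea, not the exact Track 2 protocol (the function name, the 1:1 positive/negative mix, and the toy scorer are all assumptions for the sketch):

```python
import random

def track2_style_score(score, user, held_out_positives, unrated_pool, rng):
    """Sketch of a Track 2-style test: mix the user's held-out highly
    rated items with an equal number of randomly drawn unrated items,
    rank all of them by predicted preference, and report the fraction
    of the top half that are the truly rated items."""
    negatives = rng.sample(unrated_pool, len(held_out_positives))
    candidates = list(held_out_positives) + negatives
    ranked = sorted(candidates, key=lambda item: score(user, item), reverse=True)
    top = set(ranked[:len(held_out_positives)])
    return len(top & set(held_out_positives)) / len(held_out_positives)

# Toy demo: an oracle that knows the user's true preferences scores 1.0,
# while a random scorer would hover around 0.5 on average.
true_likes = {1, 2, 3}
oracle = lambda user, item: 1.0 if item in true_likes else 0.0
rng = random.Random(42)
print(track2_style_score(oracle, "u1", [1, 2, 3], list(range(100, 200)), rng))  # 1.0
```

The point is that a recommender is only asked to separate known-good items from items drawn at random, which sidesteps the question of whether the held-out items are the *best* possible recommendations.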
This Track 2 formulation mostly avoids the theoretical issues with a precision/recall-style test for top-k recommendations. To answer your interesting question here -- no, of course you do not have ratings for all items from all users. Users tend to rate things they really like, and really don't like. If you were to take out the top k ratings (and that's not quite what Track 2 does), the resulting training data would be systematically unlike real ratings. It's not the same as removing k random items. The effect on a recommender is probably negative, but perhaps only marginally so.

If a user rates items 1 to 10, and we hold out 6 to 10 from the training data, you can create a precision@5 test by seeing how much of 6 to 10 is recommended to the user. The problem is that we don't know 6-10 are the best recommendations -- they're probably good recommendations, but not necessarily the best ones. If there were an item 11 that the user actually would like more, and the recommender recommended it, it would be penalized in this test. That's the problem. It's not a meaningless test, but it has certain drawbacks. The Track 2 test formulation doesn't have this particular issue.

On Tue, Feb 15, 2011 at 6:34 PM, Chen_1st <[email protected]> wrote:
> Hi, Sean,
>
> Sorry for my poor English.
>
>>> Hmm, not sure I understand. No, it's not true that real-life data
>>> regularly omits the user's top ratings. Why would that be?
>
> In real-life applications, it's impossible for users to provide ratings
> for all their favorite tracks, right? It's the same effect as omitting
> some top-rated tracks.
>
>>> How would you score the recommendations by holding out a random
>>> subset? That subset is definitely *not* representative of good
>>> recommendations -- you might be picking out things the user hates.
>
> Consider the example: the top favorite tracks of the user are
> complete_set = {1, 2, ..., 10}, and the user only provides ratings on
> randomly_selected_subset = {1, 2, ..., 5}; here we assume the user
> randomly selected 5 tracks from the complete_set and rated them. Let the
> recommender system predict the top 5 tracks for the user: if it can
> correctly hit 3 in randomly_selected_subset, it's with high probability
> better than hitting only 1.
>
> The above illustrates how to apply recall@5. Precision and NDCG are
> similar.
>
> 2011/2/16 Sean Owen <[email protected]>
>
>> Hmm, not sure I understand. No, it's not true that real-life data
>> regularly omits the user's top ratings. Why would that be?
>>
>> How would you score the recommendations by holding out a random
>> subset? That subset is definitely *not* representative of good
>> recommendations -- you might be picking out things the user hates.
>>
>> Precision / recall don't really make sense unless you think you're
>> holding out "good" recommendations and those would have to be top
>> rated items.
>>
>> Sean
>>
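The precision@5 test described in this thread, and the "item 11" objection to it, can be made concrete with a small sketch (the `precision_at_k` helper here is illustrative, not from any particular library):

```python
def precision_at_k(recommended, held_out, k=5):
    """precision@k against a held-out set: the fraction of the top-k
    recommendations that appear among the held-out items."""
    return len(set(recommended[:k]) & set(held_out)) / k

# The user rated items 1..10; we hold out 6..10 and train on 1..5.
held_out = [6, 7, 8, 9, 10]

# A recommender that returns exactly the held-out items scores 1.0 ...
print(precision_at_k([6, 7, 8, 9, 10], held_out))  # 1.0

# ... while one that swaps in item 11 -- which the user might actually
# prefer -- is penalized, even though its recommendations could be better.
print(precision_at_k([6, 7, 8, 9, 11], held_out))  # 0.8
```

The second call shows the drawback being discussed: the metric treats the held-out items as the only correct answers, so a genuinely better recommendation outside that set lowers the score.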
