On Jun 21, 2012, at 4:48 PM, Sean Owen wrote:

> On Thu, Jun 21, 2012 at 9:01 PM, Nimrod Priell <[email protected]>
> wrote:
>> On a completely different subject: I wrote a simple
>> RelevantItemsDataSplitter and RecommenderIRStatsEvaluator which take a list
>> of item IDs, and run CF evaluation by hiding items only out of that list,
>> and asking to recommend only out of that list of items (precision and recall
>> are then also calculated only with that list of items as the universe).
>
> Sure, if you know what the 'right answers' are more specifically in
> your use case, you can and should use that in the test. That's what
> the splitter class is for and that's what you did, yes.
>
> The more important thing of course is to implement this in your actual
> recommender! you can use a Rescorer to penalize popular items, if
> that's what you believe improves the result quality.
>
In my real-time context, the recommender will never be asked to recommend items outside of the specified set, hence I want to evaluate only on that set. The reference to popularity was a little confusing, actually: it just so happens that the items in my set are less popular, and that is how I realized I should do the scoring differently. I didn't see any improvement from UBCF until I compared the results on these less popular items, rather than on items hidden at random, for which the popularity recommender does best.

Cool. I'll try to look into submitting a patch this weekend; maybe others could gain from this.

>
>> I realize an alternative to the example I proposed with the popularity is
>> looking at the top-n recommendation for large n because only relatively few
>> items are very popular so the precision-recall stats based on popularity
>> become less skewed; But I still think it's a useful constraint for
>> evaluation.
>
> You mean you want to use as a large "at" value in your test? This
> tends to increase recall but decrease precision. I don't know if it
> (necessarily) fixes something in this regard.

What I do is scatter-plot top-n for every n (say, top-1, top-5, top-10, ...) in precision-recall space (or just compare F1 across several values of n). For small n, user-based CF is comparable to, or even worse than, "most popular" recommendation (in my test). However, as n becomes larger, the "popular" recommender, which has no personalization, starts missing compared to UBCF, so its curve sits lower than UBCF's the farther out I go in recommendations. For an example (where I took this idea from), see the plot at the top of page 22 of http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf , a guide to the R recommenderlab package.

My rationalization for this was that because 80% of the users have the most popular item, it is picked somewhat more often to be the hidden item, and it's very easy for the popular recommender to guess it right.
When I ask the recommender for many items, the effect of the popular ones dampens. Does that make sense?

Best,
Nimrod
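
P.S. As a rough illustration of the restricted evaluation described above, here is a small sketch in plain Python. It is not the Mahout classes mentioned in this thread; every function name and the toy data are made up for the example. It assumes held-out items per user, a "most popular" baseline that only ranks items from the allowed set, and precision and recall computed with that set as the universe, swept over several values of n.

```python
from collections import Counter


def make_popular_recommender(train, allowed):
    """Build a non-personalized baseline (hypothetical helper, not a Mahout API).

    train maps user -> set of item IDs; only items in `allowed` are ranked.
    """
    counts = Counter(
        item for items in train.values() for item in items if item in allowed
    )
    # Descending popularity; ties broken by item ID so the ranking is deterministic.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))

    def recommend(user, n):
        seen = train.get(user, set())
        # Skip items the user already has in the training data.
        return [item for item in ranked if item not in seen][:n]

    return recommend


def precision_recall_at_n(recommend, hidden, allowed, n):
    """Mean precision and recall at n, using `allowed` as the item universe.

    hidden maps user -> set of held-out items. Users with no held-out item
    inside the universe are skipped.
    """
    precisions, recalls = [], []
    for user, held_out in hidden.items():
        relevant = held_out & allowed
        if not relevant:
            continue
        # Drop anything outside the universe in case the recommender ignores it.
        top = [item for item in recommend(user, n) if item in allowed][:n]
        hits = len(relevant.intersection(top))
        precisions.append(hits / n)
        recalls.append(hits / len(relevant))
    if not precisions:
        return 0.0, 0.0
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


def sweep(recommend, hidden, allowed, ns):
    """Precision/recall pairs for several n, ready to scatter-plot."""
    return {n: precision_recall_at_n(recommend, hidden, allowed, n) for n in ns}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    train = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"b", "d"}, "u4": {"a", "d"}}
    hidden = {"u1": {"c"}, "u2": {"d"}, "u3": {"a"}}
    allowed = {"a", "b", "c", "d"}
    popular = make_popular_recommender(train, allowed)
    for n, (p, r) in sweep(popular, hidden, allowed, [1, 2]).items():
        print(n, round(p, 3), round(r, 3))
```

Any other recommender can be passed in place of the baseline as long as it takes (user, n) and returns a ranked list, so the two curves can be compared point by point across n.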
