I think that these statements are subject to substantial debate. Creating a metric that captures the important characteristics isn't that hard. For instance, you can simply measure the estimation error on the items that appear in the top 50 of either the reference or the actual recommendations. Likewise, you can use a truncated rank metric that measures the ranks at which reference items appear. Neither of these measures is difficult to implement, and both are better than mean squared error or other measurements that require all items to be scored.
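The two measures above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are invented, I use mean absolute error for the estimation error, and unseen reference items are truncated to rank k+1.

```python
def top_k_estimation_error(predicted, reference, k=50):
    """Mean absolute estimation error, restricted to items appearing in
    the top k of either the reference or the actual recommendations.

    predicted, reference: dicts mapping item -> score.
    """
    top = lambda scores: sorted(scores, key=scores.get, reverse=True)[:k]
    items = set(top(predicted)) | set(top(reference))
    # Missing scores are treated as 0.0; another convention would also work.
    return sum(abs(predicted.get(i, 0.0) - reference.get(i, 0.0))
               for i in items) / len(items)


def truncated_rank(predicted, reference_items, k=50):
    """Average 1-based rank at which reference items appear in the
    predicted top-k list; items outside the top k count as rank k+1."""
    order = sorted(predicted, key=predicted.get, reverse=True)[:k]
    pos = {item: r + 1 for r, item in enumerate(order)}
    return sum(pos.get(i, k + 1) for i in reference_items) / len(reference_items)
```

The point is simply that both measures score only the items near the top of one list or the other, so you never need the algorithm to produce scores for the whole catalog.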
Capturing the diversity measurement *is* really difficult, and the only way that I know to do that is to measure a real system on real users. Beyond the limits of a fairly simple recommendation system, UI factors such as how you present previously recommended items, how many items are shown, and what expectations you set with users will make far more difference than small algorithmic changes. Indeed, in a real system subject to the circular feedback that such systems are prone to, a system that degrades recommendation results with random noise can in fact provide a better user experience than one which presents unadulterated results.

On Fri, Aug 13, 2010 at 9:06 PM, Yanir Seroussi <yanir.serou...@gmail.com> wrote:

> I still think that it's worthwhile to compare rating errors of different
> algorithms on all items, because in general it is likely that the more
> accurate algorithm will generate better recommendations. Comparing the
> actual generated recommendations is not as simple as comparing rating
> errors, though it is very important. As Ted said, you need to worry about
> the ordering of the top few items, and you typically don't want all the
> recommended items to be too similar to each other or too obvious. However,
> capturing all these qualities in a single metric is quite hard, if not
> impossible.