I think that these statements are subject to substantial debate.

Regarding the creation of a metric that captures important characteristics,
it isn't that hard.  For instance, you can simply measure the estimation
error on the items that appear in the top 50 of either reference or actual
recommendations.  Likewise, you can use a truncated rank metric that
measures the ranks at which reference items appear.  Neither of these
measures is difficult to implement and both are better than mean squared
error or other measurements that require all items to be scored.
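
To make that a bit more concrete, here is a rough Python sketch of both
ideas.  The function names, the use of mean absolute error, and the default
cutoffs are just illustrative assumptions, not a prescription for any
particular implementation:

def top_k_estimation_error(reference_scores, predicted_scores, k=50):
    # Mean absolute estimation error restricted to items that appear in
    # the top-k of either the reference or the predicted ranking.
    # (Hypothetical sketch; inputs are item -> score dicts.)
    top_ref = sorted(reference_scores, key=reference_scores.get,
                     reverse=True)[:k]
    top_pred = sorted(predicted_scores, key=predicted_scores.get,
                      reverse=True)[:k]
    items = set(top_ref) | set(top_pred)
    return sum(abs(reference_scores.get(i, 0.0) - predicted_scores.get(i, 0.0))
               for i in items) / len(items)

def truncated_rank(reference_items, predicted_ranking, max_rank=50):
    # Average rank at which reference items appear in the predicted
    # ranking, truncated at max_rank so missing items don't dominate.
    # Lower is better.
    rank_of = {item: r for r, item in enumerate(predicted_ranking, start=1)}
    ranks = [min(rank_of.get(item, max_rank), max_rank)
             for item in reference_items]
    return sum(ranks) / len(ranks)

Neither requires scoring the full catalog, which is the point.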

Capturing diversity in a measurement *is* really difficult, and the only way
that I know to do that is to measure a real system on real users.  Beyond
the limits of a fairly simple recommendation system, UI factors such as how
you present previously recommended items, how many items are shown, and what
expectations you set with users will make far more difference than small
algorithmic changes.  In fact, in a real system that is subject to the
circular feedback that such systems are prone to, one that degrades
recommendation results with random noise can actually provide a better user
experience than one that presents unadulterated results.
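
One concrete way to inject that kind of random noise is to dither the
ranking: reorder by log(rank) plus a little Gaussian noise, so the top items
mostly stay put while deeper items occasionally surface and get a chance to
feed back into the training data.  A rough sketch (the function name and the
epsilon parameter are made up for illustration):

import math
import random

def dither(ranked_items, epsilon=0.5, seed=None):
    # Re-rank by log(rank) + Gaussian noise.  Small epsilon mostly
    # preserves the original order; larger epsilon mixes deeper items
    # toward the top more often.
    rng = random.Random(seed)
    noisy = [(math.log(rank) + rng.gauss(0.0, epsilon), item)
             for rank, item in enumerate(ranked_items, start=1)]
    return [item for _, item in sorted(noisy, key=lambda t: t[0])]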

On Fri, Aug 13, 2010 at 9:06 PM, Yanir Seroussi <yanir.serou...@gmail.com> wrote:

> I still think that it's worthwhile to compare rating errors of different
> algorithms on all items, because in general it is likely that the more
> accurate algorithm will generate better recommendations. Comparing the
> actual generated recommendations is not as simple as comparing rating
> errors, though it is very important. As Ted said, you need to worry about
> the ordering of the top few items, and you typically don't want all the
> recommended items to be too similar to each other or too obvious. However,
> capturing all these qualities in a single metric is quite hard, if not
> impossible.
>

Reply via email to