What I was trying to say is that creating a metric that captures *all* the
desired characteristics in a single score is hard.

I agree that it's not hard to come up with metrics that measure the quality
of a recommender's top-K recommendations, but I think that's part of the
problem -- there are many such metrics out there and, as far as I know, no
consensus about which one is "best", and that makes it hard to compare
recommendation algorithms. However, I still think that measuring the
MAE/RMSE on all the available test item/user preferences is worthwhile, and
can give an indication of the overall performance of the recommender.
Evidence for this is given in Koren's "Factorization Meets the Neighborhood"
(http://research.yahoo.com/files/kdd08koren.pdf). He reported results in
terms of RMSE, then evaluated the top-K recommendations (using a metric of
his own...), and found a strong link between small improvements in RMSE and
large improvements in top-K recommendation quality:

"We are encouraged, even somewhat surprised, by the results. It is evident
that small improvements in RMSE translate into significant improvements in
quality of the top K products. In fact, based on RMSE differences, we did
not expect the integrated model to deliver such an emphasized improvement in
the test."
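
To make the comparison concrete, here's a toy sketch contrasting RMSE
computed over all test ratings with a simple precision-at-K measure on the
same predictions. All the data, names and the relevance threshold are
purely illustrative, not taken from Koren's paper or any particular
implementation:

```python
import math

# (user, item) -> true rating and predicted rating; illustrative values only
actual = {("u1", "a"): 5, ("u1", "b"): 3, ("u1", "c"): 1, ("u1", "d"): 4}
predicted = {("u1", "a"): 4.5, ("u1", "b"): 2.8, ("u1", "c"): 1.7, ("u1", "d"): 3.9}

def rmse(actual, predicted):
    # Error over *all* test preferences, regardless of rank.
    errors = [(actual[k] - predicted[k]) ** 2 for k in actual]
    return math.sqrt(sum(errors) / len(errors))

def precision_at_k(actual, predicted, k, relevant_threshold=4):
    # Rank items by predicted score, take the top K, and count how many
    # are actually relevant (true rating >= threshold).
    top_k = sorted(predicted, key=predicted.get, reverse=True)[:k]
    hits = sum(1 for key in top_k if actual[key] >= relevant_threshold)
    return hits / k

print(rmse(actual, predicted))                 # overall rating error
print(precision_at_k(actual, predicted, k=2))  # top-K quality
```

The point is that the two measures answer different questions -- the first
scores every prediction, the second only cares about what ends up at the
top of the list -- yet Koren's observation is that improving the first
tends to improve the second.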

I agree that the overall performance of the system is probably affected
more by UI factors than by the accuracy of the underlying recommender, but
these two aspects can be developed separately. Given the choice between an
algorithm that magically manages to always generate diverse, novel and
accurate recommendations and one that sometimes fails to do so, the former
is the obvious pick. So while bad UI design choices may render a perfect
recommendation algorithm useless, a good UI is likely to yield greater user
satisfaction when a better recommendation algorithm is behind it, and
better may sometimes mean more accurate, all other things being equal.

On Sat, Aug 14, 2010 at 16:30, Ted Dunning <[email protected]> wrote:

> I think that these statements are subject to substantial debate.
>
> Regarding the creation of a metric that captures important characteristics,
> it isn't that hard.  For instance, you can simply measure the estimation
> error on the items that appear in the top 50 of either reference or actual
> recommendations.  Likewise, you can use a truncated rank metric that
> measures the ranks at which reference items appear.  Neither of these
> measures is difficult to implement and both are better than mean squared
> error or other measurements that require all items to be scored.
>
> Capturing the diversity measurement *is* really difficult and the only way
> that I know to do that is to measure a real system on real users.  Beyond
> the limits of a fairly simple recommendation system, UI factors such as how
> you present previously recommended items, how many items are shown and what
> expectations you set with users will make far more difference than small
> algorithmic changes.  In fact, in a real system that is subject to the
> circular feedback that such systems are prone to, a system that degrades
> recommendation results with random noise can, in fact, provide a better
> user
> experience than one which presents unadulterated results.
>
> On Fri, Aug 13, 2010 at 9:06 PM, Yanir Seroussi <[email protected]>
> wrote:
>
> > I still think that it's worthwhile to compare rating errors of different
> > algorithms on all items, because in general it is likely that the more
> > accurate algorithm will generate better recommendations. Comparing the
> > actual generated recommendations is not as simple as comparing rating
> > errors, though it is very important. As Ted said, you need to worry about
> > the ordering of the top few items, and you typically don't want all the
> > recommended items to be too similar to each other or too obvious.
> However,
> > capturing all these qualities in a single metric is quite hard, if not
> > impossible.
> >
>
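
P.S. For reference, the truncated rank metric Ted describes above could be
sketched roughly like this -- record the rank at which each held-out
reference item appears in the recommendation list, capped at a cutoff so
that items deep in the list (or absent from it) don't dominate. This is
just my reading of the idea; the function name, the cutoff handling and
the data are my own illustrative assumptions:

```python
def truncated_rank(recommended, reference, cutoff=50):
    # Average rank (1-based) of reference items within the recommendation
    # list; items ranked beyond the cutoff, or missing entirely, count as
    # cutoff + 1. Lower is better.
    positions = {item: i + 1 for i, item in enumerate(recommended[:cutoff])}
    ranks = [positions.get(item, cutoff + 1) for item in reference]
    return sum(ranks) / len(ranks)

recommended = ["a", "b", "c", "d", "e"]
reference = ["b", "e", "z"]  # "z" never appears in the list
print(truncated_rank(recommended, reference, cutoff=5))
```

As Ted says, nothing here requires scoring all items -- only the top of
the list matters.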