To do a decent job of off-line evaluation of a system, you need to stratify
against several conditions:

a.1) users that are part of the training

a.2) users that arrive after training

b.1) users with lots of ratings (available to the recommender)

b.2) users with few ratings (available to the recommender)

b.3) cold-start users

I always prefer to segment training and test data by time, but this is not
always possible.
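
As a rough sketch (my own, not from any particular library), a time-based
split plus the stratification above might look like this in Python, assuming
a MovieLens-style ratings table with userId, itemId, rating and timestamp
columns and an arbitrary threshold for what counts as "few" ratings:

# Rough sketch: split ratings by time, then tag each held-out user with one
# of the strata above.  Column names and the few-ratings threshold are
# assumptions, not part of any standard API.
import pandas as pd

def time_split(ratings, train_fraction=0.8):
    # Everything rated before the cutoff timestamp is training data,
    # everything after is test data.
    cutoff = ratings["timestamp"].quantile(train_fraction)
    train = ratings[ratings["timestamp"] <= cutoff]
    test = ratings[ratings["timestamp"] > cutoff]
    return train, test

def tag_test_users(train, test, few_threshold=5):
    # Number of ratings the recommender actually gets to see per user.
    counts = train.groupby("userId").size()
    def stratum(user):
        n = counts.get(user, 0)
        if n == 0:
            return "cold-start"        # arrived after training (a.2 / b.3)
        if n < few_threshold:
            return "few-ratings"       # b.2
        return "many-ratings"          # b.1
    return test.assign(stratum=test["userId"].map(stratum))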

As far as metrics are concerned, I dislike RMSE and precision/recall for
evaluating realistic recommenders.  They are useful for reference to the
academic literature, but they are not good indicators of practical success.
I prefer precision@20 or some similar first-page metric.
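
For what it's worth, a minimal sketch of precision@k is below; it assumes
`recommended` is an ordered list of item ids from the recommender and
`relevant` is the set of held-out items the user actually liked (those
representations are my assumptions, not anything standard):

# Minimal sketch of precision@k: the fraction of the first k recommended
# items that also appear in the user's held-out liked items.
def precision_at_k(recommended, relevant, k=20):
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

# Example: precision_at_k(["m1", "m7", "m3"], {"m7", "m9"}, k=20) -> 1/3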

For a real system, the biggest impacts on performance that I have seen are
due to:

a) better diversity on the results page to improve the portfolio effect.

b) dithering so that results appear dynamic and so that second-page results
show up on the first page.  This is especially important after the user has
exhausted the good recommendations on the first page.  (There is a sketch of
one dithering approach below.)

c) fast response of the system to real-world trends.

None of these are captured by the off-line metrics you are suggesting.
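
To make the dithering point concrete, here is one common way to do it (a
sketch, not necessarily what you should deploy): re-sort each result list by
log(rank) plus Gaussian noise, so the top few results mostly stay put while
items from deeper in the list occasionally surface on the first page.

# Sketch of rank-based dithering: re-sort by log(rank) + Gaussian noise.
# The noise level 0.7 is an arbitrary assumption; smaller values keep the
# first page nearly stable, larger values pull more second-page items up.
import math
import random

def dither(results, epsilon=0.7):
    # `results` is an ordered list of recommendations, best first.
    def noisy_rank(rank):
        return math.log(rank + 1) + random.gauss(0, epsilon)
    keyed = [(noisy_rank(rank), item) for rank, item in enumerate(results)]
    return [item for _, item in sorted(keyed, key=lambda pair: pair[0])]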

On Thu, Sep 8, 2011 at 8:56 AM, James James <[email protected]> wrote:

> Hi,
>
>
> I've got a question regarding how to split data (e.g. MovieLens) into
> training and testing data when I want to test the performance of a CF-based
> recommender. In particular, I want to focus on metrics including RMSE,
> precision and recall (for precision and recall, we convert any rating
> higher than 3 to LIKE and anything else to DISLIKE). If, for each user, we
> randomly split his data by a ratio of 8:2 (80% for training and 20% for
> testing), then we may end up with a scenario where some of the items (e.g.
> movies) in the test data fail to appear in the training data. Due to the
> cold-start item issue, the CF-based recommender will not be able to predict
> a rating for such items. However, this is not an issue for a content-based
> recommender, which is able to predict a rating for any item.
>
>
> I was wondering how people usually go about this issue when they want to
> compare the performance of a CF-based recommender and a content-based
> recommender on metrics such as RMSE, precision and recall. Do they simply
> eliminate these items (in the test data but not in the training data) from
> the evaluation of the CF-based recommender, or do they have to make sure
> that each item appears in both the training and test data so that CF can
> make a prediction on every item in the test data?
>
>
> Thanks,
>
> James
