> The effect of downweighting the popular items is very similar to
> removing them from recommendations so I still suspect precision will
> go down using IDF. Obviously this can pretty easily be tested, I just
> wondered if anyone had already done it.
>
> This brings up a problem with holdout based precision. It measures
> the value of a model trained on a training set in predicting
> something that is in the holdout set. This may or may not correlate
> with the model's actual effect on user behavior.

Indeed. The problem with holdout sets is that they only indicate what users did with certain items. There's no way to know what they would have done with items they were not exposed to.
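
To make that blind spot concrete, here's a minimal sketch of how I'd compute holdout precision@k over purchase data (the function and data shapes are just illustrative, not from any particular library): an item the user was never exposed to can't appear in the holdout, so recommending it always counts as a miss.

from typing import Dict, List, Set

def precision_at_k(recommended: Dict[str, List[str]],
                   held_out: Dict[str, Set[str]],
                   k: int = 10) -> float:
    """Average fraction of the top-k recommendations that appear in each
    user's held-out purchases.

    The blind spot: an item the user never saw cannot be in the holdout
    set, so recommending it can only lower this score."""
    scores = []
    for user, recs in recommended.items():
        relevant = held_out.get(user, set())
        if not relevant:
            continue  # skip users with no held-out purchases
        hits = sum(1 for item in recs[:k] if item in relevant)
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0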

>
> To use purchases as preference indicators, a precision metric would
> measure how well purchases in the training set predicted purchases in
> the test set. If IDF lowers precision, it may also affect user
> behavior strongly by recommending non-obvious (non-inevitable)
> items.

It's also a strategic decision: whether you want to use recommendations to reinforce the "long tail" of your catalog or go with the sure thing.
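
For reference, here's a minimal sketch of the kind of IDF-style item weighting I understand is being discussed (the exact log form and the names are my assumption; schemes vary): items bought by nearly everyone end up with weights near zero, which is why downweighting them behaves so much like removing them.

import math
from typing import Dict, Iterable, Set, Tuple

def idf_weights(purchases: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """IDF-style weight per item: log(n_users / n_users_who_bought_item).

    Ubiquitous items get weights near zero (much like dropping them from
    scoring), while long-tail items are boosted."""
    users: Set[str] = set()
    buyers_per_item: Dict[str, Set[str]] = {}
    for user, item in purchases:
        users.add(user)
        buyers_per_item.setdefault(item, set()).add(user)
    n_users = len(users)
    return {item: math.log(n_users / len(buyers))
            for item, buyers in buyers_per_item.items()}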



This effect on user behavior AFAIK can't be measured from holdout
tests. I worry that precision-related measures may point us in the
wrong direction. Are A/B tests our only reliable metric for questions
like this?


I'm afraid I agree: A/B testing is the only truly valid proof that one
recommender configuration is better than another.

And even A/B testing may point us in the wrong direction. Say we find a configuration that measurably improves sales at a sufficient significance level; by the standard of an experimental A/B test, the Holy Grail of measures, that configuration is the best one. But what if our ultimate goal is customer retention? Maybe those short-term recommendations of, say, very popular items (because we're not using the IDF weights) are capturing sales we would have had anyway, while adding no perceived value and doing nothing for client loyalty. So in the long term we'll increase churn, because our recommendations do not differentiate us.

Life and business are complicated :-)
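
On the mechanics of "enough significance level": here's a minimal sketch of a two-proportion z-test on per-arm conversion counts, which is one common way to read an A/B result (the helper is illustrative). It only speaks to the metric you feed it, so the retention/churn concern above stays invisible to it.

import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference in conversion rate between arms A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# |z| > 1.96 is roughly p < 0.05, two-sided: significant for sales,
# but silent about long-term loyalty or churn.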

As for offline metrics, I consider them a hint that can help prune the space of possible recommender configurations. But discarding one system in favour of another based only on precision is risky; the difference would need to be more than merely significant.
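
To illustrate pruning rather than picking a winner, a small sketch (the margin and helper name are made up for the example): keep every configuration whose offline precision is within a margin of the best, and let A/B testing settle the near-ties.

from typing import Dict, List

def prune_configs(precision_by_config: Dict[str, float],
                  margin: float = 0.05) -> List[str]:
    """Keep every configuration whose offline precision is within `margin`
    of the best; only clearly weaker configurations are discarded offline."""
    best = max(precision_by_config.values())
    return [cfg for cfg, score in precision_by_config.items()
            if best - score <= margin]

# Example: "idf" survives a small precision drop and still goes to A/B testing.
print(prune_configs({"baseline": 0.21, "idf": 0.19, "random": 0.04}))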

