> The effect of downweighting the popular items is very similar to
> removing them from recommendations so I still suspect precision will
> go down using IDF. Obviously this can pretty easily be tested, I just
> wondered if anyone had already done it.
>
> This brings up a problem with holdout based precision. It measures
> the value of a model trained on a training set in predicting
> something that is in the holdout set. This may or may not correlate
> with affecting user behavior.
Indeed. The problem with holdout sets is that they only indicate what
users did with certain items. There's no way to know what they would
have done with items they were not exposed to.
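
To make the holdout-precision idea concrete, here is a rough sketch of
precision@k against held-out purchases. The data and the toy most-popular
recommender are purely hypothetical; a real evaluation would plug in the
actual recommender being compared.

    # Minimal sketch of holdout-based precision@k over purchase data.
    # Toy data and toy recommender, for illustration only.
    from collections import Counter

    train = {  # user -> items purchased in the training period
        "u1": {"a", "b", "c"},
        "u2": {"a", "c"},
        "u3": {"b", "d"},
    }
    test = {   # user -> items purchased in the held-out period
        "u1": {"d"},
        "u2": {"b", "e"},
        "u3": {"a"},
    }

    def recommend_most_popular(user, k):
        """Toy recommender: most purchased items the user hasn't bought yet."""
        popularity = Counter(i for items in train.values() for i in items)
        candidates = [i for i, _ in popularity.most_common()
                      if i not in train[user]]
        return candidates[:k]

    def precision_at_k(k):
        """Fraction of recommended items found in the user's holdout purchases."""
        scores = []
        for user, held_out in test.items():
            recs = recommend_most_popular(user, k)
            if recs:
                scores.append(len(set(recs) & held_out) / len(recs))
        return sum(scores) / len(scores)

    print(precision_at_k(2))

Note that an item the user was never exposed to can only show up as a
miss here, which is exactly the blind spot described above.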
>
> To use purchases as preference indicators, a precision metric would
> measure how well purchases in the training set predicted purchases in
> the test set. If IDF lowers precision, it may also affect user
> behavior strongly by recommending non-obvious (non-inevitable)
> items.
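
For reference, this is roughly what I understand by the IDF weighting
being discussed; the raw scores and the exact scheme (weight = log(N / n_i),
with n_i the number of users who bought item i) are just my assumption of
one common variant, not necessarily what the original poster has in mind.

    # One common way to apply IDF-style downweighting to item scores.
    import math

    user_items = {          # user -> purchased items (toy data)
        "u1": {"a", "b"},
        "u2": {"a", "c"},
        "u3": {"a", "d"},
        "u4": {"b", "d"},
    }
    num_users = len(user_items)

    def idf(item):
        """log(N / n_i): popular items (large n_i) get weights near zero."""
        n_i = sum(1 for items in user_items.values() if item in items)
        return math.log(num_users / n_i)

    raw_scores = {"a": 0.9, "b": 0.6, "d": 0.5}   # hypothetical recommender output
    weighted = {i: s * idf(i) for i, s in raw_scores.items()}

    # Item "a" (bought by 3 of 4 users) drops to the bottom of the ranking,
    # which is why the effect looks so much like removing popular items.
    print(sorted(weighted.items(), key=lambda kv: -kv[1]))

Whether that reshuffling actually hurts holdout precision is the
empirical question raised above.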
It's also a strategic decision: whether you want to use recommendations
to reinforce the "long tail" of your catalog or go with the sure thing.
This effect on user behavior AFAIK can't be measured from holdout
tests. I worry that precision-related measures may point us in the
wrong direction. Are A/B tests our only reliable metric for questions
like this?
I'm afraid I agree: A/B testing is the only truly valid proof that one
recommender configuration is better than another.
And even A/B testing may point us in the wrong direction. Say we find a
configuration that shows measurably better sales at a sufficient
significance level; that configuration is then the best one according to
an experimental A/B test, i.e. the Holy Grail of measures. But what if our
ultimate goal is customer retention? Maybe those short-term
recommendations of, say, very popular items (because we're not using the
IDF weights) are capturing sales we would have had anyway, while doing
nothing for customer loyalty because there's no perceived added value. So
in the long term we'll increase churn because our recommendations do not
differentiate us.
Life and business are complicated :-)
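
For what it's worth, the "sufficient significance level" part usually
comes down to something like a two-proportion test on conversion rate.
A sketch with invented counts (the numbers and the 5% threshold are
placeholders, not real data):

    # Rough sketch of an A/B significance check on purchase (conversion) rate.
    # It says nothing about retention or churn, which is exactly the problem.
    import math

    def z_test(conv_a, n_a, conv_b, n_b):
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
        se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
        return (p_b - p_a) / se                        # z-score of the uplift

    z = z_test(conv_a=480, n_a=10000, conv_b=560, n_b=10000)
    print(z)   # |z| > 1.96 would be "significant" at the 5% level
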
As for offline metrics, I consider them a hint that can help in pruning
the space of possible recommender configurations. But discarding one
system in favour of another based only on precision is risky; the
difference would need to be more than merely significant.