(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search and other learning-to-rank use cases, so the design should be a little more generic than pure recommender settings.
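To make "arbitrary relevance scores" concrete: the difference is in the gain term of NDCG. Here's a minimal sketch in plain Scala (the names are just illustrative, not the existing RankingMetrics API), using the Kaggle-style gain 2^rel - 1 that comes up in the quoted thread below:

    // Rough sketch of graded-relevance NDCG@k: gain = 2^rel - 1, discount = 1 / log2(rank + 2) for 0-based rank.
    // Illustrative only - not the existing RankingMetrics implementation.
    object GradedNdcg {
      // predicted: item ids in ranked order; relevance: ground-truth item id -> graded relevance score
      def ndcgAt(predicted: Seq[Int], relevance: Map[Int, Double], k: Int): Double = {
        def gain(rel: Double): Double = math.pow(2.0, rel) - 1.0
        def discount(rank: Int): Double = 1.0 / (math.log(rank + 2.0) / math.log(2.0))

        val dcg = predicted.take(k).zipWithIndex.map { case (item, i) =>
          gain(relevance.getOrElse(item, 0.0)) * discount(i)
        }.sum

        // Ideal DCG: same formula over the ground-truth relevance scores sorted descending, truncated at k
        val idcg = relevance.values.toSeq.sorted(Ordering[Double].reverse).take(k)
          .zipWithIndex.map { case (rel, i) => gain(rel) * discount(i) }.sum

        if (idcg == 0.0) 0.0 else dcg / idcg
      }
    }

With binary relevance (every relevant item scored 1.0) this should reduce to the current behaviour, since 2^1 - 1 = 1, so it would be a strict generalization rather than a behaviour change for existing users.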
The one issue with the proposed implementation is that it is not compatible with the existing cross-validators within a pipeline. As I've mentioned on the linked JIRAs & PRs, one option is to create a special set of cross-validators for recommenders that address (a) dataset splitting specific to recommender settings (user-based stratified sampling, time-based splits, etc.) and (b) ranking-based evaluation.

The other option is to have the ALSModel itself capable of generating the "ground-truth" set within the same DataFrame output from "transform" (i.e. predict top-k), which can then be fed into the cross-validator (with a RankingEvaluator) directly. That's the approach I took so far in https://github.com/apache/spark/pull/12574.

Both options are valid and have their positives & negatives - open to comments / suggestions.

On Tue, 20 Sep 2016 at 06:08 Jong Wook Kim <jongw...@nyu.edu> wrote:

> Thanks for the clarification and the relevant links. I overlooked the
> comments explicitly saying that the relevance is binary.
>
> I understand that the label is not a relevance score, but I have been, and I
> think many people are, using the label as relevance in the implicit-feedback
> context where an exact user-provided label is not defined anyway. I think
> that's why RiVal <https://github.com/recommenders/rival> uses the term
> "preference" for both the label for MAE and the relevance for NDCG.
>
> At the same time, I see why Spark decided to assume the relevance is
> binary, in part to conform to the class RankingMetrics's constructor. I
> think it would be nice if the upcoming DataFrame-based RankingEvaluator
> could optionally be given a "relevance column" with non-binary relevance
> values, otherwise defaulting to either 1.0 or the label column.
>
> My extended version of RankingMetrics is here:
> https://github.com/jongwook/spark-ranking-metrics . It has a test case
> checking that the numbers are the same as RiVal's.
>
> Jong Wook
>
>
>
> On 19 September 2016 at 03:13, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, relevance is always 1. The label is not a relevance score, so I
>> don't think it's valid to use it as such.
>>
>> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim <jongw...@nyu.edu> wrote:
>> > Hi,
>> >
>> > I'm trying to evaluate a recommendation model, and found that Spark and
>> > RiVal give different results, and it seems that RiVal's result is what
>> > Kaggle defines:
>> > https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>> >
>> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
>> > incorrect?
>> >
>> > To my knowledge, NDCG should depend on the relevance (or preference)
>> > values, but Spark's implementation seems not to; it uses 1.0 where it
>> > should be 2^(relevance) - 1, probably assuming that relevance is all 1.0.
>> > I also tried tweaking it, but its method of obtaining the ideal DCG also
>> > seems wrong.
>> >
>> > Any feedback from MLlib developers would be appreciated. I made a
>> > modified/extended version of RankingMetrics that produces numbers
>> > identical to Kaggle's and RiVal's results, and I'm wondering if it is
>> > something appropriate to be added back to MLlib.
>> >
>> > Jong Wook
>> >
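For reference, the current RDD-based RankingMetrics only takes the ranked item ids plus the ground-truth item set per user/query, so there is simply no slot for a graded relevance score - which is why ndcgAt effectively treats every relevant item as gain 1.0. A quick example (assumes a SparkContext sc, e.g. in spark-shell; the item ids are made up):

    import org.apache.spark.mllib.evaluation.RankingMetrics
    import org.apache.spark.rdd.RDD

    // Each record is (predicted item ids in ranked order, ground-truth relevant item ids).
    // There is nowhere to attach a per-item relevance score, hence the binary-relevance
    // behaviour discussed in the quoted thread above.
    val predictionAndLabels: RDD[(Array[Int], Array[Int])] = sc.parallelize(Seq(
      (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5)),
      (Array(4, 1, 5, 6, 2), Array(1, 2, 3))
    ))

    val metrics = new RankingMetrics(predictionAndLabels)
    println(metrics.ndcgAt(5))
    println(metrics.precisionAt(5))
    println(metrics.meanAveragePrecision)

Whichever option we go with for the DataFrame-based API, supporting an optional relevance score would mean changing this input shape rather than just the metric formula.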