Line 376 should be correct: it is computing \sum_i (c_i - 1) x_i x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you computing any metrics to tell which recommendations are better? -Xiangrui
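To spell that out, here is a minimal Scala sketch (not the actual MLlib code; the names userGramian, itemFactors, ratings, and alpha are made up for illustration) of the per-user accumulation that line 376 performs: start from the precomputed Y^T Y and add the symmetric rank-1 term (c_i - 1) x_i x_i^T = (alpha * r_i) x_i x_i^T for each observed item.

    // Sketch only: per-user normal-equation matrix for implicit ALS,
    // Y^T C_u Y = Y^T Y + \sum_i (c_i - 1) x_i x_i^T, with c_i = 1 + alpha * r_i.
    def userGramian(
        YtY: Array[Array[Double]],        // precomputed Y^T Y, rank x rank
        itemFactors: Seq[Array[Double]],  // x_i for the items this user rated
        ratings: Seq[Double],             // r_i for those items
        alpha: Double): Array[Array[Double]] = {
      val rank = YtY.length
      val A = Array.tabulate(rank, rank)((i, j) => YtY(i)(j))
      for ((x, r) <- itemFactors.zip(ratings)) {
        val w = alpha * r                 // this is c_i - 1
        for (i <- 0 until rank; j <- 0 until rank) {
          A(i)(j) += w * x(i) * x(j)      // symmetric outer product (c_i - 1) x_i x_i^T
        }
      }
      A
    }

Note that the update touches A(i)(j) and A(j)(i) identically, so the result stays symmetric, as you would expect from Y^T C_u Y.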
On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi Michael,
>
> I can help check the current implementation. Would you please go to
> https://spark-project.atlassian.net/browse/SPARK and create a ticket
> about this issue with component "MLlib"? Thanks!
>
> Best,
> Xiangrui
>
> On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman <m...@allman.ms> wrote:
>> Hi,
>>
>> I'm implementing a recommender based on the algorithm described in
>> http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
>> basis for Spark's ALS implementation for data sets with implicit features.
>> The data set I'm working with is proprietary and I cannot share it, however
>> I can say that it's based on the same kind of data as in the paper---relative
>> viewing time of videos. (Specifically, the "rating" for each video is
>> defined as total viewing time across all visitors divided by video
>> duration.)
>>
>> I'm seeing counterintuitive, sometimes nonsensical recommendations. For
>> comparison, I've run the training data through Oryx's in-VM implementation
>> of implicit ALS with the same parameters. Oryx uses the same algorithm.
>> (Source in this file:
>> https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
>>
>> Compared to one another, the recommendations made by the two systems are
>> very different---more so than I think can be explained by differences in
>> initial state. The recommendations made by the Oryx models look much better,
>> especially as I increase the number of latent factors and the number of
>> iterations. The Spark models' recommendations don't improve with increases
>> in either latent factors or iterations. Sometimes they get worse.
>>
>> Because of the (understandably) highly optimized and terse style of Spark's
>> ALS implementation, I've had a very hard time following it well enough to
>> debug the issue definitively. However, I have found a section of code that
>> looks incorrect. As described in the paper, part of the implicit ALS
>> algorithm involves computing the matrix product YtCuY (equation 4 in the
>> paper). To optimize this computation, the expression is rewritten as
>> YtY + Yt(Cu - I)Y. I believe that's what should be happening here:
>>
>> https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
>>
>> However, it looks like this code is in fact computing YtY + YtY(Cu - I),
>> which is the same as YtYCu. If so, that's a bug. Can someone familiar with
>> this code evaluate my claim?
>>
>> Cheers,
>>
>> Michael
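For what it's worth, here is a toy numerical check of the distinction raised above, with made-up numbers (rank 2 and two observed items, chosen so the dimensions line up for both expressions). Yt(Cu - I)Y is a sum of symmetric terms (c_i - 1) x_i x_i^T, whereas YtY(Cu - I) merely scales the columns of YtY and is generally not symmetric.

    // Toy example only; the rows of Y are the item factor vectors x_1 = (1, 2), x_2 = (3, 4).
    val Y = Array(Array(1.0, 2.0), Array(3.0, 4.0))   // 2 items x rank 2
    val cMinusI = Array(2.0, 5.0)                     // c_i - 1 for the two items
    val rank = 2
    val YtY = Array.tabulate(rank, rank)((i, j) => Y.map(x => x(i) * x(j)).sum)
    // Yt(Cu - I)Y = \sum_i (c_i - 1) x_i x_i^T  ->  [[47, 64], [64, 88]], symmetric
    val correct = Array.tabulate(rank, rank)((i, j) =>
      Y.indices.map(k => cMinusI(k) * Y(k)(i) * Y(k)(j)).sum)
    // YtY(Cu - I): column-scales YtY            ->  [[20, 70], [28, 100]], not symmetric
    val suspect = Array.tabulate(rank, rank)((i, j) => YtY(i)(j) * cMinusI(j))

So if the code really were computing YtYCu, the per-user matrices would generally come out asymmetric, which is one quick way to test the claim.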