It would be helpful to know what parameter inputs you are using. If the
regularization schemes differ (by a factor of alpha, which can often be
quite high), then the same parameter settings could give very different
results. A higher lambda would be required with Spark's version for the
results to be comparable.
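To make the convention difference concrete, here is a rough Breeze sketch
of one implicit-ALS user update (illustrative only; the names are made up
and this is not the code from either project), showing where the
regularization scale enters:

import breeze.linalg.{DenseMatrix, DenseVector}

object RegConvention {
  // One implicit-ALS user update solves the normal equations
  //   (YtY + Yt(Cu - I)Y + reg * I) x_u = Yt Cu p_u
  // where Cu = diag(1 + alpha * r_ui) and p_ui = 1 for observed items.
  // The scaleRegByAlpha flag marks exactly the convention difference at
  // issue: reg = lambda in one scheme, reg = alpha * lambda in the other.
  def userUpdate(
      YtY: DenseMatrix[Double],               // precomputed Y^T Y (k x k)
      itemFactors: Seq[DenseVector[Double]],  // y_i for items this user touched
      ratings: Seq[Double],                   // raw r_ui for those items
      alpha: Double,
      lambda: Double,
      scaleRegByAlpha: Boolean): DenseVector[Double] = {
    val k = YtY.rows
    val A = YtY.copy
    val b = DenseVector.zeros[Double](k)
    for ((y, r) <- itemFactors.zip(ratings)) {
      val c = 1.0 + alpha * r                 // confidence c_ui
      A += (y * y.t) * (c - 1.0)              // accumulates Yt (Cu - I) Y
      b += y * c                              // accumulates Yt Cu p_u
    }
    val reg = if (scaleRegByAlpha) alpha * lambda else lambda
    A += DenseMatrix.eye[Double](k) * reg     // add reg * I
    A \ b                                     // solve for x_u
  }
}

With the same (lambda, alpha) inputs, the two conventions optimize
different objectives, so matching results requires rescaling lambda by
alpha.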
When I submitted the PR for this, I verified (on the ml-100k, ml-1m and
ml-10m data sets) that this version gives the same RMSE as Mahout's
implicit model, as well as a separate Spark version that I wrote as a
from-scratch port of the Mahout algorithm (though I didn't compare against
Myrrix/Oryx). I'm fairly confident things are correct, but if there is a
bug, let's definitely find and fix it!

@Sean, would it be a good idea to look at changing the regularization in
Spark's ALS to alpha * lambda? What is the thinking behind this? If I
recall correctly, the Mahout version added something like
(# ratings * lambda) as the regularization in each factor update for the
explicit case, but for implicit it was just lambda (I may be wrong here).

On Wed, Mar 12, 2014 at 4:57 AM, Xiangrui Meng <men...@gmail.com> wrote:
> Line 376 should be correct as it is computing
> \sum_i (c_i - 1) x_i x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you
> computing some metrics to tell which recommendation is better? -Xiangrui
>
> On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > Hi Michael,
> >
> > I can help check the current implementation. Would you please go to
> > https://spark-project.atlassian.net/browse/SPARK and create a ticket
> > about this issue with component "MLlib"? Thanks!
> >
> > Best,
> > Xiangrui
> >
> > On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman <m...@allman.ms> wrote:
> >> Hi,
> >>
> >> I'm implementing a recommender based on the algorithm described in
> >> http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms
> >> the basis for Spark's ALS implementation for data sets with implicit
> >> features. The data set I'm working with is proprietary and I cannot
> >> share it; however, I can say that it's based on the same kind of data
> >> as in the paper---relative viewing time of videos. (Specifically, the
> >> "rating" for each video is defined as total viewing time across all
> >> visitors divided by video duration.)
> >>
> >> I'm seeing counterintuitive, sometimes nonsensical recommendations.
> >> For comparison, I've run the training data through Oryx's in-VM
> >> implementation of implicit ALS with the same parameters. Oryx uses the
> >> same algorithm. (Source in this file:
> >> https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
> >>
> >> The recommendations made by the two systems are very different---more
> >> so than I think could be explained by differences in initial state.
> >> The recommendations made by the Oryx models look much better,
> >> especially as I increase the number of latent factors and the number
> >> of iterations. The Spark models' recommendations don't improve with
> >> increases in either latent factors or iterations. Sometimes, they get
> >> worse.
> >>
> >> Because of the (understandably) highly optimized and terse style of
> >> Spark's ALS implementation, I've had a very hard time following it
> >> well enough to debug the issue definitively. However, I have found a
> >> section of code that looks incorrect. As described in the paper, part
> >> of the implicit ALS algorithm involves computing the matrix product
> >> YtCuY (equation 4 in the paper). To optimize this computation, this
> >> expression is rewritten as YtY + Yt(Cu - I)Y.
> >> I believe that's what should be happening here:
> >>
> >> https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
> >>
> >> However, it looks like this code is in fact computing YtY + YtY(Cu - I),
> >> which is the same as YtYCu. If so, that's a bug. Can someone familiar
> >> with this code evaluate my claim?
> >>
> >> Cheers,
> >>
> >> Michael
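For what it's worth, the identity behind the optimization is easy to check
numerically. Here is a small Breeze sketch (illustrative only; random data,
not the MLlib code under discussion) comparing the direct computation of
YtCuY against the YtY + Yt(Cu - I)Y rewrite:

import breeze.linalg.{DenseMatrix, DenseVector, diag, max}
import breeze.numerics.abs

object YtCuYCheck {
  def main(args: Array[String]): Unit = {
    val n = 5                                  // items the user interacted with
    val k = 3                                  // latent factors
    val Y = DenseMatrix.rand(n, k)             // item-factor matrix (n x k)
    val r = DenseVector.rand(n)                // raw "ratings"
    val alpha = 40.0
    val c = r.map(1.0 + alpha * _)             // confidences c_ui

    // Direct: Yt Cu Y with Cu = diag(c).
    val direct = Y.t * diag(c) * Y

    // Rewrite: YtY (reusable across all users) plus the sparse correction
    // Yt (Cu - I) Y = sum_i (c_i - 1) * y_i * y_i^T.
    val YtY = Y.t * Y
    val correction = DenseMatrix.zeros[Double](k, k)
    for (i <- 0 until n) {
      val yi = Y(i, ::).t                      // i-th item factor (k-vector)
      correction += (yi * yi.t) * (c(i) - 1.0)
    }
    val rewritten = YtY + correction

    // Should print ~0 (up to floating-point error).
    println(max(abs(direct - rewritten)))
  }
}

Comparing both forms against whatever the implementation actually
accumulates at line 376 would settle the question quickly.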