Hi Michael

Would you mind setting out exactly what differences you did find between
the Spark and Oryx implementations? Would be good to be clear on them, and
also see if there are further tricks/enhancements from the Oryx one that
can be ported (such as the lambda * numRatings adjustment).
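
For reference, the lambda * numRatings adjustment means scaling the
regularization term for each row by that row's rating count when solving the
normal equations. A rough sketch of one way that could look (the helper name
is made up, and Breeze is only used here for the linear algebra):

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Solve one user's (or item's) factor vector with the regularization
    // scaled by that row's rating count:
    //   (Y^T Y + lambda * numRatings * I) x = Y^T r
    def solveWithWeightedLambda(YtY: DenseMatrix[Double],
                                Ytr: DenseVector[Double],
                                numRatings: Int,
                                lambda: Double): DenseVector[Double] = {
      val k = YtY.rows
      val reg = DenseMatrix.eye[Double](k) * (lambda * numRatings)
      (YtY + reg) \ Ytr
    }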

N


On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <m...@allman.ms> wrote:

> I've been thoroughly investigating this issue over the past couple of days
> and have discovered quite a bit. For one thing, there is definitely (at
> least) one issue/bug in the Spark implementation that leads to incorrect
> results for models generated with rank > 1 or a large number of iterations.
> I will post a bug report with a thorough explanation this weekend or on
> Monday.
>
> I believe I've been able to track down every difference between the Spark
> and Oryx implementations that leads to different results. I made some
> adjustments to the Spark implementation so that, given the same initial
> product/item vectors, the resulting model is identical to the one produced
> by Oryx within a small numerical tolerance. I've verified this for small
> data sets and am working on verifying this with some large data sets.
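
(For anyone following along: "identical within a small numerical tolerance"
I would read as an element-wise comparison of the factor matrices, something
like the following. This is just an illustrative check, not Michael's actual
test.)

    // Illustrative only: element-wise comparison of two factor matrices
    // within an absolute tolerance.
    def approxEqual(a: Array[Array[Double]],
                    b: Array[Array[Double]],
                    tol: Double = 1e-6): Boolean =
      a.length == b.length && a.zip(b).forall { case (ra, rb) =>
        ra.length == rb.length &&
          ra.zip(rb).forall { case (x, y) => math.abs(x - y) <= tol }
      }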
>
> Aside from those already identified in this thread, another significant
> difference in the Spark implementation is that it begins the factorization
> process by computing the product matrix (Y) from the initial user matrix
> (X). Both of the papers on ALS referred to in this thread begin the process
> by computing the user matrix. I haven't done any testing comparing the
> models generated starting from Y versus starting from X, but the resulting
> models are very different. Is there a reason Spark begins the iteration by
> computing Y?
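
If I follow, the two orderings look roughly like this (a sketch only; the
solveX/solveY parameters stand in for the per-block least-squares step and
are not the MLlib API):

    object AlsOrderingSketch {
      // Each row is one user's or one item's latent factor vector.
      type Factors = Array[Array[Double]]

      // Ordering in the ALS papers: start from an initial Y and compute the
      // user matrix X from it in the first half-iteration.
      def paperOrder(initY: Factors, iters: Int,
                     solveX: Factors => Factors,
                     solveY: Factors => Factors): (Factors, Factors) = {
        var x = solveX(initY)
        var y = solveY(x)
        for (_ <- 2 to iters) {
          x = solveX(y)
          y = solveY(x)
        }
        (x, y)
      }

      // Ordering described above for Spark: start from the initial user
      // matrix X and compute the product matrix Y from it first.
      def sparkOrder(initX: Factors, iters: Int,
                     solveX: Factors => Factors,
                     solveY: Factors => Factors): (Factors, Factors) = {
        var y = solveY(initX)
        var x = solveX(y)
        for (_ <- 2 to iters) {
          y = solveY(x)
          x = solveX(y)
        }
        (x, y)
      }
    }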
>
> Initializing both X and Y as is done in the Spark implementation seems
> unnecessary unless I'm overlooking some desired side-effect. Only the
> factor matrix which generates the other in the first iteration needs to
> be initialized.
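
If that's right, a single random initialization should be enough, e.g. only
the item factors (the rank and numItems values below are just placeholders):

    import scala.util.Random

    val rng = new Random(42)
    val (numItems, rank) = (1000, 10)
    // Only Y gets random values; X falls out of the first solveX(initY) call
    // in the sketch above, so it never needs its own initialization.
    val initY: Array[Array[Double]] =
      Array.fill(numItems)(Array.fill(rank)(rng.nextGaussian()))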
>
> I also found that the product and user RDDs were being rebuilt many times
> over in my tests, even for tiny data sets. By persisting the RDD returned
> from updateFeatures() I was able to avoid a raft of duplicate computations.
> Is there a reason not to do this?
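
The usual pattern for an iterative job like this is to persist each
iteration's output and unpersist the previous one once the new one is
materialized. A generic sketch (not the actual ALS code; updateFeatures()
would play the role of step here):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Persist the result of each step so downstream stages reuse it instead
    // of recomputing the whole lineage, and release the previous copy after.
    def iterate[T](initial: RDD[T], iters: Int, step: RDD[T] => RDD[T]): RDD[T] = {
      var current = initial.persist(StorageLevel.MEMORY_AND_DISK)
      for (_ <- 1 to iters) {
        val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
        next.count()            // materialize before dropping the parent
        current.unpersist()
        current = next
      }
      current
    }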
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2704.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
