I've been thoroughly investigating this issue over the past couple of days
and have discovered quite a bit. For one thing, there is definitely (at
least) one issue/bug in the Spark implementation that leads to incorrect
results for models generated with rank > 1 or a large number of iterations.
I will post a bug report with a thorough explanation this weekend or on
Monday.

I believe I've been able to track down every difference between the Spark
and Oryx implementations that leads to different results. I made some
adjustments to the Spark implementation so that, given the same initial
product/item vectors, the resulting model is identical to the one produced
by Oryx within a small numerical tolerance. I've verified this for small
data sets and am working on verifying it with some large data sets.

Aside from those already identified in this thread, another significant
difference in the Spark implementation is that it begins the factorization
process by computing the product matrix (Y) from the initial user matrix
(X). Both of the papers on ALS referred to in this thread begin the process
by computing the user matrix. I haven't compared the quality of the models
generated starting from Y versus X, but the models themselves are very
different. Is there a reason Spark begins the iteration by computing Y?

Initializing both X and Y as is done in the Spark implementation seems
unnecessary unless I'm overlooking some desired side-effect. Only the factor
matrix from which the other is computed in the first iteration needs to be
initialized.
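
To make the last two points concrete, here is a minimal sketch of my own.
It is not the Spark or Oryx code: it assumes explicit ratings, rank 1 and
plain L2 regularization, and the names solve_side, als_rank1, lam and start
are made up for illustration. It takes a single initial vector, and the
start argument decides which factor is computed first.

```python
def solve_side(ratings, fixed, n_rows, side, lam):
    """Rank-1 regularized least squares for one side of the factorization.

    Each factor is sum(r * f) / (sum(f * f) + lam), taken over the observed
    ratings that involve that user (side="user") or item (side="item").
    """
    num = [0.0] * n_rows
    den = [lam] * n_rows
    for (u, i), r in ratings.items():
        row, col = (u, i) if side == "user" else (i, u)
        num[row] += r * fixed[col]
        den[row] += fixed[col] ** 2
    return [n / d for n, d in zip(num, den)]


def als_rank1(ratings, n_users, n_items, init, start, iters=5, lam=0.1):
    """Alternate between the user vector x and the item vector y.

    start="X": init is used as y, and x is computed first.
    start="Y": init is used as x, and y is computed first.
    Only one of the two vectors is ever initialized.
    """
    if start == "X":
        y = list(init)
        for _ in range(iters):
            x = solve_side(ratings, y, n_users, "user", lam)
            y = solve_side(ratings, x, n_items, "item", lam)
    else:
        x = list(init)
        for _ in range(iters):
            y = solve_side(ratings, x, n_items, "item", lam)
            x = solve_side(ratings, y, n_users, "user", lam)
    return x, y


# Hypothetical toy data: (user, item) -> rating
ratings = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0,
           (1, 2): 2.0, (2, 1): 3.0, (2, 2): 5.0}
init = [0.1, 0.5, 0.9]

x_first = als_rank1(ratings, 3, 3, init, start="X", iters=1)
y_first = als_rank1(ratings, 3, 3, init, start="Y", iters=1)
print(x_first)
print(y_first)
```

With the same initial vector, the two orderings give different factors
after one iteration on this toy data. Whether they stay apart after many
iterations on real data is something I have not checked with this sketch.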

I also found that the product and user RDDs were being rebuilt many times
over in my tests, even for tiny data sets. By persisting the RDD returned
from updateFeatures() I was able to avoid a raft of duplicate computations.
Is there a reason not to do this?
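
The effect can be shown with a toy model of lazy evaluation. This is only
an analogy, not Spark's API: LazyDataset and run are hypothetical names,
and the real change is simply a persist() call on the RDD returned from
updateFeatures(). Each iteration derives a new dataset from the previous
one and then runs an action on it, and the sketch counts how often any
dataset is evaluated.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's RDD)."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._persist = False
        self.evaluations = 0

    def persist(self):
        # Keep the first computed result instead of recomputing it.
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

    def map(self, fn):
        # Evaluating the child pulls its data from the parent.
        return LazyDataset(lambda: [fn(v) for v in self.collect()])


def run(iters, persist):
    """Total number of dataset evaluations over `iters` iterations."""
    base = LazyDataset(lambda: [1.0, 2.0, 3.0])
    history = [base]
    current = base
    for _ in range(iters):
        current = current.map(lambda v: v * 0.5)
        if persist:
            current.persist()
        current.collect()  # one action per iteration
        history.append(current)
    return sum(d.evaluations for d in history)


print(run(3, persist=False))
print(run(3, persist=True))
```

For three iterations the toy model performs 9 evaluations without
persisting and 4 with it, because every action otherwise walks the whole
chain back to the base data. The gap grows with the number of iterations.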

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2704.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
