Hi Nick,

I don't have my Spark clone in front of me, but off the top of my head the 
major differences are/were:

1. Oryx multiplies lambda by alpha.

2. Oryx uses a different matrix inverse algorithm. It maintains a certain 
symmetry which the Spark algorithm does not; however, I don't think this 
difference has a real impact on the results.

3. Oryx supports specifying a convergence threshold for terminating the 
algorithm, based on the delta RMSE on a subset of the training set if I 
recall correctly. I've been using that as the termination criterion instead 
of a fixed number of iterations.

4. Oryx uses the weighted regularization scheme you alluded to below, 
multiplying lambda by the number of ratings for the user or item being 
solved for (see the sketch after this list).
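
To make (4) concrete, here is roughly what the per-user solve looks like 
with weighted regularization. This is just a sketch, not the actual Oryx or 
Spark code; solveUser is an illustrative helper, and commons-math3 is used 
for the solve only to keep the example short:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, ArrayRealVector, CholeskyDecomposition}

// Solve one user's least-squares problem given the factor vectors of the
// items that user rated. itemFactors(i) is the factor vector of the i-th
// rated item, ratings(i) the corresponding rating, lambda the regularization
// parameter, rank the number of latent factors.
def solveUser(itemFactors: Array[Array[Double]],
              ratings: Array[Double],
              lambda: Double,
              rank: Int): Array[Double] = {
  val numRatings = ratings.length
  // Accumulate A = Y^T Y and b = Y^T r over the items this user rated.
  val a = Array.ofDim[Double](rank, rank)
  val b = new Array[Double](rank)
  for (i <- 0 until numRatings) {
    val y = itemFactors(i)
    for (p <- 0 until rank) {
      b(p) += y(p) * ratings(i)
      for (q <- 0 until rank) a(p)(q) += y(p) * y(q)
    }
  }
  // Weighted regularization: add lambda * numRatings to the diagonal,
  // i.e. A += lambda * numRatings * I rather than A += lambda * I.
  for (p <- 0 until rank) a(p)(p) += lambda * numRatings
  // A is symmetric positive definite here, so a Cholesky-based solve applies.
  val solver = new CholeskyDecomposition(new Array2DRowRealMatrix(a)).getSolver
  solver.solve(new ArrayRealVector(b)).toArray
}

The only difference from the unweighted scheme is the lambda * numRatings 
term on the diagonal; with lambda alone you get the standard regularized 
normal equations.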

I've patched the Spark implementation to support (4) but haven't pushed it 
to my clone on GitHub. I think it would be a valuable feature to support 
officially. I'd also like to work on (3) but don't have time now; a rough 
sketch of the termination check I have in mind is below. I've only been 
using Oryx for the past couple of weeks.
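
For (3), the termination check would look something like this (again just a 
sketch, not the Oryx code; runIteration and evalRmse stand in for whatever 
the real implementation provides):

// Run ALS iterations until the improvement in RMSE on a probe subset of the
// training data drops below a threshold, rather than for a fixed count.
def runUntilConverged(runIteration: () => Unit,
                      evalRmse: () => Double,
                      threshold: Double = 0.001,
                      maxIterations: Int = 50): Int = {
  var previousRmse = Double.MaxValue
  var iterations = 0
  var converged = false
  while (iterations < maxIterations && !converged) {
    runIteration()            // one alternating update of the factor matrices
    val rmse = evalRmse()     // RMSE on the probe subset
    // Stop once an iteration no longer improves RMSE by at least `threshold`.
    if (previousRmse - rmse < threshold) converged = true
    previousRmse = rmse
    iterations += 1
  }
  iterations
}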

Cheers,

Michael

On Tue, 1 Apr 2014, Nick Pentreath [via Apache Spark User List] wrote:

> Hi Michael
> Would you mind setting out exactly what differences you did find between the
> Spark and Oryx implementations? Would be good to be clear on them, and also
> see if there are further tricks/enhancements from the Oryx one that can be
> ported (such as the lambda * numRatings adjustment).
> 
> N
> 
> 
> On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <[hidden email]> wrote:
>       I've been thoroughly investigating this issue over the past couple
>       of days and have discovered quite a bit. For one thing, there is
>       definitely (at least) one issue/bug in the Spark implementation that
>       leads to incorrect results for models generated with rank > 1 or a
>       large number of iterations. I will post a bug report with a thorough
>       explanation this weekend or on Monday.
>
>       I believe I've been able to track down every difference between the
>       Spark and Oryx implementations that leads to different results. I
>       made some adjustments to the Spark implementation so that, given the
>       same initial product/item vectors, the resulting model is identical
>       to the one produced by Oryx within a small numerical tolerance. I've
>       verified this for small data sets and am working on verifying it
>       with some large data sets.
>
>       Aside from those already identified in this thread, another
>       significant difference in the Spark implementation is that it begins
>       the factorization process by computing the product matrix (Y) from
>       the initial user matrix (X). Both of the papers on ALS referred to
>       in this thread begin the process by computing the user matrix. I
>       haven't done any testing comparing the models generated starting
>       from Y versus X, but they are very different. Is there a reason
>       Spark begins the iteration by computing Y?
>
>       Initializing both X and Y as is done in the Spark implementation
>       seems unnecessary unless I'm overlooking some desired side effect.
>       Only the factor matrix which generates the other in the first
>       iteration needs to be initialized.
>
>       I also found that the product and user RDDs were being rebuilt many
>       times over in my tests, even for tiny data sets. By persisting the
>       RDD returned from updateFeatures() I was able to avoid a raft of
>       duplicate computations. Is there a reason not to do this?
>
>       Thanks.



