Re: possible bug in Spark's ALS implementation...

Sean Owen Wed, 02 Apr 2014 06:31:26 -0700

It should be kept in mind that different implementations are rarely
strictly better, and that what works well in one type of data might
not in another. It also bears keeping in mind that several of these
differences just amount to different amounts of regularization, which
need not be a difference.


#1 is just a question of 'semantics' really and not an algorithmic difference.

#4 is more subtle and I think it is a small positive to use weighted
regularization -- for all types of data. It is not specific to
explicit data since it is based on counts. That said the effect of
this is mostly "more regularization across the boards", and can be
simulated with a higher global lambda.

There might be something to #2. I suspect the QR decomposition is more
accurate and that can matter here. I have no evidence for that though.
(The symmetry wasn't actually an issue in the end no?)

#3 won't affect the result so much as when to stop. That is stopping
in a principled way after X iterations is good, but it can be
simulated with just specifying X iterations.

If I were to experiment next I would focus on #2 and maybe #4.
--
Sean Owen | Director, Data Science | London


On Wed, Apr 2, 2014 at 9:06 AM, Michael Allman <m...@allman.ms> wrote:
> Hi Nick,
>
> I don't have my spark clone in front of me, but OTOH the major differences
> are/were:
>
> 1. Oryx multiplies lambda by alpha.
>
> 2. Oryx uses a different matrix inverse algorithm. It maintains a certain
> symmetry which the Spark algo does not, however I don't think this
> difference has a real impact on the results.
>
> 3. Oryx supports the specification of a convergence threshold for
> termination of the algorithm, based on delta rmse on a subset of the
> training set if I recall correctly. I've been using that as the
> termination criterion instead of a fixed number of iterations.
>
> 4. Oryx uses the weighted regularization scheme you alluded to below,
> multiplying lambda by the number of ratings.
>
> I've patched the spark impl to support (4) but haven't pushed it to my
> clone on github. I think it would be a valuable feature to support
> officially. I'd also like to work on (3) but don't have time now. I've
> only been using Oryx the past couple of weeks.
>
> Cheers,
>
> Michael
>
> On Tue, 1 Apr 2014, Nick Pentreath [via Apache Spark User List] wrote:
>
>> Hi Michael
>> Would you mind setting out exactly what differences you did find between
>> the
>> Spark and Oryx implementations? Would be good to be clear on them, and
>> also
>> see if there are further tricks/enhancements from the Oryx one that can be
>> ported (such as the lambda * numRatings adjustment).
>>
>> N
>>
>>
>> On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <[hidden email]> wrote:
>>       I've been thoroughly investigating this issue over the past
>>       couple of days
>>       and have discovered quite a bit. For one thing, there is
>>       definitely (at
>>       least) one issue/bug in the Spark implementation that leads to
>>       incorrect
>>       results for models generated with rank > 1 or a large number of
>>       iterations.
>>       I will post a bug report with a thorough explanation this
>>       weekend or on
>>       Monday.
>>
>>       I believe I've been able to track down every difference between
>>       the Spark
>>       and Oryx implementations that lead to difference results. I made
>>       some
>>       adjustments to the spark implementation so that, given the same
>>       initial
>>       product/item vectors, the resulting model is identical to the
>>       one produced
>>       by Oryx within a small numerical tolerance. I've verified this
>>       for small
>>       data sets and am working on verifying this with some large data
>>       sets.
>>
>>       Aside from those already identified in this thread, another
>>       significant
>>       difference in the Spark implementation is that it begins the
>>       factorization
>>       process by computing the product matrix (Y) from the initial
>>       user matrix
>>       (X). Both of the papers on ALS referred to in this thread begin
>>       the process
>>       by computing the user matrix. I haven't done any testing
>>       comparing the
>>       models generated starting from Y or X, but they are very
>>       different. Is there
>>       a reason Spark begins the iteration by computing Y?
>>
>>       Initializing both X and Y as is done in the Spark implementation
>>       seems
>>       unnecessary unless I'm overlooking some desired side-effect.
>>       Only the factor
>>       matrix which generates the other in the first iteration needs to
>>       be
>>       initialized.
>>
>>       I also found that the product and user RDDs were being rebuilt
>>       many times
>>       over in my tests, even for tiny data sets. By persisting the RDD
>>       returned
>>       from updateFeatures() I was able to avoid a raft of duplicate
>>       computations.
>>       Is there a reason not to do this?
>>
>>       Thanks.
>>
>>
>>
>>       --
>>       View this message in
>> context:http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s
>>       -ALS-implementation-tp2567p2704.html
>>       Sent from the Apache Spark User List mailing list archive at
>>       Nabble.com.
>>
>>
>>
>> ____________________________________________________________________________
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s
>> -ALS-implementation-tp2567p3588.html
>> To unsubscribe from possible bug in Spark's ALS implementation..., click
>> here. NAML
>>
>>
>
> ________________________________
> View this message in context: Re: possible bug in Spark's ALS
> implementation...
>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: possible bug in Spark's ALS implementation...

Reply via email to