I think multiplying lambda by the number of ratings is a heuristic that
works on ratings problems like the Netflix dataset (or other ratings
datasets), but the scope of NMF is much broader than that...

@Sean, please correct me if you disagree...

It's definitely good to add these ratings-specific optimizations, but they
shouldn't go into the core algorithm; they belong in something like a
RatingALS that extends the core ALS. A rough sketch of what I mean (class
and method names here are made up, not the actual MLlib API):
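
    class ALS(val rank: Int, val lambda: Double) {
      // Regularization used in each local least-squares solve;
      // nObs is the number of ratings for the user/item being solved.
      protected def regularization(nObs: Int): Double = lambda

      // ... core factorization logic ...
    }

    // Ratings-specific variant: applies the lambda * numRatings
    // heuristic without touching the core algorithm.
    class RatingALS(rank: Int, lambda: Double) extends ALS(rank, lambda) {
      override def regularization(nObs: Int): Double = lambda * nObs
    }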



On Wed, Apr 2, 2014 at 12:06 AM, Michael Allman <m...@allman.ms> wrote:

> Hi Nick,
>
> I don't have my spark clone in front of me, but off the top of my head the
> major differences are/were:
>
> 1. Oryx multiplies lambda by alpha.
>
> 2. Oryx uses a different matrix inverse algorithm. It maintains a certain
> symmetry which the Spark algo does not; however, I don't think this
> difference has a real impact on the results.
>
> 3. Oryx supports the specification of a convergence threshold for
> terminating the algorithm, based on the delta RMSE on a subset of the
> training set, if I recall correctly. I've been using that as the
> termination criterion instead of a fixed number of iterations.
>
> 4. Oryx uses the weighted regularization scheme you alluded to below,
> multiplying lambda by the number of ratings (see the sketch after this
> list).
>
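> To make (1), (2), and (4) concrete, here is roughly what the per-user
> solve looks like (schematic Scala using Breeze; the names and signatures
> are illustrative, not the actual Oryx or Spark code):
>
>     import breeze.linalg.{DenseMatrix, DenseVector}
>
>     // Solve (YtY + reg * I) x = Ytr for one user's factor vector.
>     def solveUser(YtY: DenseMatrix[Double], Ytr: DenseVector[Double],
>                   lambda: Double, alpha: Double,
>                   numRatings: Int): DenseVector[Double] = {
>       // Spark-style regularization:   reg = lambda
>       // Oryx, point (1):              reg = lambda * alpha
>       // weighted scheme, point (4):   reg = lambda * numRatings
>       val reg = lambda * alpha * numRatings // Oryx applies both
>       val A = YtY + (DenseMatrix.eye[Double](Ytr.length) * reg)
>       // Point (2): Oryx solves this symmetric system with a
>       // symmetry-preserving (Cholesky-like) solver; Spark's general
>       // solve does not, though the results should agree.
>       A \ Ytr
>     }
>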
> I've patched the Spark impl to support (4) but haven't pushed it to my
> clone on GitHub. I think it would be a valuable feature to support
> officially. I'd also like to work on (3), but I don't have time now; I've
> only been using Oryx for the past couple of weeks.
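>
> For (3), the termination loop is conceptually something like this (a
> sketch; trainOneIteration and rmseOnProbe are hypothetical helpers, not
> real Spark or Oryx API):
>
>     val maxIterations = 20          // illustrative values
>     val convergenceThreshold = 1e-4
>     var previousRmse = Double.MaxValue
>     var converged = false
>     var iter = 0
>     while (iter < maxIterations && !converged) {
>       trainOneIteration()           // one full ALS sweep
>       val rmse = rmseOnProbe()      // RMSE on a held-out subset
>       converged = (previousRmse - rmse) < convergenceThreshold
>       previousRmse = rmse
>       iter += 1
>     }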
>
> Cheers,
>
> Michael
>
> On Tue, 1 Apr 2014, Nick Pentreath [via Apache Spark User List] wrote:
>
> > Hi Michael
> >
> > Would you mind setting out exactly what differences you found between
> > the Spark and Oryx implementations? It would be good to be clear on
> > them, and also to see whether there are further tricks/enhancements
> > from the Oryx one that can be ported (such as the lambda * numRatings
> > adjustment).
> >
> > N
> >
> >
> > On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <[hidden email]> wrote:
> >       I've been thoroughly investigating this issue over the past
> >       couple of days and have discovered quite a bit. For one thing,
> >       there is definitely (at least) one issue/bug in the Spark
> >       implementation that leads to incorrect results for models
> >       generated with rank > 1 or a large number of iterations. I will
> >       post a bug report with a thorough explanation this weekend or on
> >       Monday.
> >
> >       I believe I've been able to track down every difference between
> >       the Spark and Oryx implementations that leads to different
> >       results. I made some adjustments to the Spark implementation so
> >       that, given the same initial product/item vectors, the resulting
> >       model is identical to the one produced by Oryx within a small
> >       numerical tolerance. I've verified this for small data sets and
> >       am working on verifying it with some large data sets.
> >
> >       Aside from those already identified in this thread, another
> >       significant difference in the Spark implementation is that it
> >       begins the factorization process by computing the product matrix
> >       (Y) from the initial user matrix (X). Both of the papers on ALS
> >       referred to in this thread begin the process by computing the
> >       user matrix. I haven't done any testing comparing the models
> >       generated starting from Y versus X, but they are very different.
> >       Is there a reason Spark begins the iteration by computing Y?
> >
> >       Initializing both X and Y, as is done in the Spark
> >       implementation, seems unnecessary unless I'm overlooking some
> >       desired side effect. Only the factor matrix which generates the
> >       other in the first iteration needs to be initialized.
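> >
> >       Schematically, the difference is (pseudocode; randomInit and
> >       solve are stand-ins for the real code):
> >
> >           // Papers: initialize Y only, compute X first.
> >           var Y = randomInit(numProducts, rank)
> >           var X: Array[Array[Double]] = null
> >           for (_ <- 1 to iterations) {
> >             X = solve(ratingsByUser, Y)   // user matrix from Y
> >             Y = solve(ratingsByItem, X)   // product matrix from new X
> >           }
> >
> >           // Spark: initializes both X and Y but computes Y from X
> >           // first, so the initial Y is overwritten before it is used.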
> >
> >       I also found that the product and user RDDs were being rebuilt
> >       many times over in my tests, even for tiny data sets. By
> >       persisting the RDD returned from updateFeatures() I was able to
> >       avoid a raft of duplicate computations. Is there a reason not to
> >       do this?
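> >
> >       For example (a sketch; the actual updateFeatures signature in
> >       ALS.scala takes more arguments):
> >
> >           import org.apache.spark.storage.StorageLevel
> >
> >           // products and users are the factor RDD vars in the loop.
> >           for (_ <- 1 to iterations) {
> >             products = updateFeatures(users /* ... */)
> >               .persist(StorageLevel.MEMORY_AND_DISK)
> >             users = updateFeatures(products /* ... */)
> >               .persist(StorageLevel.MEMORY_AND_DISK)
> >           }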
> >
> >       Thanks.
> >
