Sean, Walrus,

Great catch. I think this is a bug in the code (see below for a comparison of the current vs. the correct code). Also, here's another link <http://cbcb.umd.edu/~hcorrada/PML/homeworks/HW04_solutions.pdf> describing the derivation.
-Ameet

*CURRENT*
newWeights = weightsOld.sub(normGradient).div(2.0 * thisIterStepSize * regParam + 1.0)

*CORRECT*
newWeights = weightsOld.mul(1.0 - 2.0 * thisIterStepSize * regParam).sub(normGradient)

On Thu, Jan 9, 2014 at 11:45 AM, Sean Owen <[email protected]> wrote:
> Yes, the regularization term just adds a bunch of (theta_i)^2 terms.
> The partial derivative with respect to theta_i is simply 2*theta_i,
> since all the other new regularization terms are 0 w.r.t. theta_i. The
> regularization term just adds the weight vector itself to the gradient
> -- simples.
>
> ... give or take a factor of 2. To be fair, there is minor variation in
> convention here; some put a factor of 1/2 in front of the L2
> regularization term to absorb the 2 in the partial derivatives, for
> tidiness. It doesn't matter in the sense that it's the same as using a
> lambda half as large, but then again, that does matter if you're
> trying to make apples-to-apples comparisons with another
> implementation.
>
> See about slide 20 here for some clear equations:
>
> http://people.cs.umass.edu/~sheldon/teaching/2012fa/ml/files/lec7-annotated.pdf
>
> And now I have basically the same question. I'm not sure I get how the
> code in Updater implements L2 regularization. I see the
> weights-minus-gradient part, but the division by the scalar doesn't
> look right immediately. It looks like the shrinking term, but then
> there should be a minus in there, and it ought to be a multiplier on
> the old weights only?
>
> Heh, if it's a slightly different definition, it would really make
> Walrus's point!
>
>
> On Thu, Jan 9, 2014 at 7:10 PM, Evan R. Sparks <[email protected]>
> wrote:
> > Hi,
> >
> > The L2 update rule is derived from the derivative of the loss function
> > with respect to the model weights - an L2 regularized loss function
> > contains an additional additive term involving the weights.
> > This paper provides some useful mathematical background:
> > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377
> >
> > The code that computes the new L2 weight is here:
> > https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala#L90
> >
> > The compute function calculates the new weights based on the current
> > weights and gradient as computed at each step. Contrast it with the
> > code in the SimpleUpdater class to get a sense for how the
> > regularization parameter is incorporated - it's fairly simple.
> >
> > In general, though, I agree it makes sense to include a discussion of
> > the algorithm and a reference to the specific version we implement in
> > the scaladoc.
> >
> > - Evan
> >
> >
> > On Thu, Jan 9, 2014 at 10:49 AM, Walrus theCat <[email protected]>
> > wrote:
> >>
> >> No -- I'm not, and I appreciate the comment. What I'm looking for is a
> >> specific mathematical formula that I can map to the source code.
> >>
> >> Personally, specifically, I'd like to see how the loss function gets
> >> embedded into the w (gradient), in the case of the regularized and
> >> unregularized operation.
> >>
> >> Looking through the source, the "loss history" makes sense to me, but I
> >> can't see how that translates into the effect on the gradient.
> >>
> >>
> >> On Thu, Jan 9, 2014 at 10:39 AM, Sean Owen <[email protected]> wrote:
> >>>
> >>> L2 regularization just means "regularizing by penalizing parameters
> >>> whose L2 norm is large", and L2 norm just means squared length. It's
> >>> not something you would write an ML paper on any more than what the
> >>> vector dot product is. Are you asking something else?
> >>>
> >>> On Thu, Jan 9, 2014 at 6:19 PM, Walrus theCat <[email protected]>
> >>> wrote:
> >>> > Thanks Christopher,
> >>> >
> >>> > I wanted to know if there was a specific paper this particular
> >>> > codebase was based on.
> >>> > For instance, Weka cites papers in their documentation.
> >>> >
> >>> >
> >>> > On Wed, Jan 8, 2014 at 7:10 PM, Christopher Nguyen <[email protected]>
> >>> > wrote:
> >>> >>
> >>> >> Walrus, given the question, this may be a good place for you to
> >>> >> start. There's some good discussion there as well as links to
> >>> >> papers.
> >>> >>
> >>> >> http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization
> >>> >>
> >>> >> Sent while mobile. Pls excuse typos etc.
> >>> >>
> >>> >> On Jan 8, 2014 2:24 PM, "Walrus theCat" <[email protected]>
> >>> >> wrote:
> >>> >>>
> >>> >>> Hi,
> >>> >>>
> >>> >>> Can someone point me to the paper that algorithm is based on?
> >>> >>>
> >>> >>> Thanks
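The proposed fix in Ameet's message can be sanity-checked numerically. Below is a minimal Python sketch of the math only (the real Updater code operates on jblas DoubleMatrix objects in Scala; the variable names mirror the snippet above, and normGradient is assumed here to be the data-loss gradient already scaled by the step size):

```python
# One step of gradient descent on an L2-regularized loss,
# written two ways to confirm they agree.

def l2_update(weights_old, norm_gradient, step_size, reg_param):
    # The *CORRECT* form: w_new = w * (1 - 2 * step * lambda) - normGradient
    shrink = 1.0 - 2.0 * step_size * reg_param
    return [w * shrink - g for w, g in zip(weights_old, norm_gradient)]

def l2_update_direct(weights_old, grad_loss, step_size, reg_param):
    # Direct derivation: d/dw [loss(w) + lambda * ||w||^2] = grad_loss + 2*lambda*w,
    # so w_new = w - step * grad_loss - 2 * step * lambda * w.
    return [w - step_size * g - 2.0 * step_size * reg_param * w
            for w, g in zip(weights_old, grad_loss)]

# Illustrative values only.
w = [0.5, -1.0, 2.0]
grad = [0.1, 0.2, -0.3]
step, lam = 0.05, 0.1

a = l2_update(w, [step * g for g in grad], step, lam)
b = l2_update_direct(w, grad, step, lam)
assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
```

Note that the *CURRENT* code's divide-by-scalar form has no algebraic rearrangement that matches this step, which is consistent with it being a bug.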

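Sean's factor-of-2 point can also be illustrated concretely: putting 1/2 in front of the L2 penalty just amounts to halving lambda, so the two conventions produce identical gradients. A small Python sketch (illustrative names, not from the Spark codebase):

```python
# Gradient of the L2 penalty under the two common conventions.

def grad_half_convention(w, lam):
    # d/dw of (lam/2) * ||w||^2 is lam * w
    return [lam * wi for wi in w]

def grad_plain_convention(w, lam2):
    # d/dw of lam2 * ||w||^2 is 2 * lam2 * w
    return [2.0 * lam2 * wi for wi in w]

w = [0.5, -1.0, 2.0]
lam = 0.2

# The (lam/2) convention with lam matches the plain convention with lam/2.
g1 = grad_half_convention(w, lam)
g2 = grad_plain_convention(w, lam / 2.0)
assert all(abs(x - y) < 1e-12 for x, y in zip(g1, g2))
```

This is why, as Sean notes, lambda values are only comparable across implementations once you know which convention each one uses.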