On Tue, May 14, 2013 at 12:06 AM, Dan Filimon <[email protected]> wrote:
> Nice! Thanks for the links Ted!
>

yeah... I can google!


> > SGD converges with 1/n [1]. Second order techniques converge
> > quadratically in the number of iterations but each iteration can be
> > quite expensive in terms of the number of points.
>
> Right, but iteration in SGD is processing just 1 point whereas in BGD it's
> processing n points.
>
> And also, I might be confused here, but I thought BGD is not a second
> order technique. It still only computes the gradient which is the first
> derivative (or rather, the version I learned in school does this :) ?
>

BGD = Batch Gradient Descent is, indeed, not quadratic. But there are
second order methods. And quasi-second order methods that are forms of
conjugate gradients.


> > Mini-batch algorithms can combine these by doing batch updates on small
> > batches. They can be viewed as mini-batch or mega-SGD depending on your
> > point of view.
> >
> > SGD can also use a higher order conjugate gradient technique like BFGS.
> > Vowpal Wabbit uses this to good effect [2]. Update averaging can have a
> > similar effect [3].
>
> About update averaging, is it basically averaging k updates (by looking at
> k points I mean) into 1 per iteration?
>

Yeah. The reason it works (I haven't gone through the details in detail)
is that this averaging is like a poor man's conjugate gradient.


> > [1] http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
> >
> > [2] http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
> >
> > [3] http://arxiv.org/pdf/1303.6149v1.pdf
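
To make the contrast in the thread concrete (one point per update in SGD, all
n points per update in BGD, a small batch per update in mini-batch), here is a
rough numpy sketch on a toy least-squares problem. None of this is Mahout or
Vowpal Wabbit code; the problem, step sizes, and batch size are invented
purely for illustration.

    import numpy as np

    # Toy least-squares problem: loss(w) = mean over i of 0.5 * (x_i . w - y_i)^2.
    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.1 * rng.normal(size=n)

    def gradient(w, Xb, yb):
        # Gradient of the squared error averaged over the rows of the batch.
        return Xb.T @ (Xb @ w - yb) / len(yb)

    def sgd(steps=5000, lr=0.01):
        # SGD: one randomly chosen point per update; cheap but noisy steps.
        w = np.zeros(d)
        for _ in range(steps):
            i = rng.integers(n)
            w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])
        return w

    def batch_gd(steps=200, lr=0.1):
        # BGD: every update looks at all n points; exact gradient, costly step.
        w = np.zeros(d)
        for _ in range(steps):
            w -= lr * gradient(w, X, y)
        return w

    def minibatch_gd(steps=1000, batch=32, lr=0.05):
        # Mini-batch: the middle ground ("mini-batch or mega-SGD").
        w = np.zeros(d)
        for _ in range(steps):
            idx = rng.integers(n, size=batch)
            w -= lr * gradient(w, X[idx], y[idx])
        return w

    for name, w in [("SGD", sgd()), ("batch GD", batch_gd()),
                    ("mini-batch", minibatch_gd())]:
        print(name, "error:", np.linalg.norm(w - true_w))

The point is only the cost-per-update trade-off; all three land at roughly the
same solution on this toy problem.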

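On the update-averaging point: one standard form is iterate averaging in the
Polyak-Ruppert style, where plain SGD runs as usual but the running average of
the visited weights is reported instead of the last iterate. That is only my
reading of the "poor man's" acceleration being discussed; I haven't checked
whether [3] uses exactly this scheme, and the toy setup below is again
invented for illustration (with a noisier target so the effect shows up).

    import numpy as np

    # Same kind of toy least-squares setup as the sketch above, but noisier.
    rng = np.random.default_rng(1)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.5 * rng.normal(size=n)

    def averaged_sgd(steps=5000, lr=0.02):
        # Plain single-point SGD plus a running mean of the iterates.
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for t in range(1, steps + 1):
            i = rng.integers(n)
            grad = X[i] * (X[i] @ w - y[i])   # gradient at one point
            w -= lr * grad                    # ordinary SGD step
            w_avg += (w - w_avg) / t          # incremental mean of the iterates
        return w, w_avg

    w_last, w_avg = averaged_sgd()
    # Distance from the weights that generated the data, for both estimates.
    print("last iterate error:    ", np.linalg.norm(w_last - true_w))
    print("averaged iterate error:", np.linalg.norm(w_avg - true_w))

With a constant step size the last iterate keeps bouncing around in a noise
ball, while the average cancels much of that gradient noise and usually lands
closer to the optimum, which is at least one way to read the "poor man's
conjugate gradient" remark.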