Nice! Thanks for the links Ted!

On Tue, May 14, 2013 at 9:20 AM, Ted Dunning <[email protected]> wrote:
> On Mon, May 13, 2013 at 10:48 PM, Dan Filimon <[email protected]> wrote:
>
> > SGD and batch gradient descent have the same expected errors however.
>
> But not the same rate of convergence.
>
> SGD converges at a rate of 1/n [1]. Second order techniques converge
> quadratically in the number of iterations, but each iteration can be quite
> expensive in terms of the number of points.

Right, but an iteration of SGD processes just 1 point, whereas an iteration of
BGD processes n points. Also, I might be confused here, but I thought BGD is
not a second order technique. It still only computes the gradient, which is
the first derivative, right? (Or rather, the version I learned in school does
this :)

> Mini-batch algorithms can combine these by doing batch updates on small
> batches. They can be viewed as mini-batch or mega-SGD depending on your
> point of view.
>
> SGD can also use a higher order conjugate gradient technique like BFGS.
> Vowpal Wabbit uses this to good effect [2]. Update averaging can have a
> similar effect [3].

About update averaging: is it basically averaging k updates (i.e., from
looking at k points) into 1 update per iteration?

> [1] http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
> [2] http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
> [3] http://arxiv.org/pdf/1303.6149v1.pdf
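To make the per-iteration cost comparison concrete, here is a rough sketch in
plain NumPy (not Mahout code; the least-squares loss, function names, and step
sizes are purely illustrative) of the three update rules discussed above.
Batch gradient descent touches all n points per update, SGD touches one, and
the mini-batch variant touches k. All three follow the same gradient in
expectation; what differs is the cost and the noise of each individual step.

import numpy as np

def gradient(w, X, y):
    """Gradient of the mean squared error 0.5 * ||X w - y||^2 / len(y)."""
    return X.T @ (X @ w - y) / len(y)

def batch_gd(X, y, lr=0.1, iters=100):
    # Each iteration uses all n points.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * gradient(w, X, y)
    return w

def sgd(X, y, lr=0.1, epochs=1):
    # Each update uses a single point; n updates per pass over the data.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w -= lr * gradient(w, X[i:i+1], y[i:i+1])
    return w

def minibatch_sgd(X, y, lr=0.1, epochs=1, k=32):
    # "Mini-batch or mega-SGD": each update uses k points.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), k):
            batch = idx[start:start + k]
            w -= lr * gradient(w, X[batch], y[batch])
    return w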

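On the update averaging question: I'm not certain this is exactly what [3]
describes, but one common scheme is Polyak-Ruppert iterate averaging, where
plain SGD keeps running and the running average of the iterates is what gets
reported, rather than k per-point updates being folded into a single step.
A minimal sketch under that assumption (same illustrative least-squares setup
as the sketch above):

import numpy as np

def averaged_sgd(X, y, lr=0.01, epochs=1):
    # SGD on the least-squares loss with Polyak-Ruppert iterate averaging:
    # the running mean of the iterates w_1, ..., w_t is returned instead of
    # the last iterate.
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            grad = X[i] * (X[i] @ w - y[i])   # gradient at a single point
            w -= lr * grad
            t += 1
            w_avg += (w - w_avg) / t          # running mean of iterates
    return w_avg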