On Mon, May 13, 2013 at 10:48 PM, Dan Filimon <[email protected]> wrote:
> SGD and batch gradient descent have the same expected errors, however.

But not the same rate of convergence. SGD converges at a rate of roughly 1/n in the number of examples [1]. Second-order techniques converge quadratically in the number of iterations, but each iteration can be quite expensive in terms of the number of points.

Mini-batch algorithms can combine these by doing batch updates on small batches. They can be viewed as mini-batch gradient descent or as mega-SGD, depending on your point of view.

SGD can also use a higher-order quasi-Newton technique like BFGS. Vowpal Wabbit uses this to good effect [2]. Update averaging can have a similar effect [3].

[1] http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
[2] http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
[3] http://arxiv.org/pdf/1303.6149v1.pdf
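To make the contrast concrete, here is a minimal sketch of the two update rules (per-example SGD vs. an averaged mini-batch update) on a simple least-squares objective. The synthetic data, step sizes, and batch size are illustrative assumptions on my part, not Mahout's or VW's actual implementation:

import numpy as np

def sgd_step(w, x, y, lr=0.01):
    # Plain SGD: gradient of 0.5*(w.x - y)^2 for a single example
    grad = (w @ x - y) * x
    return w - lr * grad

def minibatch_step(w, X, Y, lr=0.1):
    # Mini-batch: average the per-example gradients over the batch;
    # the averaged gradient is less noisy, so a larger step is typical
    grad = (X @ w - Y) @ X / len(Y)
    return w - lr * grad

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
Y = X @ true_w + 0.1 * rng.normal(size=1000)

w_sgd = np.zeros(2)
for x, y in zip(X, Y):                 # one cheap update per example
    w_sgd = sgd_step(w_sgd, x, y)

w_mb = np.zeros(2)
for i in range(0, len(Y), 32):         # batch size 32, illustrative
    w_mb = minibatch_step(w_mb, X[i:i+32], Y[i:i+32])

print(w_sgd, w_mb)                     # both move toward true_w

The point of the sketch is only that the mini-batch version takes far fewer (but per-point more expensive) steps for a comparable result, which is the trade-off described above.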
