On Mon, May 13, 2013 at 10:48 PM, Dan Filimon <[email protected]> wrote:
> SGD and batch gradient descent have the same expected errors, however.

But not the same rate of convergence. SGD converges at a rate of roughly 1/n in the number of examples [1]. Second-order techniques converge quadratically in the number of iterations, but each iteration can be quite expensive in terms of the number of points.

Mini-batch algorithms can combine these by doing batch updates on small batches. They can be viewed as mini-batch gradient descent or as mega-SGD, depending on your point of view.

SGD can also use a higher-order quasi-Newton technique like BFGS. Vowpal Wabbit uses this to good effect [2]. Update averaging can have a similar effect [3].

[1] http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
[2] http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
[3] http://arxiv.org/pdf/1303.6149v1.pdf
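To make the contrast concrete, here is a minimal sketch of the two update rules (per-example SGD vs. an averaged mini-batch update) on a simple least-squares objective. The synthetic data, step sizes, and batch size are illustrative assumptions on my part, not Mahout's or VW's actual implementation:

import numpy as np

def sgd_step(w, x, y, lr=0.01):
    # Plain SGD: gradient of 0.5*(w.x - y)^2 for a single example
    grad = (w @ x - y) * x
    return w - lr * grad

def minibatch_step(w, X, Y, lr=0.1):
    # Mini-batch: average the per-example gradients over the batch;
    # the averaged gradient is less noisy, so a larger step is typical
    grad = (X @ w - Y) @ X / len(Y)
    return w - lr * grad

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
Y = X @ true_w + 0.1 * rng.normal(size=1000)

w_sgd = np.zeros(2)
for x, y in zip(X, Y):                 # one cheap update per example
    w_sgd = sgd_step(w_sgd, x, y)

w_mb = np.zeros(2)
for i in range(0, len(Y), 32):         # batch size 32, illustrative
    w_mb = minibatch_step(w_mb, X[i:i+32], Y[i:i+32])

print(w_sgd, w_mb)                     # both move toward true_w

The point of the sketch is only that the mini-batch version takes far fewer (but per-point more expensive) steps for a comparable result, which is the trade-off described above.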
