On Tue, May 14, 2013 at 12:06 AM, Dan Filimon <[email protected]> wrote:
> Nice! Thanks for the links Ted!
>

yeah... I can google!


> > SGD converges with 1/n [1]. Second order techniques converge
> > quadratically in the number of iterations but each iteration can be
> > quite expensive in terms of the number of points.
>
> Right, but iteration in SGD is processing just 1 point whereas in BGD it's
> processing n points.
>
> And also, I might be confused here, but I thought BGD is not a second
> order technique. It still only computes the gradient which is the first
> derivative (or rather, the version I learned in school does this :) ?
>

BGD = Batch Gradient Descent is, indeed, not quadratic. But there are
second order methods. And quasi-second order methods that are forms of
conjugate gradients.


> > Mini-batch algorithms can combine these by doing batch updates on small
> > batches. They can be viewed as mini-batch or mega-SGD depending on your
> > point of view.
> >
> > SGD can also use a higher order conjugate gradient technique like BFGS.
> > Vowpal Wabbit uses this to good effect [2]. Update averaging can have a
> > similar effect [3].
>
> About update averaging, is it basically averaging k updates (by looking at
> k points I mean) into 1 per iteration?
>

Yeah. The reason it works (I haven't gone through the details in detail)
is that this averaging is like a poor man's conjugate gradient.


> > [1] http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
> >
> > [2] http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
> >
> > [3] http://arxiv.org/pdf/1303.6149v1.pdf
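
To make the contrast in the thread concrete (one point per update in SGD, all
n points per update in BGD, a small batch per update in mini-batch), here is a
rough numpy sketch on a toy least-squares problem. None of this is Mahout or
Vowpal Wabbit code; the problem, step sizes, and batch size are invented
purely for illustration.

    import numpy as np

    # Toy least-squares problem: loss(w) = mean over i of 0.5 * (x_i . w - y_i)^2.
    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.1 * rng.normal(size=n)

    def gradient(w, Xb, yb):
        # Gradient of the squared error averaged over the rows of the batch.
        return Xb.T @ (Xb @ w - yb) / len(yb)

    def sgd(steps=5000, lr=0.01):
        # SGD: one randomly chosen point per update; cheap but noisy steps.
        w = np.zeros(d)
        for _ in range(steps):
            i = rng.integers(n)
            w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])
        return w

    def batch_gd(steps=200, lr=0.1):
        # BGD: every update looks at all n points; exact gradient, costly step.
        w = np.zeros(d)
        for _ in range(steps):
            w -= lr * gradient(w, X, y)
        return w

    def minibatch_gd(steps=1000, batch=32, lr=0.05):
        # Mini-batch: the middle ground ("mini-batch or mega-SGD").
        w = np.zeros(d)
        for _ in range(steps):
            idx = rng.integers(n, size=batch)
            w -= lr * gradient(w, X[idx], y[idx])
        return w

    for name, w in [("SGD", sgd()), ("batch GD", batch_gd()),
                    ("mini-batch", minibatch_gd())]:
        print(name, "error:", np.linalg.norm(w - true_w))

The point is only the cost-per-update trade-off; all three land at roughly the
same solution on this toy problem.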

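On the update-averaging point: one standard form is iterate averaging in the
Polyak-Ruppert style, where plain SGD runs as usual but the running average of
the visited weights is reported instead of the last iterate. That is only my
reading of the "poor man's" acceleration being discussed; I haven't checked
whether [3] uses exactly this scheme, and the toy setup below is again
invented for illustration (with a noisier target so the effect shows up).

    import numpy as np

    # Same kind of toy least-squares setup as the sketch above, but noisier.
    rng = np.random.default_rng(1)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.5 * rng.normal(size=n)

    def averaged_sgd(steps=5000, lr=0.02):
        # Plain single-point SGD plus a running mean of the iterates.
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for t in range(1, steps + 1):
            i = rng.integers(n)
            grad = X[i] * (X[i] @ w - y[i])   # gradient at one point
            w -= lr * grad                    # ordinary SGD step
            w_avg += (w - w_avg) / t          # incremental mean of the iterates
        return w, w_avg

    w_last, w_avg = averaged_sgd()
    # Distance from the weights that generated the data, for both estimates.
    print("last iterate error:    ", np.linalg.norm(w_last - true_w))
    print("averaged iterate error:", np.linalg.norm(w_avg - true_w))

With a constant step size the last iterate keeps bouncing around in a noise
ball, while the average cancels much of that gradient noise and usually lands
closer to the optimum, which is at least one way to read the "poor man's
conjugate gradient" remark.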