Nice! Thanks for the links Ted!

On Tue, May 14, 2013 at 9:20 AM, Ted Dunning <[email protected]> wrote:
> On Mon, May 13, 2013 at 10:48 PM, Dan Filimon <[email protected]> wrote:
>
> > SGD and batch gradient descent have the same expected errors however.
>
> But not the same rate of convergence.
>
> SGD converges at a rate of 1/n [1]. Second order techniques converge
> quadratically in the number of iterations, but each iteration can be quite
> expensive in terms of the number of points.

Right, but an iteration of SGD processes just 1 point, whereas an iteration of
BGD processes n points. Also, I might be confused here, but I thought BGD is
not a second order technique. It still only computes the gradient, which is
the first derivative, right? (Or rather, the version I learned in school does
this :)

> Mini-batch algorithms can combine these by doing batch updates on small
> batches. They can be viewed as mini-batch or mega-SGD depending on your
> point of view.
>
> SGD can also use a higher order conjugate gradient technique like BFGS.
> Vowpal Wabbit uses this to good effect [2]. Update averaging can have a
> similar effect [3].

About update averaging: is it basically averaging k updates (i.e., from
looking at k points) into 1 update per iteration?

> [1] http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
> [2] http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
> [3] http://arxiv.org/pdf/1303.6149v1.pdf
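To make the per-iteration cost comparison concrete, here is a rough sketch in
plain NumPy (not Mahout code; the least-squares loss, function names, and step
sizes are purely illustrative) of the three update rules discussed above.
Batch gradient descent touches all n points per update, SGD touches one, and
the mini-batch variant touches k. All three follow the same gradient in
expectation; what differs is the cost and the noise of each individual step.

import numpy as np

def gradient(w, X, y):
    """Gradient of the mean squared error 0.5 * ||X w - y||^2 / len(y)."""
    return X.T @ (X @ w - y) / len(y)

def batch_gd(X, y, lr=0.1, iters=100):
    # Each iteration uses all n points.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * gradient(w, X, y)
    return w

def sgd(X, y, lr=0.1, epochs=1):
    # Each update uses a single point; n updates per pass over the data.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w -= lr * gradient(w, X[i:i+1], y[i:i+1])
    return w

def minibatch_sgd(X, y, lr=0.1, epochs=1, k=32):
    # "Mini-batch or mega-SGD": each update uses k points.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), k):
            batch = idx[start:start + k]
            w -= lr * gradient(w, X[batch], y[batch])
    return w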

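On the update averaging question: I'm not certain this is exactly what [3]
describes, but one common scheme is Polyak-Ruppert iterate averaging, where
plain SGD keeps running and the running average of the iterates is what gets
reported, rather than k per-point updates being folded into a single step.
A minimal sketch under that assumption (same illustrative least-squares setup
as the sketch above):

import numpy as np

def averaged_sgd(X, y, lr=0.01, epochs=1):
    # SGD on the least-squares loss with Polyak-Ruppert iterate averaging:
    # the running mean of the iterates w_1, ..., w_t is returned instead of
    # the last iterate.
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            grad = X[i] * (X[i] @ w - y[i])   # gradient at a single point
            w -= lr * grad
            t += 1
            w_avg += (w - w_avg) / t          # running mean of iterates
    return w_avg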