http://nbviewer.jupyter.org/github/craffel/theano-tutorial/blob/master/Theano%20Tutorial.ipynb (section 24):
updates = []
for param in params:
    param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)
    updates.append((param, param - learning_rate*param_update))
    updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))

The last two lines appear to differ from the classical momentum definition:

update = momentum * previous_update - learning_rate * gradient
W_new = W_old + update

I tested the implementation from the tutorial, and no matter what value of momentum I use, the results are very similar. On the other hand, if I implement the momentum as:

updates.append((param, param + param_update))
updates.append((param_update, momentum * param_update - learning_rate * T.grad(cost, param)))

it works as expected (convergence gets faster as momentum increases). Can anyone explain the implementation in the tutorial? Where did it come from?
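For what it's worth, here is the kind of toy comparison I ran, rewritten as a minimal plain-Python sketch (the quadratic cost, learning rate, starting point, and step count are arbitrary choices of mine, not anything from the tutorial):

# Minimal plain-Python sketch (a toy reproduction, not code from the tutorial)
# comparing the two rules on a 1-D quadratic cost f(w) = 0.5 * w**2, whose gradient
# is simply w.  Theano applies every update in the list simultaneously, so both
# rules below compute the new w and the new velocity v from the *old* values.

def run(rule, momentum, lr=0.01, w0=5.0, steps=500):
    w, v = w0, 0.0
    for _ in range(steps):
        grad = w  # d/dw of 0.5 * w**2
        if rule == "tutorial":
            # w <- w - lr * v ;  v <- momentum * v + (1 - momentum) * grad
            w, v = w - lr * v, momentum * v + (1. - momentum) * grad
        else:
            # classical momentum:  v <- momentum * v - lr * grad ;  w <- w + v
            w, v = w + v, momentum * v - lr * grad
    return abs(w)

for m in (0.0, 0.5, 0.9):
    print("momentum=%.1f   tutorial |w|=%.2e   classical |w|=%.2e"
          % (m, run("tutorial", m), run("classical", m)))

With the tutorial's rule the final distance from the minimum comes out essentially the same for every momentum value, whereas with the classical rule it shrinks much faster as momentum grows, which is the behaviour I described above.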
