Yes, `GradientDescent` == (batch-)SGD.
That was also my first idea of how to implement it. However, what happens
if the regularization is specific to the algorithm actually used? For
example, L-BFGS with L1 regularization requires a different
`parameterUpdate` step (Orthant-Wise Limited-memory Quasi-Newton, OWL-QN).
+1
This separation was the idea from the start. There is a trade-off between
having highly configurable optimizers and ensuring that each type of
regularization can only be applied to optimization algorithms that support
it.
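One way to encode that constraint, as a hedged sketch with hypothetical names, is to let each solver declare via a type bound which penalties it accepts, so an unsupported combination fails at compile time:

```scala
// Hypothetical sketch: encode which penalties a solver supports in its
// type, so invalid pairings are rejected by the compiler.
sealed trait Regularization
trait DifferentiableReg extends Regularization  // e.g. L2
trait NonSmoothReg extends Regularization       // e.g. L1

case class L2Reg(lambda: Double) extends DifferentiableReg
case class L1Reg(lambda: Double) extends NonSmoothReg

// A solver declares, via its type parameter, the penalties it can handle.
abstract class Solver[R <: Regularization] {
  def optimize(reg: R): Unit
}

// Plain L-BFGS needs a smooth objective, so it only takes differentiable
// penalties; OWL-QN is the variant built for L1.
class LBFGS extends Solver[DifferentiableReg] {
  def optimize(reg: DifferentiableReg): Unit = { /* smooth update */ }
}
class OWLQN extends Solver[NonSmoothReg] {
  def optimize(reg: NonSmoothReg): Unit = { /* orthant-wise update */ }
}
```

With this, `new LBFGS().optimize(L1Reg(0.1))` fails to type-check while `new OWLQN().optimize(L1Reg(0.1))` is accepted; the cost is exactly the loss of free configurability described above.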
It comes down to viewing the optimization framework mostly as a