Greetings,
I train MLPRegressors on small datasets, usually with 10-50
observations. The default batch_size=min(200, n_samples) for the adam
optimizer, and because my n_samples is always < 200, it effectively
becomes batch_size=n_samples. According to the theory, stochastic
gradient-based optimizers benefit from small batches:
small batch sizes are typically used to speed up training (more parameter
updates per epoch) and to avoid the issue that large training sets don't fit
into memory. Okay, the additional noise from the stochastic approach may also
help to escape local minima and/or improve generalization performance.
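
To make the situation concrete, here is a minimal sketch of what I mean
(the synthetic data and hyperparameters are made up for illustration, not
my actual setup): with the defaults, batch_size resolves to n_samples for
such a small dataset, so every adam update sees the full data, whereas an
explicit smaller batch_size restores mini-batch updates.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Tiny synthetic dataset: 30 observations, 5 features (made-up numbers).
    rng = np.random.RandomState(0)
    X = rng.rand(30, 5)
    y = X @ rng.rand(5) + 0.1 * rng.randn(30)

    # Default batch_size='auto' -> min(200, n_samples) = 30 here,
    # so each adam step uses the whole dataset (no mini-batch noise).
    full_batch = MLPRegressor(hidden_layer_sizes=(16,), solver='adam',
                              max_iter=2000, random_state=0)
    full_batch.fit(X, y)

    # Explicitly smaller batches reintroduce stochasticity:
    # more updates per epoch, noisier gradients.
    mini_batch = MLPRegressor(hidden_layer_sizes=(16,), solver='adam',
                              batch_size=5, max_iter=2000, random_state=0)
    mini_batch.fit(X, y)

    print(full_batch.n_iter_, mini_batch.n_iter_)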