Abstract: Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance and allows a significantly smaller memory footprint, which might also be exploited to improve machine throughput.
In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. We adopt a learning rate that corresponds to a constant average weight update per gradient calculation (i.e., per unit cost of computation), and point out that this results in a variance of the weight updates that increases linearly with the mini-batch size m.
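The scaling argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes per-example gradients are independent draws from a scalar Gaussian (a deliberate simplification), and the names `update_stats`, `base_lr`, `grad_mean` and `grad_std` are hypothetical. With a learning rate of `base_lr * m` applied to the mini-batch mean gradient, the update equals `-base_lr` times the sum of the m per-example gradients, so the mean update per gradient calculation stays fixed while the variance of the update grows in proportion to m.

```python
import numpy as np


def update_stats(m, base_lr=0.01, grad_mean=1.0, grad_std=2.0,
                 trials=50_000, seed=0):
    """Simulate one SGD weight update for mini-batch size m.

    Assumption: per-example gradients are i.i.d. scalar Gaussians.
    Returns (mean update per gradient calculation, variance of the update).
    """
    rng = np.random.default_rng(seed)
    # One row per simulated mini-batch, one column per example gradient.
    grads = rng.normal(grad_mean, grad_std, size=(trials, m))
    # Learning rate scaled linearly with m, applied to the mean gradient:
    # (base_lr * m) * mean(g) == base_lr * sum(g).
    lr = base_lr * m
    updates = -lr * grads.mean(axis=1)
    mean_per_gradient = updates.mean() / m
    variance = updates.var()
    return mean_per_gradient, variance


if __name__ == "__main__":
    for m in (2, 8, 32):
        mean_per_gradient, variance = update_stats(m)
        print(f"m={m:3d}  mean/gradient={mean_per_gradient:.5f}  "
              f"variance={variance:.6f}")
```

Under these assumptions the variance of the update is `base_lr**2 * m * grad_std**2`, so quadrupling m should roughly quadruple the variance while the mean per gradient calculation stays near `-base_lr * grad_mean`.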
The collected experimental results for the CIFAR-10, CIFAR-100 and ImageNet datasets show that increasing the mini-batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. On the other hand, small mini-batch sizes provide more up-to-date gradient calculations, which yields more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between m = 2 and m = 32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
Conclusion: We have presented an empirical study of the performance of mini-batch stochastic gradient descent, and reviewed the underlying theoretical assumptions relating training duration and learning rate scaling to mini-batch size. The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4.
With batch normalization (BN) and larger datasets, larger batch sizes can be useful, up to batch size m = 32 or m = 64. However, these datasets would typically require a distributed implementation to avoid excessively long training. In these cases, the best solution would be to implement both BN and stochastic gradient optimization over multiple processors, which would imply the use of a small batch size per worker. We have also observed that the best values of the batch size for BN are often smaller than the overall SGD batch size.
The results also highlight the optimization difficulties associated with large batch sizes. The range of usable base learning rates decreases significantly for larger batch sizes, often to the extent that the optimal learning rate could not be used. We suggest that this can be attributed to a linear increase in the variance of the weight updates with the batch size.
Overall, the experimental results support the broad conclusion that using small batch sizes for training provides benefits both in terms of the range of learning rates that provide stable convergence and the test performance achieved for a given number of epochs.
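The observation that the BN batch size can be smaller than the overall SGD batch size can be illustrated with a small sketch. This is not the authors' implementation: the function name `batchnorm_subbatches` and its parameters are hypothetical, the learned scale and shift and the running statistics used at inference are omitted, and the input is assumed to be a 2-D array of shape (m, features). Each consecutive group of `bn_batch_size` examples is normalized with its own mean and variance, as would happen if each worker computed BN statistics over its local share of the SGD batch.

```python
import numpy as np


def batchnorm_subbatches(x, bn_batch_size, eps=1e-5):
    """Normalize x of shape (m, features) using per-sub-batch statistics.

    Assumption: the SGD batch of size m is split into consecutive groups of
    bn_batch_size examples, and each group uses its own mean and variance.
    """
    m, features = x.shape
    if m % bn_batch_size != 0:
        raise ValueError("SGD batch size must be a multiple of bn_batch_size")
    # Shape (groups, bn_batch_size, features): one slice per BN sub-batch.
    groups = x.reshape(m // bn_batch_size, bn_batch_size, features)
    mean = groups.mean(axis=1, keepdims=True)
    var = groups.var(axis=1, keepdims=True)
    normalized = (groups - mean) / np.sqrt(var + eps)
    # Restore the original (m, features) layout for the SGD step.
    return normalized.reshape(m, features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activations = rng.normal(3.0, 2.0, size=(32, 4))  # SGD batch of m = 32
    out = batchnorm_subbatches(activations, bn_batch_size=8)
    print(out.shape)
```

Setting `bn_batch_size` equal to m recovers ordinary batch statistics over the full SGD batch, which makes the two settings easy to compare in an experiment.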