Abstract: Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the in- fluence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
Discussion: Normalization, either Batch Normalization, Layer Normalization, or Weight Normalization makes the learned function invariant to scaling of the weights w. This scaling is strongly affected by regularization. We know of no first order gradient method that can fully eliminate this effect. However, a direct solution of forcing kwk = 1 solves the problem. By doing this we also remove one hyperparameter from the training procedure. As noted by Salimans & Kingma (2016), the effect of weight and batch normalization on the effective learning rate might not necessarily be bad. If no regularization is used, then the norm of the weights tends to increase over time, and so the effective learning rate decreases. Often that is a desirable thing, and many training methods lower the learning rate explicitly. However, the decrease of effective learning rate can be hard to control, and can depend a lot on initial steps of training, which makes it harder to reproduce results. With batch normalization we have added two additional parameters, γ and β, and it of course makes sense to also regularize these. In our experiments we did not use regularization for these parameters, though preliminary experiments show that regularization here does not affect the results. This is not very surprising, since with rectified linear activation functions, scaling of γ also has no effect on the function value in subsequent layers. So the only parameters that are actually regularized are the γ’s for the last layer of the network.