- on Sat 19 October 2019
Summary on Ioffe and Szegedy (2015)
Internal Covariate Shift is defined as the change in the distribution of network activations due to the change in network parameters during training. In a deep neural network, this shifting distribution of inputs at each layer demands careful parameter initialization and slower learning rates as training progresses. Batch Normalization is a technique that helps address these problems arising from internal covariate shift.
Batch normalization also helps by:
- Permitting higher learning rates and less careful initialization
- Acting as a regularizer, eliminating the need for Dropout
- Speeding up training (converging with 14 times fewer steps in the examples mentioned in the article)
The problem
Training a deep neural network is complicated, as the outputs and gradients of each layer are affected by the layers before it. Careful parameter initialization can keep the inputs to the activations in good regions, but only to a limited extent; beyond that, gradients vanish, either due to saturation or activations close to the origin.
Problems arising from Internal Covariate Shift:
- When activations move out of the linear regime of a sigmoid, gradients stop flowing.
- When using a ReLU, gradients are highly dependent on the activation values (which experience internal covariate shift). This can result in uncontrolled step sizes, so slower learning rates are required for convergence.
These problems are usually addressed using
- ReLU as the activation
- Careful initialization of weights (Glorot and Bengio (2010), Saxe et al. (2014))
- Small learning rates
Batch Normalization
Network training is known to converge faster if its inputs are whitened (LeCun et al. 1998), i.e., linearly transformed to have zero means and unit variances, and decorrelated.
It is known that convergence is usually faster if the average of each input is close to zero. In the extreme case of all inputs being positive (or all inputs being negative), the weight updates must all have the same sign: the weights can only all decrease or all increase together. If the actual solution requires some to decrease and some to increase, this can only be achieved by zigzagging. See Youtube: Lecture 6 | Training Neural Networks I | CS231n for a detailed explanation.
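The all-positive-input case can be checked with a small numpy sketch (the single neuron and its input values here are made up for illustration): for a neuron \(y = w \cdot x\), the gradient with respect to each weight is the upstream gradient times the corresponding input, so with all-positive inputs every weight update shares one sign.

```python
import numpy as np

x = np.array([0.5, 2.0, 1.3])  # all-positive inputs to a single neuron y = w . x
for upstream in (+1.0, -1.0):  # dL/dy arriving from the layers above
    grad_w = upstream * x      # dL/dw_i = (dL/dy) * x_i
    # every weight gradient has the same sign as the upstream gradient,
    # so in one step the weights can only all increase or all decrease
    assert np.all(np.sign(grad_w) == np.sign(upstream))
```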
Whitening the inputs to a neural network [mean cancellation + decorrelation (PCA) + covariance equalization] helps gradient flow, but the effect diminishes as we go deeper into the network. Batch normalization tries to apply the same idea at every layer, in an attempt to remove the ill effects of internal covariate shift.
However, full whitening would require computing the covariance matrix \(\text{Cov}[x] = \mathbb{E}_{x \in \mathcal{X}}[xx^T] - \mathbb{E}[x]\mathbb{E}[x]^T\) and its inverse square root to produce the whitened activations. This is computationally expensive and adds complexity to backpropagation. Hence decorrelation is typically skipped.
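As a concrete sketch (the data and variable names are illustrative), the full whitening pipeline of mean cancellation followed by multiplication with \(\text{Cov}[x]^{-1/2}\) can be written in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))  # correlated inputs

Xc = X - X.mean(axis=0)            # mean cancellation
cov = Xc.T @ Xc / len(Xc)          # Cov[x] = E[x x^T] - E[x] E[x]^T
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # inverse square root of Cov[x]
Xw = Xc @ W                        # whitened: zero mean, identity covariance

assert np.allclose(Xw.mean(axis=0), 0.0, atol=1e-10)
assert np.allclose(Xw.T @ Xw / len(Xw), np.eye(3), atol=1e-8)
```

Computing and backpropagating through \(W\) at every layer and every training step is what makes full whitening impractical; batch normalization keeps only the per-dimension mean and variance part.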
How it's done
At train time
Each dimension's mean and variance are computed across the samples in the mini-batch. The input is mean-cancelled and scaled by the inverse of the standard deviation.
For a layer with \(d\)-dimensional input \(\mathrm{x}=\left(x^{(1)} \ldots x^{(d)}\right)\), each dimension is normalized:
\[\widehat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right]}}\]
The above normalization roughly constrains the inputs of every layer to \([-1, 1]\) (a soft constraint). This would make it impossible for the activations (say, a sigmoid) to ever learn to saturate. Hence an additional provision is made to enable the normalization to become an identity function.
For each activation \(x^{(k)}\), learnable parameters \(\gamma^{(k)}, \beta^{(k)}\) are introduced which scale and shift the normalized value:
\[y^{(k)} = \gamma^{(k)} \widehat{x}^{(k)} + \beta^{(k)}\]
Setting \(\gamma^{(k)} = \sqrt{\operatorname{Var}\left[x^{(k)}\right]}\) and \(\beta^{(k)} = \mathrm{E}\left[x^{(k)}\right]\) recovers the original activations, i.e., the identity transform.
This Batch Normalizing Transform can be represented as:
\[\begin{aligned} \mu_{\mathcal{B}} &= \frac{1}{m} \sum_{i=1}^{m} x_i && \text{(mini-batch mean)} \\ \sigma_{\mathcal{B}}^2 &= \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu_{\mathcal{B}}\right)^2 && \text{(mini-batch variance)} \\ \widehat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} && \text{(normalize)} \\ y_i &= \gamma \widehat{x}_i + \beta \equiv \mathrm{BN}_{\gamma, \beta}(x_i) && \text{(scale and shift)} \end{aligned}\]
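A minimal numpy sketch of the train-time transform (the function name is mine; \(\epsilon\) is the small constant added to the variance for numerical stability):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch x of shape (N, d)."""
    mu = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                    # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

x = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(256, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
# with gamma = 1, beta = 0 the outputs have (near) zero mean and unit variance
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)
```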
At evaluation time
During evaluation/test, population statistics replace the mini-batch statistics in the batch normalization transformation: at every layer, population estimates of \(\mathrm{E}\left[x\right]\) and \(\operatorname{Var}\left[x\right]\) are used.
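With fixed population statistics, the transform collapses into a per-dimension affine function of the input. A sketch (the population statistics here are made-up values; in practice they would come from running averages collected during training):

```python
import numpy as np

def batch_norm_eval(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # y = gamma * (x - E[x]) / sqrt(Var[x] + eps) + beta, folded into
    # a single fixed scale-and-shift per dimension
    scale = gamma / np.sqrt(pop_var + eps)
    return scale * x + (beta - scale * pop_mean)

x = np.array([[1.0, -2.0], [3.0, 0.5]])
pop_mean, pop_var = np.array([2.0, -0.75]), np.array([1.0, 1.5625])
y = batch_norm_eval(x, gamma=np.ones(2), beta=np.zeros(2),
                    pop_mean=pop_mean, pop_var=pop_var)
# matches normalizing directly with the population statistics
assert np.allclose(y, (x - pop_mean) / np.sqrt(pop_var + 1e-5))
```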
With batch normalization, the gradients flow through additional paths
Batch Normalization is a differentiable transformation that introduces normalized activations into the network.
The mean, variance, etc., computed during normalization are considered for gradient backpropagation.
Let \(x\) be an input to a layer \(F(x,...)\), and let \(\mathcal{X} = x_{1..N}\) be the set of values of \(x\) over the training set.
With batch normalization \(F(x,...)\) is transformed to \(F(\widehat{x},...)\), where \(\widehat{x} = \text{Norm}(x, \mathcal{X})\) represents the whitening/normalization process.
In the unnormalized network, the gradient flowing back to \(x_i\) would not flow through the rest of \(\mathcal{X}\). In the normalized version, the gradients of \(F(\widehat{x},...)\) through \(\widehat{x}_i\) also flow through \(x_i\) and the rest of \(\mathcal{X}\), via the shared batch statistics, so these additional paths must be accounted for in backpropagation.
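This coupling can be verified numerically: perturbing a single element of a batch changes every normalized output, because the shared mean and variance change (the `normalize` helper below is a one-dimensional sketch of my own):

```python
import numpy as np

def normalize(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
x_pert = x.copy()
x_pert[3] += 1e-4                 # perturb only the last element
delta = normalize(x_pert) - normalize(x)
# all four normalized values move, not just the perturbed one, so the
# gradient through any x_hat_i flows back into every x_j in the batch
assert np.all(np.abs(delta) > 0)
```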
Benefits
- Activations can receive inputs in the linear regime throughout all layers.
- Improves gradient flow through the network by reducing the dependence of gradients on the scale of the parameters. Thus higher learning rates can be used.
- Regularizes the model, reducing the need for Dropout.
A training example is seen in conjunction with the other examples in the mini-batch, so the network cannot produce deterministic values for a given training example. This acts like a regularizer: if the network tries to overfit a given example, it inadvertently affects the other training examples and is penalized with a higher loss.
Tensorflow implementation
TODO
References
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256. 2010. URL: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf. ↩
Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR, 2015. URL: http://arxiv.org/abs/1502.03167, arXiv:1502.03167. ↩
Yann LeCun, Leon Bottou, G Orr, and Klaus-Robert Muller. Efficient backprop. Neural Networks: Tricks of the Trade. New York: Springer, 1998. URL: http://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf. ↩
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. 2014. URL: http://arxiv.org/abs/1312.6120. ↩