Batch, Group, and Layer Normalization

Short Summary: Normalization approaches help reduce internal covariate shift and make training more stable, reducing the network's sensitivity to how its weights are initialized. There are three common types of normalization used during training: Batch Normalization, Group Normalization, and Layer Normalization.

Internal Covariate Shift

Definition: Internal Covariate Shift is the change in the distribution of network activations due to the change in network parameters during training.

During training, an input passes through many layers. When the parameters of a layer change, so does the distribution of the inputs to every subsequent layer. These shifting input distributions can be problematic, especially for deep neural networks with a large number of layers, because each layer must keep adapting to a distribution that changes as the layers before it are updated.

Example: a model is trained on images of black dogs, but images of gray dogs are used in the evaluation step. The pixel value distributions differ enough that performance may degrade. Internal covariate shift is the analogous effect inside the network: each hidden layer sees its input distribution change as the layers before it are updated.

Batch Normalization

\begin{equation} y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \end{equation}

where \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) are the mean and variance calculated per channel (dimension) over the mini-batch, and \(\gamma\) and \(\beta\) are learnable affine parameters.
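To make the computation concrete, here is a minimal PyTorch sketch that normalizes a 4D feature map by hand and compares the result with torch.nn.BatchNorm2d. The tensor shape, eps value, and use of the default \(\gamma = 1\), \(\beta = 0\) are illustrative assumptions, not part of the original text.

```python
import torch
import torch.nn as nn

# Hypothetical activations: (batch, channels, height, width).
x = torch.randn(8, 4, 16, 16)

eps = 1e-5
# Batch norm: statistics are shared across the batch,
# computed per channel over the (N, H, W) axes.
mean = x.mean(dim=(0, 2, 3), keepdim=True)                 # shape (1, 4, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + eps)              # gamma = 1, beta = 0

# Built-in layer (per-channel gamma/beta initialised to 1 and 0).
bn = nn.BatchNorm2d(num_features=4, eps=eps)
bn.train()
y_builtin = bn(x)

print(torch.allclose(y_manual, y_builtin, atol=1e-5))      # expected: True
```

At inference time, BatchNorm2d switches from batch statistics to running estimates of the mean and variance collected during training.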

Layer Normalization

\begin{equation} y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \end{equation}
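The formula is the same as for Batch Normalization; the difference is the axes over which \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) are computed. Layer Normalization computes them per sample over the feature dimensions, so the statistics are independent of the batch size. Below is a minimal PyTorch sketch; the (batch, sequence, features) shape and eps value are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical activations, e.g. from a transformer block: (batch, seq_len, features).
x = torch.randn(8, 32, 64)

eps = 1e-5
# Per-sample, per-position statistics over the feature dimension only.
mean = x.mean(dim=-1, keepdim=True)                        # shape (8, 32, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + eps)              # gamma = 1, beta = 0

# Built-in layer (per-feature gamma/beta initialised to 1 and 0).
ln = nn.LayerNorm(normalized_shape=64, eps=eps)
y_builtin = ln(x)

print(torch.allclose(y_manual, y_builtin, atol=1e-5))      # expected: True
```

Because no batch statistics are involved, Layer Normalization behaves identically in training and evaluation.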

Group Normalization

\begin{equation} y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \end{equation}
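Again the formula is unchanged; Group Normalization splits the channels of each sample into groups and computes \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) per sample and per group, so it too is independent of the batch size. A minimal PyTorch sketch follows; the 4-channel feature map and the choice of 2 groups are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical activations: (batch, channels, height, width).
x = torch.randn(8, 4, 16, 16)

eps = 1e-5
num_groups = 2                       # assumption: 4 channels split into 2 groups
channels_per_group = 4 // num_groups

# Per-sample, per-group statistics over (channels in group, H, W).
xg = x.reshape(8, num_groups, channels_per_group, 16, 16)
mean = xg.mean(dim=(2, 3, 4), keepdim=True)
var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
y_manual = ((xg - mean) / torch.sqrt(var + eps)).reshape(8, 4, 16, 16)

# Built-in layer (per-channel gamma/beta initialised to 1 and 0).
gn = nn.GroupNorm(num_groups=num_groups, num_channels=4, eps=eps)
y_builtin = gn(x)

print(torch.allclose(y_manual, y_builtin, atol=1e-5))      # expected: True
```

Like Layer Normalization, Group Normalization uses no batch statistics, so it behaves the same in training and evaluation.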