Short Summary: Normalization layers help reduce internal covariate shift and make training more stable, which lowers the network's sensitivity to the choice of weight initialization. Three common types of normalization used in training are Batch Normalization, Layer Normalization, and Group Normalization.

### Internal Covariate Shift

Definition: Internal Covariate Shift is the change in the distribution of network activations due to the change in network parameters during training.

During training, an input passes through many layers. When the parameters of one layer change, the distribution of the inputs to all subsequent layers changes as well. Each layer then has to keep adapting to a moving input distribution, which slows training. The effect compounds with depth, so it is most problematic for deep neural networks with a large number of layers.

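Example (by analogy): A model is trained on images of black dogs, but images of gray dogs are used in the evaluation step. The pixel values differ enough that this may cause performance degradation. This is covariate shift at the input of the network; internal covariate shift is the same effect at the hidden layers, where the "inputs" are the activations produced by the preceding layers.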
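To make the definition concrete, here is a minimal sketch of how a parameter update changes the distribution seen by the next layer. The one-unit `linear_layer` helper and the weight and bias values are hypothetical and chosen only for illustration; the input batch is held fixed so that any change in the statistics comes from the parameters alone.

```python
import statistics


def linear_layer(inputs, weight, bias):
    """Apply a one-unit linear layer to each scalar input."""
    return [weight * x + bias for x in inputs]


# The same fixed batch is fed to the layer before and after an update.
inputs = [-1.0, 0.0, 1.0, 2.0]

# Hypothetical parameter values before and after one training step.
before = linear_layer(inputs, weight=1.0, bias=0.0)
after = linear_layer(inputs, weight=3.0, bias=1.0)

# Statistics of the activations that the next layer would receive.
mean_before = statistics.fmean(before)
std_before = statistics.pstdev(before)
mean_after = statistics.fmean(after)
std_after = statistics.pstdev(after)

print(f"before update: mean={mean_before:.3f}, std={std_before:.3f}")
print(f"after update:  mean={mean_after:.3f}, std={std_after:.3f}")
```

The mean moves from 0.5 to 2.5 and the standard deviation triples, even though the data did not change. The next layer has to adapt to this shift, and normalization layers are meant to dampen it.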

### Batch Normalization

- Reduces internal covariate shift (as a result, Sigmoid can become competitive with ReLU).
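- Has a regularizing effect: each sample is normalized with the statistics of a randomly composed mini-batch, which adds noise. This is also a drawback, because the output for a sample depends on the other samples in the batch, and the statistics become unreliable for small batch sizes.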
- Speeds up training and typically leads to faster convergence, often with higher accuracy.
- Allows for higher learning rates without compromising convergence.

\begin{equation} y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \end{equation}

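where \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) are the mean and variance calculated per feature dimension over the mini-batch, \(\epsilon\) is a small constant added for numerical stability, and \(\gamma\) and \(\beta\) are learnable scale and shift parameters.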
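The formula above can be sketched in plain Python as follows. This is a minimal illustration rather than a library implementation: it assumes a 2-D input of shape (batch, features) stored as nested lists, uses the biased variance, and covers only the training-mode computation (the running averages normally used at inference are left out). The function name `batch_norm` and the example values are hypothetical.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature (column) with statistics taken over the batch (rows)."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased variance
        for i in range(n):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            out[i][j] = x_hat * gamma[j] + beta[j]
    return out


# Three samples with two features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print(row)
```

With `gamma` set to one and `beta` set to zero, both columns end up with zero mean and approximately unit variance, regardless of their original scale. Note that the result for one sample changes if the other samples in the batch change.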

### Layer Normalization

- Normalizes the input along the feature dimension.
- Does not depend on the batch dimension and is thus able to do inference on a single sample.
- Typically performs worse than Batch Normalization on CNNs, but works well on RNNs.

\begin{equation} y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \end{equation}
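The equation has the same form as for Batch Normalization; the difference is that \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) are computed over the features of each individual sample. A minimal sketch, assuming a 2-D input of shape (batch, features) stored as nested lists and per-feature `gamma` and `beta` (the function name `layer_norm` and the values are hypothetical):

```python
import math


def layer_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each sample (row) with statistics taken over its own features."""
    out = []
    for row in batch:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)  # biased variance
        out.append([
            (v - mean) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(row, gamma, beta)
        ])
    return out


gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]

# The same sample, normalized alone and as part of a larger batch.
single = layer_norm([[1.0, 2.0, 3.0, 4.0]], gamma, beta)
pair = layer_norm([[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]], gamma, beta)

print(single[0])
print(pair[0])
```

Both printed rows are identical: the result for a sample does not depend on what else is in the batch, which is why a batch of size one works at inference time.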

### Group Normalization

- Input channels are split into some number of groups.
- The mean and standard deviation are computed separately for every group, within each sample.
- Uses statistics computed from the input data in both training and evaluation modes.

\begin{equation} y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \end{equation}
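Here \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) are computed per sample over all values that belong to one group of channels. A minimal sketch for a single sample, assuming the sample is stored as a list of channels (each a flat list of spatial values), that channels are grouped contiguously, and that `gamma` and `beta` are per-channel; the function name `group_norm` and the values are hypothetical:

```python
import math


def group_norm(sample, num_groups, gamma, beta, eps=1e-5):
    """Normalize one sample; statistics are pooled over each group of channels."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = num_channels // num_groups
    out = []
    for g in range(num_groups):
        channels = sample[g * per_group:(g + 1) * per_group]
        # Pool every value of every channel in this group.
        values = [v for channel in channels for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
        for offset, channel in enumerate(channels):
            c = g * per_group + offset  # index of the channel in the sample
            out.append([
                (v - mean) / math.sqrt(var + eps) * gamma[c] + beta[c]
                for v in channel
            ])
    return out


# Four channels with two spatial values each, split into two groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
result = group_norm(sample, num_groups=2, gamma=[1.0] * 4, beta=[0.0] * 4)
for channel in result:
    print(channel)
```

Because no batch statistics are involved, the same computation applies in training and evaluation modes. In this sketch, `num_groups=1` pools all channels of the sample together, while setting `num_groups` to the number of channels normalizes every channel on its own.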