Regularization

Short Summary: Regularization reduces variance and thus helps the model generalize. It can be especially useful in models that are naturally prone to overfitting, such as decision trees.

L1 Regularization, L2 Regularization and Elastic Net Regression

L1 Regularization, also known as Lasso Regression, adds a regularization term equal to the sum of the absolute values of all the feature weights. This term is multiplied by the parameter lambda. L1 Regularization introduces sparsity into the system because it shrinks every weight by a constant amount, pushing small weights to exactly zero. It is therefore used for both feature selection and regularization: feature selection comes from the zeroed-out weights, and regularization follows because fitting a smaller set of features reduces variance (adding features usually makes fitting the data easier or, at least, not harder).

\begin{equation} L(p, y) = f(p, y) + \lambda\sum_{i=1}^{n} |\theta_i| \end{equation}
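Below is a minimal NumPy sketch (not from the original text; the weights and hyperparameters are illustrative) of a single update step on the L1 penalty alone, written as a soft-thresholding update. It shows how shrinking every weight by the same constant amount drives small weights to exactly zero.

```python
import numpy as np

# Illustrative weights; lam and lr are assumed hyperparameters.
theta = np.array([2.0, 0.3, -0.05, -1.5])
lam, lr = 1.0, 0.1

# Soft-thresholding: every weight moves toward zero by the same constant amount
# (lr * lam); weights already smaller than that amount land exactly on zero.
theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)
print(theta)   # [ 1.9  0.2  0.  -1.4]
```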

L2 Regularization, also known as Ridge Regression, adds a regularization term equal to the sum of the squares of all the feature weights. This term is multiplied by the parameter lambda (the higher the value, the higher the chance to underfit; the lower the value, the higher the chance to overfit). L2 Regularization shrinks each weight in proportion to its magnitude (since the derivative of the penalty in backpropagation is \(2\theta\), which is subtracted from the weights). This produces the regularization effect, making it more difficult for the model to overfit. Additionally, L2 Regularization encourages weight values toward 0 (but not exactly 0) and the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.

\begin{equation} L(p, y) = f(p, y) + \lambda\sum_{i=1}^{n} \theta_i^2 \end{equation}
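For comparison, here is a minimal NumPy sketch of the corresponding L2 step on the same illustrative weights: since the penalty gradient is \(2\theta\), each weight shrinks in proportion to its own magnitude and never lands exactly on zero.

```python
import numpy as np

# Same illustrative weights and assumed hyperparameters as the L1 sketch above.
theta = np.array([2.0, 0.3, -0.05, -1.5])
lam, lr = 1.0, 0.1

# Gradient of lambda * sum(theta^2) is 2 * lambda * theta:
# large weights shrink a lot, small weights shrink a little, none become exactly zero.
theta = theta - lr * lam * 2 * theta
print(theta)   # [ 1.6   0.24 -0.04 -1.2 ]
```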

Elastic Net Regression combines L1 Regularization with L2 Regularization.

\begin{equation} L(p, y) = f(p, y) + \lambda_1\sum_{i=1}^{n} |\theta_i| + \lambda_2\sum_{i=1}^{n} \theta_i^2 \end{equation}
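A short scikit-learn sketch comparing the three penalties (the synthetic dataset and hyperparameters are illustrative, not from the original text): Lasso typically zeroes out many coefficients, Ridge shrinks them without zeroing, and Elastic Net sits in between.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data: only 5 of the 20 features are actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)                     # L1 only
ridge = Ridge(alpha=1.0).fit(X, y)                     # L2 only
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # L1 + L2

print((lasso.coef_ == 0).sum())   # many coefficients are exactly zero
print((ridge.coef_ == 0).sum())   # typically no coefficient is exactly zero
print((enet.coef_ == 0).sum())    # usually somewhere in between
```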

Weight Decay

Weight decay, while similar to L2 regularization, is not the same. The major difference is that L2 regularization modifies the gradients by adding \(\lambda W\) to them (so the penalty flows through the optimizer), whereas weight decay does not modify the gradients at all; instead, it subtracts \(\alpha \lambda W\) directly from the weights in the update step.

\begin{equation} W \leftarrow W - \alpha \cdot \nabla_W J(W) - \alpha \cdot \lambda \cdot W \end{equation}
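A small PyTorch sketch of this decoupled update (the model, data, and hyperparameters are illustrative): the loss contains no penalty term, and the decay is applied directly to the weights after the gradient step. This is the scheme that optimizers such as AdamW implement.

```python
import torch

# Illustrative weight vector and data; lr and wd are assumed hyperparameters.
w = torch.randn(10, requires_grad=True)
x, y = torch.randn(32, 10), torch.randn(32)
lr, wd = 0.1, 1e-2

loss = ((x @ w - y) ** 2).mean()   # plain loss: no L2 term is added here
loss.backward()

with torch.no_grad():
    w -= lr * w.grad               # gradient step
    w -= lr * wd * w               # decoupled weight decay: W <- W - alpha * lambda * W
    w.grad.zero_()
```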

Dropout

Dropout is a regularization technique that randomly sets neuron outputs to zero during training. This prevents the network from relying too heavily on any specific neuron, reducing overfitting. More specifically, on every forward pass dropout randomly eliminates some nodes in the network. This can be thought of as an approximation of bagging: since nodes are dropped at random, many different thinned versions of the network are trained, and the result behaves like an "average" of those networks. Randomly disabling nodes therefore has an effect similar to bagging and thus reduces variance and improves generalizability.
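A short PyTorch illustration (the drop probability and input are illustrative): in training mode nn.Dropout zeroes roughly a fraction p of the activations and rescales the survivors by 1/(1-p), while in evaluation mode it is a no-op.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries are zeroed; survivors become 2.0 (scaled by 1/(1-p))

drop.eval()
print(drop(x))   # identity: all entries stay 1.0
```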

Batch Normalization

This is a technique where the output of a layer is normalized, by subtracting the mini-batch mean and dividing by the mini-batch standard deviation, before it is passed to the next layer. This helps stabilize the training process and can also prevent overfitting. More specifically, because each mini-batch is chosen at random, the mean and standard deviation used for normalization vary from batch to batch, which introduces noise into the system. While this makes Batch Normalization a natural regularizer, it is also considered a drawback, since the normalization of any given sample is not deterministic: it depends on the randomly selected mini-batch it happens to be part of.
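A short PyTorch illustration (feature count and batch statistics are illustrative): in training mode nn.BatchNorm1d normalizes with the statistics of the current mini-batch, so the same sample can be normalized differently depending on which batch it lands in; in evaluation mode it switches to running statistics.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)                  # 4 features
batch = torch.randn(16, 4) * 5 + 3      # mini-batch with non-zero mean and non-unit std

bn.train()
out = bn(batch)                         # normalized with *this* batch's mean and std
print(out.mean(dim=0), out.std(dim=0))  # approximately 0 and 1 per feature

bn.eval()
out = bn(batch)                         # uses running statistics instead,
                                        # so the result no longer depends on the batch
```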

Tree Pruning

Pruning is a common regularization approach for tree-based models (e.g., Decision Tree, Random Forest, etc.). It reduces the size of trees by removing parts of the tree that provide little power to classify instances.
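A scikit-learn sketch of cost-complexity pruning (the dataset and ccp_alpha value are illustrative): the pruned tree ends up with far fewer leaves, usually at little or no cost in test accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)  # cost-complexity pruning

print(unpruned.get_n_leaves(), unpruned.score(X_te, y_te))
print(pruned.get_n_leaves(), pruned.score(X_te, y_te))   # fewer leaves, similar or better test accuracy
```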

Early Stopping

Early Stopping is a regularization technique where training is stopped once the model's performance on the validation dataset stops improving. This helps with both detecting and dealing with overfitting.
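A minimal scikit-learn sketch (model and hyperparameters are illustrative): MLPClassifier's built-in early_stopping option holds out a validation split and stops training once the validation score has not improved for n_iter_no_change consecutive epochs.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# early_stopping=True holds out validation_fraction of the data and stops training
# once the validation score has not improved for n_iter_no_change epochs.
clf = MLPClassifier(hidden_layer_sizes=(64,), early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=10,
                    max_iter=500, random_state=0).fit(X, y)
print(clf.n_iter_)   # number of epochs actually run before stopping
```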

More Training Data

It could be that the training data we have is biased, and thus more data needs to be added. This can be done by augmenting the existing data or through data collection efforts.
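One common way to augment image data, sketched with torchvision (the specific transforms and their parameters are illustrative): each training image is randomly flipped, cropped, and color-jittered, so the model sees a slightly different sample every epoch, which effectively enlarges the training set.

```python
from torchvision import transforms

# Illustrative image-augmentation pipeline for 32x32 images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# Applying `augment` to each training image during loading yields a slightly
# different tensor every time the image is seen.
```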