Short Summary: Gradient descent is a widely used optimization algorithm for efficiently training machine learning models. There are three primary categories of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.

### Definitions

• $$\theta$$ - Model parameters.
• $$J(\theta)$$ - Loss function.
• $$\bigtriangledown_\theta J(\theta)$$ - Gradient of the loss function w.r.t. the parameters $$\theta$$.
• $$\alpha$$ - Learning rate.
• $$n$$ - Mini-batch size.

### Batch Gradient Descent

• The entire dataset is used for updating the model parameters.
• Does not allow for online learning (i.e., on-the-fly updates with new samples).
• Slow and problematic if the entire dataset does not fit into memory.
• Converges to the global minimum for convex functions and to a local minimum for non-convex functions for "reasonable" learning rates.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta) \end{equation}
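As a minimal sketch, the update above on a toy linear-regression problem (the data, learning rate, and epoch count are illustrative, not prescriptive):

```python
import numpy as np

# Toy linear-regression data (illustrative): y = 2x exactly, so the
# optimal parameter is theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def batch_gradient_descent(X, y, alpha=0.05, epochs=200):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the MSE loss J(theta) over the ENTIRE dataset.
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        theta = theta - alpha * grad  # theta = theta - alpha * grad_theta J(theta)
    return theta

theta = batch_gradient_descent(X, y)
print(theta)  # converges close to [2.0]
```

Note that every epoch touches all of `X`, which is exactly why this variant struggles when the dataset does not fit in memory.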

### Stochastic Gradient Descent

• A single $$(x, y)$$ feature-label pair is used for updating the model parameters.
• Allows for online learning and is faster than Batch Gradient Descent (a single sample always fits into memory).
• Loss function fluctuates heavily with each iteration resulting in high variance updates.
• Might fail to converge exactly due to fluctuations, but has a chance of jumping out of a local minimum and converging to a better one.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i)}};{y^{(i)}}) \end{equation}
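A minimal sketch of the per-sample update, reusing the same toy problem (data and hyperparameters are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def sgd(X, y, alpha=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Visit samples in a fresh random order each epoch.
        for i in rng.permutation(len(X)):
            # Gradient on a single (x^(i), y^(i)) pair only -- this is
            # what makes each update cheap but high-variance.
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])
            theta = theta - alpha * grad
    return theta

theta = sgd(X, y)
```

On this noise-free toy problem SGD converges smoothly; on real data the per-sample gradients disagree with each other, producing the fluctuations described above.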

### Mini-Batch Gradient Descent

• A mini-batch (usually from $$n = 8$$ to $$n = 512$$) is used for updating the model parameters.
• Allows for online training and is fast as it leverages highly optimized tensor operations.
• Reduces the fluctuations of Stochastic Gradient Descent.
• Does not guarantee good convergence out-of-the-box; it is typically combined with additional techniques (momentum, adaptive learning rates) described below.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i:i+n)}};{y^{(i:i+n)}}) \end{equation}
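A minimal sketch of the mini-batch update on the same toy problem (batch size and hyperparameters are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def minibatch_gd(X, y, alpha=0.02, n=2, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), n):
            batch = idx[start:start + n]
            Xb, yb = X[batch], y[batch]
            # Gradient on the slice x^(i:i+n), y^(i:i+n) only; the
            # matrix product is where vectorized hardware pays off.
            grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)
            theta = theta - alpha * grad
    return theta

theta = minibatch_gd(X, y)
```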

#### Momentum

• Helps escape plateaus and ravines, where plain gradient descent tends to get stuck in local minima.
• Accelerates in the relevant direction and reduces oscillations in the others.
• The coefficient $$\beta$$ is in the $$(0, 1)$$ interval.

\begin{align} v_t &= {\beta \cdot v_{t-1}} + \alpha \cdot \bigtriangledown_\theta J(\theta)\\ \theta &= \theta - v_t \end{align}
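The velocity-based update above, sketched on the same toy problem (hyperparameters illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def momentum_gd(X, y, alpha=0.01, beta=0.9, epochs=300):
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)  # velocity accumulator
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        v = beta * v + alpha * grad  # v_t = beta * v_{t-1} + alpha * grad
        theta = theta - v
    return theta

theta = momentum_gd(X, y)
```

Because `v` is an exponentially decaying sum of past gradients, consistent gradient directions accelerate while directions that flip sign cancel out.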

#### Nesterov Accelerated Gradient

• Momentum can gain "too much speed" in one direction and overshoot past the minimum (along that dimension).
• Ideally, it should slow down before the slope changes, which requires "understanding" the path ahead.
• Calculates the gradient at an approximation of the parameters' next position, $$\theta - \beta \cdot v_{t-1}$$, instead of at the current parameters.
• This look-ahead point approximates the next position; it is missing only the gradient term of the full update.

\begin{align} v_t &= \beta \cdot v_{t-1} + \alpha \cdot \bigtriangledown_\theta J(\theta - {\beta \cdot v_{t-1}})\\ \theta &= \theta - v_t \end{align}
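The only change from plain momentum is where the gradient is evaluated; a sketch on the same toy problem (hyperparameters illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def nesterov_gd(X, y, alpha=0.01, beta=0.9, epochs=300):
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(epochs):
        lookahead = theta - beta * v  # approximate next position
        # Gradient evaluated at the look-ahead point, not at theta.
        grad = 2.0 / len(X) * X.T @ (X @ lookahead - y)
        v = beta * v + alpha * grad
        theta = theta - v
    return theta

theta = nesterov_gd(X, y)
```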

#### Adagrad

• All parameters share the same learning rate in the methods above (not ideal for sparse data).
• Uses a different learning rate for each parameter.
• Performs smaller updates for parameters tied to frequent features.
• Performs larger updates for parameters tied to infrequent features.
• Allows for better convergence.
• For the most part, eliminates the need for tuning the learning rate.
• $$G_t$$ is the sum of the squares of the past gradients; $$\epsilon$$ is a small smoothing term that avoids division by zero.

\begin{equation} \theta_i = \theta_i - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot \bigtriangledown_\theta J(\theta_i) \end{equation}
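A sketch of the per-parameter accumulator on the same toy problem (the base learning rate and epoch count are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def adagrad(X, y, alpha=0.5, epochs=500, eps=1e-8):
    theta = np.zeros(X.shape[1])
    G = np.zeros_like(theta)  # per-parameter sum of squared gradients
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        G += grad ** 2
        # Per-parameter step: the accumulated G scales the rate down
        # for parameters that have seen large gradients so far.
        theta = theta - alpha / np.sqrt(G + eps) * grad
    return theta

theta = adagrad(X, y)
```

Note the trade-off visible in the code: `G` only ever grows, so the effective learning rate shrinks monotonically over training.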

• Uses the $$L_{\infty}$$ norm.