Short Summary: Gradient descent is a widely used optimization algorithm for efficiently training machine learning models. There are three primary categories of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.

### Definitions

• $$\theta$$ - Model parameters.
• $$J(\theta)$$ - Loss function.
• $$\bigtriangledown_\theta J(\theta)$$ - Gradient of the loss function w.r.t. the parameters $$\theta$$.
• $$\alpha$$ - Learning rate.
• $$n$$ - Mini-batch size.

### Batch Gradient Descent

• The entire dataset is used for updating the model parameters.
• Does not allow for online learning (i.e., on-the-fly updates with new samples).
• Slow and problematic if the entire dataset does not fit into memory.
• Converges to the global minimum for convex functions and to a local minimum for non-convex functions for "reasonable" learning rates.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta) \end{equation}
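As a minimal sketch, the update above on a toy linear-regression problem (the data, learning rate, and epoch count are illustrative, not prescriptive):

```python
import numpy as np

# Toy linear-regression data (illustrative): y = 2x exactly, so the
# optimal parameter is theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def batch_gradient_descent(X, y, alpha=0.05, epochs=200):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the MSE loss J(theta) over the ENTIRE dataset.
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        theta = theta - alpha * grad  # theta = theta - alpha * grad_theta J(theta)
    return theta

theta = batch_gradient_descent(X, y)
print(theta)  # converges close to [2.0]
```

Note that every epoch touches all of `X`, which is exactly why this variant struggles when the dataset does not fit in memory.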

### Stochastic Gradient Descent

• A single $$(x, y)$$ feature-label pair is used for updating the model parameters.
• Allows for online learning and is faster than Batch Gradient Descent (a single sample always fits into memory).
• Loss function fluctuates heavily with each iteration resulting in high variance updates.
• Might fail to converge exactly due to fluctuations, but has a chance of jumping out of a local minimum and converging to a better one.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i)}};{y^{(i)}}) \end{equation}
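A minimal sketch of the per-sample update, reusing the same toy problem (data and hyperparameters are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def sgd(X, y, alpha=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Visit samples in a fresh random order each epoch.
        for i in rng.permutation(len(X)):
            # Gradient on a single (x^(i), y^(i)) pair only -- this is
            # what makes each update cheap but high-variance.
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])
            theta = theta - alpha * grad
    return theta

theta = sgd(X, y)
```

On this noise-free toy problem SGD converges smoothly; on real data the per-sample gradients disagree with each other, producing the fluctuations described above.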

### Mini-Batch Gradient Descent

• A mini-batch (usually from $$n = 8$$ to $$n = 512$$) is used for updating the model parameters.
• Allows for online training and is fast as it leverages highly optimized tensor operations.
• Reduces the fluctuations of Stochastic Gradient Descent.
• Does not guarantee good convergence out-of-the-box; it is typically combined with additional techniques (momentum, adaptive learning rates) described below.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i:i+n)}};{y^{(i:i+n)}}) \end{equation}
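A minimal sketch of the mini-batch update on the same toy problem (batch size and hyperparameters are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def minibatch_gd(X, y, alpha=0.02, n=2, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), n):
            batch = idx[start:start + n]
            Xb, yb = X[batch], y[batch]
            # Gradient on the slice x^(i:i+n), y^(i:i+n) only; the
            # matrix product is where vectorized hardware pays off.
            grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)
            theta = theta - alpha * grad
    return theta

theta = minibatch_gd(X, y)
```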

#### Momentum

• Helps escape plateaus and ravines, where plain gradient descent tends to get stuck in local minima.
• Accelerates in the relevant direction and reduces oscillations in the others.
• The coefficient $$\beta$$ is in the $$(0, 1)$$ interval.

\begin{align} v_t &= {\beta \cdot v_{t-1}} + \alpha \cdot \bigtriangledown_\theta J(\theta)\\ \theta &= \theta - v_t \end{align}
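The velocity-based update above, sketched on the same toy problem (hyperparameters illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def momentum_gd(X, y, alpha=0.01, beta=0.9, epochs=300):
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)  # velocity accumulator
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        v = beta * v + alpha * grad  # v_t = beta * v_{t-1} + alpha * grad
        theta = theta - v
    return theta

theta = momentum_gd(X, y)
```

Because `v` is an exponentially decaying sum of past gradients, consistent gradient directions accelerate while directions that flip sign cancel out.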

#### Nesterov Accelerated Gradient

• Momentum can gain "too much speed" in one direction and overshoot past the minimum (along that dimension).
• Ideally, it should slow down before the slope changes, which requires "understanding" the path ahead.
• Calculates the gradient at an approximation of the parameters' next position, $$\theta - \beta \cdot v_{t-1}$$, instead of at the current parameters.
• This look-ahead point approximates the next position; it is missing only the gradient term of the full update.

\begin{align} v_t &= \beta \cdot v_{t-1} + \alpha \cdot \bigtriangledown_\theta J(\theta - {\beta \cdot v_{t-1}})\\ \theta &= \theta - v_t \end{align}
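The only change from plain momentum is where the gradient is evaluated; a sketch on the same toy problem (hyperparameters illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def nesterov_gd(X, y, alpha=0.01, beta=0.9, epochs=300):
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(epochs):
        lookahead = theta - beta * v  # approximate next position
        # Gradient evaluated at the look-ahead point, not at theta.
        grad = 2.0 / len(X) * X.T @ (X @ lookahead - y)
        v = beta * v + alpha * grad
        theta = theta - v
    return theta

theta = nesterov_gd(X, y)
```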

#### Adagrad

• All parameters share the same learning rate in the methods above (not ideal for sparse data).
• Uses a different learning rate for each parameter.
• Performs smaller updates for parameters tied to frequent features.
• Performs larger updates for parameters tied to infrequent features.
• Allows for better convergence.
• For the most part, eliminates the need for tuning the learning rate.
• $$G_t$$ is the sum of the squares of the past gradients; $$\epsilon$$ is a small smoothing term that avoids division by zero.

\begin{equation} \theta_i = \theta_i - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot \bigtriangledown_\theta J(\theta_i) \end{equation}
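A sketch of the per-parameter accumulator on the same toy problem (the base learning rate and epoch count are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def adagrad(X, y, alpha=0.5, epochs=500, eps=1e-8):
    theta = np.zeros(X.shape[1])
    G = np.zeros_like(theta)  # per-parameter sum of squared gradients
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        G += grad ** 2
        # Per-parameter step: the accumulated G scales the rate down
        # for parameters that have seen large gradients so far.
        theta = theta - alpha / np.sqrt(G + eps) * grad
    return theta

theta = adagrad(X, y)
```

Note the trade-off visible in the code: `G` only ever grows, so the effective learning rate shrinks monotonically over training.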

• Uses the $$L_{\infty}$$ norm.