Short Summary: Gradient descent is a widely used optimization algorithm for efficiently training machine learning models. There are three primary categories of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.

### Definitions

• $$\theta$$ - Model parameters.
• $$J(\theta)$$ - Loss function.
• $$\bigtriangledown_\theta J(\theta)$$ - Gradient of the loss function w.r.t. the parameters $$\theta$$.
• $$\alpha$$ - Learning rate.
• $$n$$ - Mini-batch size.

### Batch Gradient Descent

• The entire dataset is used for updating the model parameters.
• Does not allow for online learning (i.e., on-the-fly updates with new samples).
• Slow and problematic if the entire dataset does not fit into memory.
• Converges to the global minimum for convex functions and to a local minimum for non-convex functions for "reasonable" learning rates.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta) \end{equation}
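As a minimal sketch, the update above on a toy linear-regression problem (the data, learning rate, and epoch count are illustrative, not prescriptive):

```python
import numpy as np

# Toy linear-regression data (illustrative): y = 2x exactly, so the
# optimal parameter is theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def batch_gradient_descent(X, y, alpha=0.05, epochs=200):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the MSE loss J(theta) over the ENTIRE dataset.
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        theta = theta - alpha * grad  # theta = theta - alpha * grad_theta J(theta)
    return theta

theta = batch_gradient_descent(X, y)
print(theta)  # converges close to [2.0]
```

Note that every epoch touches all of `X`, which is exactly why this variant struggles when the dataset does not fit in memory.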

### Stochastic Gradient Descent

• A single $$(x, y)$$ feature-label pair is used for updating the model parameters.
• Allows for online learning and is faster than Batch Gradient Descent (a single sample always fits into memory).
• Loss function fluctuates heavily with each iteration resulting in high variance updates.
• Might fail to converge exactly due to fluctuations, but has a chance of jumping out of a local minimum and converging to a better one.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i)}};{y^{(i)}}) \end{equation}
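A minimal sketch of the per-sample update, reusing the same toy problem (data and hyperparameters are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def sgd(X, y, alpha=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Visit samples in a fresh random order each epoch.
        for i in rng.permutation(len(X)):
            # Gradient on a single (x^(i), y^(i)) pair only -- this is
            # what makes each update cheap but high-variance.
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])
            theta = theta - alpha * grad
    return theta

theta = sgd(X, y)
```

On this noise-free toy problem SGD converges smoothly; on real data the per-sample gradients disagree with each other, producing the fluctuations described above.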

### Mini-Batch Gradient Descent

• A mini-batch (usually from $$n = 8$$ to $$n = 512$$) is used for updating the model parameters.
• Allows for online training and is fast as it leverages highly optimized tensor operations.
• Reduces the fluctuations of Stochastic Gradient Descent.
• Does not guarantee good convergence out-of-the-box; it is typically combined with additional techniques (momentum, adaptive learning rates) described below.

\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i:i+n)}};{y^{(i:i+n)}}) \end{equation}
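A minimal sketch of the mini-batch update on the same toy problem (batch size and hyperparameters are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def minibatch_gd(X, y, alpha=0.02, n=2, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), n):
            batch = idx[start:start + n]
            Xb, yb = X[batch], y[batch]
            # Gradient on the slice x^(i:i+n), y^(i:i+n) only; the
            # matrix product is where vectorized hardware pays off.
            grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)
            theta = theta - alpha * grad
    return theta

theta = minibatch_gd(X, y)
```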

#### Momentum

• Helps escape plateaus and ravines, where plain gradient descent tends to get stuck in local minima.
• Accelerates in the relevant direction and reduces oscillations in the others.
• The coefficient $$\beta$$ is in the $$(0, 1)$$ interval.

\begin{align} v_t &= {\beta \cdot v_{t-1}} + \alpha \cdot \bigtriangledown_\theta J(\theta)\\ \theta &= \theta - v_t \end{align}
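The velocity-based update above, sketched on the same toy problem (hyperparameters illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def momentum_gd(X, y, alpha=0.01, beta=0.9, epochs=300):
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)  # velocity accumulator
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        v = beta * v + alpha * grad  # v_t = beta * v_{t-1} + alpha * grad
        theta = theta - v
    return theta

theta = momentum_gd(X, y)
```

Because `v` is an exponentially decaying sum of past gradients, consistent gradient directions accelerate while directions that flip sign cancel out.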

#### Nesterov Accelerated Gradient

• Momentum can gain "too much speed" in one direction and overshoot past the minimum (along that dimension).
• Ideally, it should slow down before the slope changes, which requires "understanding" the path ahead.
• Calculates the gradient at an approximation of the parameters' next position, $$\theta - \beta \cdot v_{t-1}$$, instead of at the current parameters.
• This look-ahead point approximates the next position; it is missing only the gradient term of the full update.

\begin{align} v_t &= \beta \cdot v_{t-1} + \alpha \cdot \bigtriangledown_\theta J(\theta - {\beta \cdot v_{t-1}})\\ \theta &= \theta - v_t \end{align}
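The only change from plain momentum is where the gradient is evaluated; a sketch on the same toy problem (hyperparameters illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def nesterov_gd(X, y, alpha=0.01, beta=0.9, epochs=300):
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(epochs):
        lookahead = theta - beta * v  # approximate next position
        # Gradient evaluated at the look-ahead point, not at theta.
        grad = 2.0 / len(X) * X.T @ (X @ lookahead - y)
        v = beta * v + alpha * grad
        theta = theta - v
    return theta

theta = nesterov_gd(X, y)
```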

#### Adagrad

• All parameters share the same learning rate in the methods above (not ideal for sparse data).
• Uses a different learning rate for each parameter.
• Performs smaller updates for parameters tied to frequent features.
• Performs larger updates for parameters tied to infrequent features.
• Allows for better convergence.
• For the most part, eliminates the need for tuning the learning rate.
• $$G_t$$ is the sum of the squares of the past gradients; $$\epsilon$$ is a small smoothing term that avoids division by zero.

\begin{equation} \theta_i = \theta_i - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot \bigtriangledown_\theta J(\theta_i) \end{equation}
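A sketch of the per-parameter accumulator on the same toy problem (the base learning rate and epoch count are illustrative):

```python
import numpy as np

# Toy data (illustrative): y = 2x, optimum theta = [2.0].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def adagrad(X, y, alpha=0.5, epochs=500, eps=1e-8):
    theta = np.zeros(X.shape[1])
    G = np.zeros_like(theta)  # per-parameter sum of squared gradients
    for _ in range(epochs):
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        G += grad ** 2
        # Per-parameter step: the accumulated G scales the rate down
        # for parameters that have seen large gradients so far.
        theta = theta - alpha / np.sqrt(G + eps) * grad
    return theta

theta = adagrad(X, y)
```

Note the trade-off visible in the code: `G` only ever grows, so the effective learning rate shrinks monotonically over training.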

• Uses the $$L_{\infty}$$ norm.