Optimizers are variants of Gradient Descent

SGD with momentum, Nesterov Accelerated Gradient Descent, Adagrad, Adadelta and Adam

Chamanth mvs
8 min read · Feb 1, 2023

Optimizers are at the core of Deep Learning algorithms; in fact, an optimizer is to a Deep Learning algorithm what a heart is to the human body. Without an optimizer, no Deep Learning algorithm can learn.

All of these optimizers are enhanced versions of the Gradient Descent algorithm, so understanding how Gradient Descent works will help while following this article.

Let’s understand the difference between Standard Gradient Descent, Stochastic Gradient Descent, and Mini-batch Stochastic Gradient Descent with an example.

Example

Let’s say there is a dataset of 10,000 datapoints, and Gradient Descent is used to find the best weights through its update equation. Forward propagation → loss computation → backward propagation is the flow of any Deep Learning algorithm.

When all 10,000 datapoints are fed into the network at once, the loss is computed, and the weights are updated through backpropagation, it is called Standard Gradient Descent. When a single datapoint, picked at random, is fed into the network before the loss is computed and the weights are updated, it is called Stochastic Gradient Descent. When a batch of, say, 128 random datapoints is fed into the network before each loss computation and weight update, it is called Mini-batch Stochastic Gradient Descent.

If a batch of random datapoints is used in finding the best weights, it is called Mini-batch Stochastic Gradient Descent.

If only one datapoint, picked at random, is used in finding the best weights, it is called Stochastic Gradient Descent.

If all the datapoints in the dataset are used in finding the best weights, it is called Standard Gradient Descent.
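
A minimal sketch of how the three variants differ, assuming a simple mean-squared-error linear-regression loss for concreteness (the function names here are illustrative, not from the original post):

```python
import numpy as np

def compute_gradient(weights, X, y):
    # Gradient of the mean-squared-error loss on (X, y) w.r.t. weights.
    return 2 * X.T @ (X @ weights - y) / len(y)

def train(X, y, lr=0.01, batch_size=None, steps=1000):
    """batch_size=None -> Standard GD, 1 -> SGD, e.g. 128 -> mini-batch SGD."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        if batch_size is None:      # Standard GD: all datapoints per update
            Xb, yb = X, y
        else:                       # SGD / mini-batch SGD: random subset
            idx = rng.choice(len(y), size=batch_size, replace=False)
            Xb, yb = X[idx], y[idx]
        w -= lr * compute_gradient(w, Xb, yb)
    return w
```

The only thing that changes across the three variants is how much data feeds each weight update; the update equation itself is identical.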

The weights found by gradient descent are not exactly equal to, but similar to, the weights found by stochastic gradient descent. Similarly, the weights found by gradient descent are not exactly equal to, but similar to, the weights found by mini-batch gradient descent.

But why? Let’s understand it logically. Assume the task is to find the average height of women in India. Is it practically possible to measure every woman in the country? It isn’t, because India has a population of more than 139 crores, of whom at least 70 crores are women. So, to estimate the average height, a sample of a specific size is chosen, the average is calculated on that sample, and it is taken as an approximation of the average height of all the women in the country. It is an approximation, not an exact value. For instance, if the average height of the chosen sample is 154 cm, that doesn’t mean the average height of all the women in the country is exactly 154 cm. If, by some extraordinary means, you measured every woman and found the true average to be 153.2 cm, then 0.8 cm is the error, or noise, associated with the sample mean. But the sample mean tells us that the population mean lies somewhere around it, so it is a good approximation.

The same relationship holds between Standard Gradient Descent and Mini-batch Stochastic Gradient Descent: each mini-batch gradient is a noisy estimate of the full-batch gradient. SGD with momentum, NAG, AdaGrad, RMSProp, and Adam are variants of SGD that help denoise the mini-batch gradients, at least to some extent.

Why is mini-batch Stochastic Gradient Descent used the most?

Gradient Descent takes fewer steps along a smooth path (without any noise) to reach the loss function’s minimum, whereas SGD takes many steps (oscillations), some of them noisy, to reach convergence. But each step is much faster to compute for Stochastic Gradient Descent than for Gradient Descent, as SGD uses only one training example (in contrast to the whole dataset for GD).

Mini-Batch Gradient Descent takes fewer steps than SGD, and less noise is added to its path.

NOTE: All the concepts below are discussed with respect to mini-batch SGD, even where it is referred to simply as SGD.

SGD with momentum

Before stepping into this, let’s understand the Exponential Moving Average (EMA). When its recurrence is unrolled and the variables are replaced with actual datapoints, it becomes clear why it is also called an Exponentially Weighted Moving Average.

So, the quick question is: what is the exponential moving average?
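
The exponential moving average a_t of datapoints x_1, x_2, …, x_t can be written as the recurrence (this form is implied by the cases below):

a_t = γ · a_{t−1} + x_t,  with a_1 = x_1

where γ is a decay factor that controls how much of the past is retained.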

When gamma = 0 → only the datapoint at that particular time is considered (a_3 = x_3).

When gamma = 1 → it becomes an equally weighted sum, where all the datapoints have the same weight at time t = 3 (a_3 = x_3 + x_2 + x_1).

When gamma = 0.5 → the most recent datapoint at time t = 3, x_3, gets the highest importance (most weight), whereas x_1, the datapoint at time t = 1, gets the least importance of the three, because it has already been folded into the average at times t = 1 and t = 2 (a_3 = x_3 + 0.5·x_2 + 0.25·x_1). The same pattern holds for any gamma strictly between 0 and 1.

So, it can be generalized as
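
a_t = x_t + γ·x_{t−1} + γ²·x_{t−2} + … + γ^{t−1}·x_1 = Σ_{i=1}^{t} γ^{t−i} · x_i

Each datapoint’s weight decays exponentially with its age, which is exactly why this is called an exponentially weighted moving average.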

In the Gradient Descent article, the update equation is discussed in detail; it is used here to derive SGD with momentum.

To find W_t, we need to compute the gradient at step t using the weights at step t−1. So, the gradient of the loss with respect to W_{t−1} is denoted g_t, as in the update below.
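
With that notation, the plain (mini-batch) SGD update is:

W_t = W_{t−1} − η · g_t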

Using the exponential moving average concept with the update equation:
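
Following the EMA form above, the gradients take the place of the datapoints x_t (this is one common convention; some texts scale the gradient term differently):

m_t = γ · m_{t−1} + g_t
W_t = W_{t−1} − η · m_t

where m_t is the exponentially weighted moving average of the gradients up to step t.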

Understanding it with different cases
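
When γ = 0 → m_t = g_t, and the update reduces to plain SGD.

When γ is close to 1 → the update is dominated by the accumulated history of gradients: consistent directions are amplified, while oscillating components tend to cancel out. In practice, γ is typically set to around 0.9.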

So, I read SGD with momentum as

Adding the exponential moving average of gradients into SGD’s weight-update formula.

All the weights and gradients here are vectors (there are many weight matrices, depending on the number of layers in the network, and the gradients are computed with respect to those weights).

Vector representation of SGD with momentum
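
A minimal vectorized sketch of the update above (the parameter names are illustrative, not from the original post):

```python
import numpy as np

def sgd_momentum_step(w, m, grad, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update on the weight vector w.

    m is the exponentially weighted moving average of past gradients.
    """
    m = gamma * m + grad          # m_t = gamma * m_{t-1} + g_t
    w = w - lr * m                # W_t = W_{t-1} - eta * m_t
    return w, m

# Usage: carry m alongside w across iterations, starting from zeros.
w = np.zeros(5)
m = np.zeros_like(w)
for step in range(100):
    grad = np.random.randn(5)     # stand-in for a real mini-batch gradient
    w, m = sgd_momentum_step(w, m, grad)
```

The state m is the only addition over plain SGD; it is carried from one iteration to the next.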

What is the use of SGD with momentum?

To lessen the noise (or error) in SGD’s updates and to achieve faster convergence.

This is an excellent article on why momentum works.

Nesterov accelerated gradient (NAG)

This algorithm was proposed and developed by the Russian mathematician Yurii Nesterov. Nesterov momentum is a slightly modified form of the momentum update.

In this algorithm, we first take a step in the direction of the accumulated momentum, reaching a look-ahead point W′, and then compute the gradient at that point. The resulting W_t is obtained from the momentum term and the gradient computed at W′.
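
Written in the same notation as above, a standard form of Nesterov momentum is:

W′ = W_{t−1} − η · γ · m_{t−1}        (look-ahead using the momentum alone)
m_t = γ · m_{t−1} + g(W′)             (gradient evaluated at the look-ahead point)
W_t = W_{t−1} − η · m_t

The only difference from plain momentum is that the gradient is computed at W′ instead of at W_{t−1}, which lets the optimizer correct its course before overshooting.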

Adaptive gradients (Adagrad)

In SGD and SGD with momentum, η is kept constant. In Adagrad, η itself is enhanced (adapted per weight) to make the optimizer work more efficiently.

But what is the problem if η is kept constant? It might not have much effect when the features are dense, but when the features are sparse, it really has an impact. Since we tend to use mini-batch SGD, the effect is even stronger.

Why is there a problem with sparse features?

The Gradient Descent blog discussed Linear Regression as an optimization example; using it here: when a feature is sparse, its value is zero for most datapoints, so the gradient with respect to that feature’s weight is zero for most updates. With a single constant η, that weight gets updated rarely and slowly, while the weights of dense features race ahead.

Adagrad helps in fixing such problems.

How does Adagrad work?
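
In Adagrad, each weight’s learning rate shrinks according to how much squared gradient that weight has accumulated so far (all operations are element-wise, and ε is a small constant that avoids division by zero):

α_t = Σ_{i=1}^{t} g_i²                 (running sum of squared gradients, per weight)
η′_t = η / √(α_t + ε)                  (adaptive learning rate)
W_t = W_{t−1} − η′_t · g_t

Weights of sparse features accumulate a small α_t, so they keep a relatively large η′_t and still receive meaningful updates.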

Adagrad works well even when a few features are sparse, and in deep learning, since the network is deep, there are usually at least a few sparse features. In such cases, the algorithm doesn’t let the model fail, and manual tuning of the learning rate is no longer required to avoid oscillation problems. These are the disadvantages of the other algorithms that Adagrad fixes.

But as the number of iterations t increases, α_t keeps growing; once it becomes too large, η′ shrinks toward zero, which can create slower convergence. This is Adagrad’s disadvantage.
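
A minimal per-weight sketch of this behavior (variable names are illustrative):

```python
import numpy as np

def adagrad_step(w, alpha, grad, lr=0.1, eps=1e-8):
    """One Adagrad update; alpha is the per-weight sum of squared gradients."""
    alpha = alpha + grad ** 2                       # alpha_t = alpha_{t-1} + g_t^2
    w = w - (lr / np.sqrt(alpha + eps)) * grad      # eta' = eta / sqrt(alpha_t + eps)
    return w, alpha
```

Because `alpha` only ever grows, the effective step size `lr / sqrt(alpha + eps)` can only shrink; after enough iterations it becomes vanishingly small.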

Adadelta

The only disadvantage of Adagrad is slower convergence. Adadelta can help in fixing this issue.

The difference between Adagrad and Adadelta is that α_t is replaced with eda_t, an exponentially decaying average of squared gradients.
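
In the same notation as before, the decaying-average form described here (which is also the core idea behind RMSProp) would be:

eda_t = γ · eda_{t−1} + (1 − γ) · g_t²
η′_t = η / √(eda_t + ε)
W_t = W_{t−1} − η′_t · g_t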

The issue with Adagrad is that α_t can become too large, resulting in slower convergence, because the growth of α_t cannot be controlled. eda_t, on the other hand, can be controlled, which halts the growth of the denominator in η′.

Adadelta can be explained as taking an exponentially weighted average of squared gradients at each iteration instead of the sum of squared gradients (as in Adagrad).

The Adadelta algorithm has all the benefits of Adagrad while also resolving Adagrad’s slower-convergence issue.
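
A minimal sketch of the decaying-average update described above (names illustrative):

```python
import numpy as np

def adadelta_style_step(w, eda, grad, lr=0.1, gamma=0.95, eps=1e-8):
    """One update using a decaying average of squared gradients.

    Unlike Adagrad's ever-growing alpha, eda forgets old gradients,
    so the effective learning rate does not shrink toward zero.
    """
    eda = gamma * eda + (1 - gamma) * grad ** 2     # eda_t
    w = w - (lr / np.sqrt(eda + eps)) * grad        # eta'_t * g_t
    return w, eda
```

Swapping the accumulator from a running sum to a decaying average is the entire change relative to the Adagrad sketch earlier.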
