Types of Optimization Algorithms Used to Train Neural Networks

Rishi Kumar · Nerd For Tech · Jun 25, 2021

Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and learning rate, in order to reduce the losses.

Figure 1: Attaining the global minimum.

How the weights and learning rate of your neural network should be changed to reduce the losses is determined by the optimizer you use. Optimization algorithms or strategies are responsible for reducing the losses and providing the most accurate results possible.

We’ll learn about the different types of optimizers and their advantages and disadvantages:

Gradient Descent:

  • Gradient Descent is the most commonly used optimization algorithm; it is widely applied in Linear and Logistic Regression as well.
  • It’s also known as Vanilla Gradient Descent.
  • Gradient Descent uses the first-order derivative of the loss, backpropagated through the neural network, to minimize the loss by optimizing the weights (w) and biases (b).

algorithm: θ=θ−α⋅∇J(θ)
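To make the update rule concrete, here is a minimal NumPy sketch (not from the original article) that applies θ = θ − α⋅∇J(θ) to a simple least-squares loss; the function name, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=100):
    """Vanilla (batch) gradient descent on an assumed least-squares loss J(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)  # ∇J(θ) computed over the entire dataset
        theta -= lr * grad                     # θ = θ − α⋅∇J(θ)
    return theta
```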

Advantages:

  • Very easy to implement
  • Simple method to understand

Disadvantages:

  • It uses the entire dataset for every update, which requires a lot of computational power and makes training time-consuming.
  • It may get stuck in a local minimum before attaining the global minimum.

Stochastic Gradient Descent:

  • It updates the weights and bias using the loss computed for each row of the training dataset. If the data has 1,000 rows, it will perform 1,000 weight-and-bias updates in one cycle, whereas gradient descent uses the entire dataset for a single update; each SGD update therefore requires less computational power, but the overall run consumes more time.

θ = θ − α⋅∇J(θ; x(i), y(i)), where {x(i), y(i)} are the training examples
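As an illustrative sketch (assuming the same least-squares loss as above), the only change from batch gradient descent is that the gradient and the update are computed per training row:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    """Stochastic gradient descent: one parameter update per training row."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # visit rows in random order
            grad = (X[i] @ theta - y[i]) * X[i]  # ∇J(θ; x(i), y(i)) for one example
            theta -= lr * grad                   # θ = θ − α⋅∇J(θ; x(i), y(i))
    return theta
```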

Advantage:

  • It requires less computational power.

Disadvantages:

  • Since it performs an update for every row, it requires more time to converge to the global minimum.
  • It may get stuck in a local minimum.
  • Its path toward the global minimum is noisier.

Mini- Batch Gradient Descent:

  • It’s the best among the variations of gradient descent, improving on both SGD and standard gradient descent. The dataset is divided into batches, and the model parameters are updated after every batch.

θ=θ−α⋅∇J(θ; B(i)), where {B(i)} are the batches of training examples.
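A minimal sketch of the batched update, again on an assumed least-squares loss; the batch size of 32 is an illustrative choice, not something specified in the article.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=20, batch_size=32):
    """Mini-batch gradient descent: one parameter update per batch B(i)."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)                      # shuffle, then split into batches
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]               # indices of batch B(i)
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)  # ∇J(θ; B(i))
            theta -= lr * grad                              # θ = θ − α⋅∇J(θ; B(i))
    return theta
```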

Advantages:

  • It takes less time than SGD to attain the global minimum.
  • It requires a moderate amount of computational power.

Disadvantages:

  • Though it consumes less time than SGD, it still generally requires more time than batch gradient descent to optimize the parameters.
  • It still has more noise than batch gradient descent.

SGD with Momentum:

  • SGD with momentum is a method that helps accelerate the gradient vectors in the right direction, leading to faster convergence.
  • The main disadvantage of SGD and mini-batch SGD is noise: because the gradient is computed per example or per batch, the updates fluctuate, which is why we move to SGD with momentum.
  • It uses an exponentially weighted moving average of past gradients to smooth out this noise (see the sketch below).
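The article does not spell out the momentum formula, so the sketch below uses the standard form v(t) = γ⋅v(t−1) + α⋅∇J(θ), θ = θ − v(t), with γ = 0.9 as a typical (assumed) value:

```python
import numpy as np

def sgd_momentum(X, y, lr=0.01, gamma=0.9, epochs=10):
    """SGD with momentum: the velocity v is an exponentially weighted moving
    average of past gradients, which damps the per-example noise."""
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ theta - y[i]) * X[i]  # single-example gradient (least-squares assumed)
            v = gamma * v + lr * grad            # v(t) = γ⋅v(t−1) + α⋅∇J(θ)
            theta -= v                           # θ = θ − v(t)
    return theta
```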

Advantages:

  • Reduces the oscillations and high variance of the parameters.
  • It converges in less time than the gradient descent variants above.

All types of Gradient Descent have some challenges:

  1. Choosing an optimum value of the learning rate. If the learning rate is too small, gradient descent may take ages to converge; if it is too large, it may overshoot the minimum.
  2. They use a constant learning rate for all parameters, but there may be parameters we do not want to change at the same rate.
  3. They may get trapped in local minima.

AdaGrad ( Adaptive Gradient):

  • One of the main disadvantages of all the gradient descent algorithms above is that the learning rate is defined once and stays constant for every cycle.
  • The key idea of AdaGrad is to have an adaptive learning rate for each of the weights.
  • The learning rate for each weight decreases with the number of iterations.
g(t,i) = ∇J(θ(t,i)), the derivative of the loss function for parameter i at a given time t

θ(t+1,i) = θ(t,i) − (α / √(G(t,ii) + ε))⋅g(t,i), the parameter update for parameter i at time/iteration t, where G(t,ii) is the sum of the squares of the past gradients of parameter i
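A minimal sketch of these two steps (the least-squares loss and the hyperparameter values are assumptions):

```python
import numpy as np

def adagrad(X, y, lr=0.1, eps=1e-8, epochs=10):
    """AdaGrad: each weight gets its own learning rate, shrunk by the
    accumulated sum of its past squared gradients."""
    theta = np.zeros(X.shape[1])
    G = np.zeros_like(theta)                       # running sum of squared gradients, G(t,ii)
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ theta - y[i]) * X[i]    # g(t,i) for a single example
            G += grad ** 2                         # accumulate g(t,i)²
            theta -= lr / np.sqrt(G + eps) * grad  # per-parameter adaptive step
    return theta
```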

Advantages:

  • It adapts the learning rate at every iteration, so we don’t need to tune that parameter manually.
  • It works well on sparse data.

Disadvantages:

  • As the number of iterations becomes very large, the accumulated squared gradients grow and the learning rate decays to a very small number, which leads to slow convergence.
  • It is a computationally expensive process, since it involves extra calculations for every parameter.

Adadelta:

  • It is an extension of AdaGrad that removes AdaGrad’s decaying learning rate problem. Instead of accumulating all previous squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w; an exponentially decaying moving average is used rather than the sum of all past squared gradients.

E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)
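The sketch below uses this decaying average to scale each step. Note that this is the RMSprop-style form; full Adadelta additionally keeps a decaying average of past squared updates in place of the fixed learning rate α.

```python
import numpy as np

def rmsprop_style(X, y, lr=0.01, gamma=0.9, eps=1e-8, epochs=10):
    """Scale each step by the decaying average E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)."""
    theta = np.zeros(X.shape[1])
    Eg2 = np.zeros_like(theta)                          # E[g²], the windowed accumulation
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ theta - y[i]) * X[i]         # single-example gradient (assumed loss)
            Eg2 = gamma * Eg2 + (1 - gamma) * grad**2   # exponentially decaying average
            theta -= lr / np.sqrt(Eg2 + eps) * grad     # step no longer decays to zero
    return theta
```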

Advantage:

  • The learning rate no longer decays to zero, so training does not stop prematurely.

Disadvantage:

  • Computationally expensive.

Adadelta and RMSprop, which are similar to each other, worked well until Adam came into play.

Adam:

  • Adaptive Moment Estimation (Adam) works with estimates of the first and second moments of the gradients.
  • The intuition behind Adam is that we don’t want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search.
  • In addition to storing an exponentially decaying average of past squared gradients like Adadelta, Adam also keeps an exponentially decaying average of past gradients, M(t).
M(t) = β1⋅M(t−1) + (1−β1)⋅g(t)

V(t) = β2⋅V(t−1) + (1−β2)⋅g²(t)

M̂(t) = M(t) / (1−β1^t), V̂(t) = V(t) / (1−β2^t)

θ(t+1) = θ(t) − α⋅M̂(t) / (√V̂(t) + ε)
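A minimal sketch of these steps (the default β1, β2, and ε values follow the original Adam paper; the loss is again an assumed least-squares stand-in):

```python
import numpy as np

def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, epochs=10):
    """Adam: decaying averages of past gradients (M) and past squared gradients (V),
    with bias correction, following the update rules above."""
    theta = np.zeros(X.shape[1])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            t += 1
            grad = (X[i] @ theta - y[i]) * X[i]    # g(t)
            m = beta1 * m + (1 - beta1) * grad     # M(t), first moment
            v = beta2 * v + (1 - beta2) * grad**2  # V(t), second moment
            m_hat = m / (1 - beta1 ** t)           # bias-corrected M̂(t)
            v_hat = v / (1 - beta2 ** t)           # bias-corrected V̂(t)
            theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```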

Advantages:

  • Converges quickly, overcoming the slow-convergence disadvantage of AdaGrad.
  • Rectifies the vanishing learning rate and reduces the high variance of the updates.

Disadvantage:

  • Computationally costly.

Comparison between various optimizers:

Figure 2: Comparison of different optimization algorithms.

Conclusion:

→ Adam is generally the best optimizer. If you want to train a neural network in less time and more efficiently, Adam is the optimizer to choose.

→ It is the most widely used optimizer in practice.

I hope you liked the article and that it gave you a good intuition for the different behaviors of these optimization algorithms.
