No more confusion on Backpropagation

Backpropagation from scratch and handy

I always wanted to implement backpropagation on a simple network to understand how it works. To be honest, I struggled for a week to implement it from scratch, because I first needed to understand the flow of partial derivatives via the chain rule. I wanted to implement it on a very basic network so I could trace that flow.

I agree that we need not learn how to implement a backpropagation algorithm by hand, because we rarely code it in the real world; we instead rely on libraries like Keras, TensorFlow, or PyTorch, whichever is preferable, when solving deep learning problems. But understanding how backpropagation works helps in understanding the vanishing gradient problem (the reason for the success of the ReLU activation function) and exploding gradients (often observed in RNNs).

This is an excellent article on why you should understand backprop.

I believe the cutting-edge deep learning networks in use and in research today are all built on backpropagation. I have been using the Keras library to train neural networks, but only when I implemented backpropagation on a simple network did I realize what an engineering marvel the teams behind TensorFlow and Keras have created. I am sure that by the end of this article, you will also appreciate the effort the developers put into coding it.

NOTE: There will be plenty of handwritten notes and screenshots attached to make this easier to understand.

Why do we need backpropagation?

The goal of backpropagation is to minimize the loss function by adjusting the neural network's weights and biases.

To put it simply: to find the best weights for a model, so that the trained model performs with minimum error, without overfitting or underfitting. Underfitting is almost never the case with a deep learning model (as the name itself suggests, it is a deep one 😀).

Finding the best weights of a model is done with the help of the update equation (a part of the gradient descent algorithm). In fact, backpropagation is itself a part of gradient descent.

Activation function?

It can be thought of as a neuron in the brain, which decides what to send to the next neuron.

It is a function added to the artificial network to make the network learn complex patterns in the data. These functions are mathematical expressions that determine the output of a network.

Sigmoid, Tanh, and ReLU are a few examples of differentiable activation functions. Why should they be differentiable? The whole aim is to update weights, and if the function is not differentiable, we cannot compute the gradients needed to update them. That said, the activation functions mentioned above also have drawbacks, but those can be mitigated.
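These three activation functions and their derivatives, which the backpropagation steps below rely on, can be sketched in NumPy (the function names here are my own shorthand, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of sigmoid, expressed through sigmoid itself
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    # passes positives through, zeroes out negatives
    return np.maximum(0.0, z)

def relu_prime(z):
    # subgradient of ReLU: 1 where z > 0, 0 elsewhere
    return (z > 0).astype(float)
```

Note that `sigmoid_prime` peaks at 0.25 (when z = 0); this small maximum is exactly what drives the vanishing gradient discussion later in the article.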

Considering a network to understand backpropagation


This is a Multi-Layer Perceptron (MLP) created for a regression problem. It has 2 hidden layers, 1 output layer, and 3 inputs (which means the input data is 3-dimensional).

→ ReLU and Sigmoid are used as the activation functions for hidden layer-1 and hidden layer-2, respectively.

→ In the output layer, the identity function is used: f(z) = z (whatever the input to the function, it outputs the same value).

→ The identity function is also called the linear activation function.

→ It is typically used only in the output layer of a neural network model that solves a regression problem.

NOTE: In general, all neural networks (be it MLP, CNN, RNN, or LSTM) include bias terms, but for ease of calculation, bias terms are not considered anywhere in this article.

Step-1: Initialize weights

Initializing weights is one of the important aspects of training a network; there are well-studied weight initialization methods such as Xavier and He-normal. In this example, we initialize the weights randomly.

Weight initialization sequence from left to right

The above images make clear what input the network takes and what the actual output is. We are solving this for only one data point of the entire dataset.
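As a rough sketch, random initialization for a network like this could look as follows. The layer widths (2 units per hidden layer) and the 0.1 scale are assumptions for illustration only; the actual weight values used in this article live in the images above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only for reproducibility

# Assumed layer sizes for illustration: 3 inputs,
# two hidden layers of 2 units each, and 1 output unit.
# Each matrix maps one layer's activations to the next (no biases).
W1 = rng.standard_normal((2, 3)) * 0.1   # input -> hidden layer-1
W2 = rng.standard_normal((2, 2)) * 0.1   # hidden layer-1 -> hidden layer-2
W3 = rng.standard_normal((1, 2)) * 0.1   # hidden layer-2 -> output
```

Scaling the draws down (here by 0.1) keeps the initial pre-activations small, which matters for sigmoid layers since large inputs push them into flat, low-gradient regions.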

Step-2: Feed-forward propagation

We have been discussing backpropagation a lot, but in order to go backward we first need to step forward; only then can we take a corrective step back if the forward step was off. In other words, backpropagation can only be implemented after feed-forward propagation.

The feed-forward pass sends the input through the network, and the output it produces is used to calculate the loss.

Before applying the steps, let’s make our network diagram a bit more intuitive.

Forward propagation of hidden layer-1

Forward propagation of hidden layer-2

Forward propagation of the output layer

Forward propagation is done based on our input and weight matrices, and the network gave an output of 0.256 (the predicted output).
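The three forward steps illustrated above can be sketched as a single function. The weights and input below are made-up placeholders, not the values from the article's images, so the output will differ from 0.256:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    # hidden layer-1: ReLU activation (no bias terms, as in the article)
    a1 = relu(W1 @ x)
    # hidden layer-2: sigmoid activation
    a2 = sigmoid(W2 @ a1)
    # output layer: identity activation, f(z) = z
    return W3 @ a2

# Illustrative values only; the article's actual numbers are in its images.
x = np.array([1.0, 0.5, -0.5])
W1 = np.array([[0.2, 0.4, -0.1], [0.3, -0.2, 0.5]])
W2 = np.array([[0.1, 0.6], [-0.4, 0.2]])
W3 = np.array([[0.7, -0.3]])
y_pred = forward(x, W1, W2, W3)
```

The intermediate activations `a1` and `a2` are exactly the quantities the backward pass will need, which is why implementations cache them during the forward step.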

But our actual output should be 0.5, so we calculate the error using a loss function; in this example, we use the squared error. As mentioned above, this entire example is based on one data point of the dataset. If it were computed over all data points or over a batch, the loss function would be the mean squared error.

Step-3: Calculate the loss

Our goal is to minimize the loss as much as possible (driving the loss exactly to 0 is nearly impossible in any deep learning or machine learning problem).
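For the article's numbers (predicted output 0.256, actual output 0.5), the squared-error loss works out as follows:

```python
def squared_error(y_pred, y_true):
    # squared error for a single data point
    return (y_pred - y_true) ** 2

# the article's values: predicted 0.256 vs actual 0.5
loss = squared_error(0.256, 0.5)   # (0.256 - 0.5)^2 = 0.059536
```
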

Step-4: Backpropagation

In forward propagation, we computed forward through the network. In backpropagation (as the name suggests), we compute backward from here.

Solving for the Output layer to hidden layer-2

weight update sequence — from left to right

Weights are updated using the update equation. The learning rate is one of the parameters in that equation, and arguably the most important one: if it is not chosen appropriately, training can oscillate instead of converging. There are many techniques for choosing the learning rate, such as momentum-based, time-based, or exponential decay schedules.
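The update equation itself is a one-liner; `lr` here is the learning rate discussed above:

```python
def update(w, grad, lr=0.01):
    # gradient descent update: step against the gradient direction
    # w_new = w_old - learning_rate * dL/dw
    return w - lr * grad

# e.g. a weight of 1.0 with gradient 2.0 and lr 0.1 moves to 0.8
w_new = update(1.0, 2.0, lr=0.1)
```

The same rule is applied element-wise to every weight matrix in the network, using the gradient that backpropagation computed for that matrix.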

(left) update equation and (right) updated weights

Solving for the hidden layer-2 to hidden layer-1

sequence from top left to bottom right

Weights are updated using the update equation.

(left) update equation and (right) updated weights

Solving for the hidden layer-1 to the Input layer

sequence from top left to bottom right

Weights are updated using the update equation.

This is how backpropagation works (weights get updated throughout the network).

Add on — What if we consider a batch gradient descent for backpropagation?

Assume our dataset has 10,000 data points and we divide it into 100 batches, each consisting of 100 data points. Those 100 batches make up 1 epoch (one pass of all the training data through the network).

Feed-forward propagation: each batch is fed into the network and the loss is computed for that whole batch. That is, every data point in the batch passes through the network (as in the illustrated example above), and the network computes the loss function once all the data points in the batch have been fed forward. This is often called loss averaging in batch gradient descent.

Backpropagation: the gradient is computed only once, for the average loss (computed batch-wise), and these gradients are plugged into the update equation to get the updated weights.

How is this done? The total gradient is the sum of the gradients of the individual samples; that is, the gradient of a batch is the summation of the gradients of each data point in that batch.
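A minimal sketch of that combination step: the batch gradient is computed once from the stacked per-sample gradients (averaged here, to match the mean-squared-error loss):

```python
import numpy as np

def batch_gradient(per_sample_grads):
    # gradient of the averaged (MSE) loss = average of per-sample gradients
    return np.mean(per_sample_grads, axis=0)

# illustrative gradients for 3 samples and 2 weights
grads = np.array([[0.2, -0.4],
                  [0.6,  0.0],
                  [0.1,  0.4]])
g = batch_gradient(grads)   # one gradient vector for the whole batch
```

This single averaged gradient is then plugged into the update equation once per batch, rather than once per data point.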

Add on — What is the Vanishing gradient problem?

This is the biggest problem that troubled deep learning research for a decade.

This represents the generalized view of it.

vanishing gradient problem

Let’s understand it with the above-illustrated example.

Assume the activation function in hidden layer-1 is also sigmoid, like that of hidden layer-2. The gradient would then be very small, so the old weight and the updated weight end up almost identical. That shouldn't be the case, because we need updates that actually move the weights toward values that fit our model with minimum error.

Sequence (from left to right)

Because of the sigmoid and tanh activation functions, the gradients become very small, which doesn't help the model find the optimal W. This is called the vanishing gradient problem.

The above example has just 3 layers with very few activation units in each layer, but if there were 10 layers and more than 10 activation units in the network, training would be like an infinite loop in programming, since it would never converge to the optimal W.
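The shrinking can be seen numerically: the chain rule multiplies one sigmoid derivative per layer, and sigmoid's derivative never exceeds 0.25, so the product decays geometrically with depth (a toy illustration, not the article's exact network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Even at z = 0, where sigmoid's slope is largest (0.25),
# chaining one derivative per layer shrinks the gradient fast.
z = 0.0
for depth in (3, 10, 30):
    grad = sigmoid_prime(z) ** depth
    print(depth, grad)   # 0.25**depth: already below 1e-5 at depth 10
```

Away from z = 0 the per-layer factor is even smaller than 0.25, so real networks vanish faster than this best case.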

The exploding gradient problem occurs when the gradients become too large. Here, you can find a good explanation of it.

ReLU and its variants (Leaky ReLU, Noisy ReLU, etc.) helped in overcoming the vanishing and exploding gradient problems. These activation functions are described well here.

Add on — Overfitting exists but not underfitting.

The name Deep Learning suggests that networks are often trained with deeper layers to find the best model. But as the number of layers increases, the probability that our model overfits grows. For instance, if our dataset consists of 500 data points and we train it on a 20-layer network with more than 300 weights, the probability of overfitting is very high.

Overfitting means high variance: the model performs with high accuracy on the training set but poorly on the test or cross-validation set. I believe such models are good for nothing; we need to find a well-generalized model instead. A model is said to generalize when it performs well on both seen data (training data) and unseen data (test and cross-validation data).

Underfitting exists only when we train a network on irrelevant features or when the data is too complex for the model to learn. It is a rare scenario in deep learning.

Add on — Is Dropout a solution for Overfitting?

As discussed above, overfitting is often seen in neural networks because of their depth.

Dropout drops out a few units at random (from the input layer through the hidden layers), introducing randomization into the network to counter overfitting.

This is a standard neural network

It is an example of an MLP, as seen above. Now suppose Dropout is applied to this MLP.

Fig — Dropout-enabled network

A few units in the input layer are disabled, and similarly a few activation units in the hidden layers are disabled (no input to and no output from those units), but only for one iteration; in the next iteration, a different set of units will be disabled.

If we use Batch SGD for backpropagation, then

  1. A certain percentage (p) of the neurons and their connections will be randomly selected and retained in the network.
  2. Inactive neurons (activation units) and their associated weights will not be updated in that batch.

This holds true for Standard Gradient Descent and Stochastic Gradient Descent too.

“p” in Dropout denotes the probability of retaining a unit. It lies between 0 and 1 (inclusive). If p = 1, there is no dropout; the lower p is, the more units are dropped.

“1-p” is the dropout rate, and it also lies between 0 and 1 (inclusive). A dropout rate of 0.8 (i.e., 1-p = 0.8) means 80% of the input/activation units are inactive.

“p” is a hyper-parameter of Dropout that needs to be found using cross-validation techniques.
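A minimal sketch of a dropout mask with keep-probability p (a hypothetical helper written for illustration, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(activations, p):
    # p = probability of KEEPING a unit, so 1 - p is the dropout rate
    keep = rng.random(activations.shape) < p
    # dropped units output 0: no signal flows out of them this iteration
    return activations * keep

a = np.ones(10)
dropped = dropout_mask(a, p=0.2)   # on average only ~20% of units survive
```

Because `rng` is re-drawn on every call, each training iteration disables a different random subset of units, exactly as described above.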

What happens during Train time?

At train time

During training, the network runs forward and backward propagation with the dropped connections (weights) discarded, which means each weight is present in the network with probability “p”.

What happens during Test time?

At test time

During testing with a query data point, the weights are always present, BUT each weight should be multiplied by “p” when computing the target for the query data point.

Overfitting in neural networks usually happens when the network is too deep and has more trainable parameters than input data points. Dropout can help us in such scenarios, but that doesn't mean its scope is limited to overfitting.

Excluded while explaining the above concepts

The bias term was not considered. Also, the weight initialization scheme and the learning rate are hyper-parameters that should be tuned using cross-validation techniques.


