# No more confusion on Backpropagation

## Backpropagation from scratch, by hand

I always wanted to implement backpropagation on a simple network to understand how it works. To be honest, I struggled for a week to implement it from scratch, as I needed to understand how partial derivatives flow through the chain rule before stepping in. I wanted to implement it on a very basic network to understand the flow.

I agree that we need not learn **how to implement a backpropagation algorithm**, because we *never* code it by hand in the real world; we instead rely on libraries like Keras, TensorFlow, or PyTorch, whichever is preferable for the deep learning problem at hand. But understanding how backpropagation works will help you understand the *vanishing gradient problem* (the reason for the success of the **ReLU** activation function) and *exploding gradients* (often observed in RNNs).

This is an excellent article on why you should understand backprop.

I believe the cutting-edge deep learning networks in use and under research are all based on backpropagation. I have been using the **Keras** library to train neural networks. But when I implemented backpropagation on a simple network, I realized the teams who developed **TensorFlow** and **Keras** created an **engineering marvel**. I am sure that, by the end of this article, you will also appreciate the effort the developers put into coding it.

**NOTE:** A lot of handwritten notes and screenshots are attached to make this easier to understand.

## Why do we need backpropagation?

The goal of backpropagation is to **minimize the loss function** *by adjusting the neural network's* **weights** and **biases**.

To simplify: we want to find the best weights for a model, so that the trained model performs with minimum error, without overfitting or underfitting. Underfitting would rarely be the case with a deep learning model (as the name itself suggests, it is a deep one 😀).

Finding the best weights of a model is done with the help of an update equation (a part of the gradient descent algorithm). In fact, backpropagation is the step that supplies the gradients the update equation needs.

**Activation function?**

It can be thought of as a neuron in the brain deciding **what should be sent to the next neuron**.

It is a function added to the artificial network to make the network learn **complex patterns in the data**. These functions are *mathematical expressions* that determine the output of a network.

*Sigmoid*, *Tanh*, and *ReLU* are a few examples of differentiable activation functions. **Why should they be differentiable?** The aim itself is to update weights; if the function is not differentiable, *how do we update the weights?* That said, the activation functions mentioned above have drawbacks too, but those can be fixed.
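As a concrete reference, here is a minimal NumPy sketch of these three activations and their derivatives; the derivatives are what backpropagation actually consumes through the chain rule:

```python
import numpy as np

# The three activation functions mentioned above, with their derivatives.
# Backpropagation multiplies these derivatives along the chain rule, which
# is why each activation needs to be differentiable.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def tanh_deriv(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1.0 when z = 0

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    # ReLU is not differentiable at exactly 0; using 0 there is the
    # usual convention.
    return np.where(z > 0, 1.0, 0.0)
```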

**Considering a network to understand backpropagation**

This is a **Multi-Layered-Perceptron** network created for a **regression** problem. It has *2 hidden layers*, *1 output layer*, and *3 inputs* (which means the input data is *3-dimensional*).

→ Considering **ReLU** and **Sigmoid** as the activation functions for *hidden layer-1* and *hidden layer-2* respectively.

→ In the output layer, the **identity function** is used, i.e., f(z) = z (*whatever the input to the function, it outputs the same value*).

→ The identity function is also called the **linear activation function**.

→ It can **only be used in the output layer** of a neural network model that solves a **regression problem**.

**NOTE:** In general, all neural networks, be it MLP, CNN, RNN, or LSTM, contain bias terms, but for ease of calculation, bias terms are not considered anywhere in this article.

## Step-1: Initialize weights

Initialization of weights is one of the important aspects of training a network; there are optimal weight initialization methods such as **Xavier** and **He-Normal**. In this example, we are *randomly initializing the weights*.
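A minimal sketch of random initialization, assuming hypothetical hidden-layer sizes (the handwritten weight values from the notes are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable

# Hypothetical layer sizes for illustration:
# 3 inputs -> hidden layer-1 -> hidden layer-2 -> 1 output.
n_in, n_h1, n_h2, n_out = 3, 2, 2, 1

# Plain random initialization, scaled small; no bias terms, matching the
# simplification stated in the article.
W1 = rng.standard_normal((n_h1, n_in)) * 0.1   # hidden layer-1 weights
W2 = rng.standard_normal((n_h2, n_h1)) * 0.1   # hidden layer-2 weights
W3 = rng.standard_normal((n_out, n_h2)) * 0.1  # output layer weights
```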

It is clear from the above images what *input the network takes in* and what *the actual output* is. We are solving this **for only one data point in the entire dataset**.

## Step-2: Feed-forward propagation

We have been discussing backpropagation a lot, but in order to go back, we need to step ahead first; only then can we take a step back if the forward step is incorrect. In other words, **backpropagation can only be implemented after implementing feed-forward propagation.**

In the feed-forward step, the input is passed through the network, and the output received is used to calculate the loss.

Before applying the steps, let’s make our network diagram a bit more intuitive.

**Forward propagation of hidden layer-1**

**Forward propagation of hidden layer-2**

**Forward propagation of the output layer**

Forward propagation is done. Based on our input and weight matrices, the network gave output = **0.256** (the **predicted output**).
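The three forward-propagation stages above can be sketched in NumPy as follows. The weights passed in are placeholders, so this sketch does not reproduce the 0.256 from the handwritten notes; it only shows the flow:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, W2, W3):
    """One feed-forward pass through the 3-layer network (no bias terms)."""
    z1 = W1 @ x          # hidden layer-1 pre-activation
    a1 = relu(z1)        # ReLU activation
    z2 = W2 @ a1         # hidden layer-2 pre-activation
    a2 = sigmoid(z2)     # Sigmoid activation
    y_hat = W3 @ a2      # identity output: f(z) = z
    return z1, a1, z2, a2, y_hat
```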

But our **actual output** should be **0.5**. So we calculate the error using the loss function; in this example, we consider the **squared error** as the loss function. As mentioned above, this entire example is based on one data point in the dataset. If it were for all data points or a batch, then the loss function would be the **mean squared error**.

## Step-3: Calculate the loss

Our goal is to minimize the loss as much as possible (bringing the loss exactly to 0 is nearly impossible in any deep learning or machine learning problem).
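Using the article's values (predicted 0.256, actual 0.5), the squared-error loss and its derivative look like this. The 1/2 factor is a common convention (an assumption here, not stated in the article) that cancels the 2 when differentiating:

```python
def squared_error(y_hat, y):
    # 0.5 * (prediction - target)^2; the 1/2 cancels under differentiation
    return 0.5 * (y_hat - y) ** 2

def squared_error_deriv(y_hat, y):
    # dL/dy_hat, the starting point of backpropagation
    return y_hat - y
```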

## Step-4: Backpropagation

In forward propagation, we computed forward through the network. In backpropagation (as the name suggests), we compute backward from here.

**Solving for the Output layer to hidden layer-2**

The weights are updated using an update equation. The learning rate is one of the parameters in the update equation. It is also the most important one because, if it is not chosen appropriately, it might lead to an **oscillation problem**. There are many techniques to choose the learning rate, such as **momentum-based**, **time-based**, or **exponential** schedules.

**Solving for the hidden layer-2 to hidden layer-1**

Again, the weights are updated using the update equation.

**Solving for the hidden layer-1 to the Input layer**

Again, the weights are updated using the update equation.

This is how backpropagation works (the weights get updated throughout the network).
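Putting the three stages together, here is a minimal NumPy sketch of one full training step for this network. The layer sizes and learning rate are assumptions for illustration, not the article's exact handwritten numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    return np.where(z > 0, 1.0, 0.0)

def train_step(x, y, W1, W2, W3, lr=0.1):
    # Forward pass: cache pre-activations, the chain rule reuses them.
    z1 = W1 @ x;  a1 = relu(z1)
    z2 = W2 @ a1; a2 = sigmoid(z2)
    y_hat = W3 @ a2                          # identity output

    # Output layer -> hidden layer-2
    delta3 = y_hat - y                       # squared error + identity output
    dW3 = np.outer(delta3, a2)

    # Hidden layer-2 -> hidden layer-1
    delta2 = (W3.T @ delta3) * sigmoid_deriv(z2)
    dW2 = np.outer(delta2, a1)

    # Hidden layer-1 -> input layer
    delta1 = (W2.T @ delta2) * relu_deriv(z1)
    dW1 = np.outer(delta1, x)

    # Update equation: W <- W - lr * dL/dW
    return W1 - lr * dW1, W2 - lr * dW2, W3 - lr * dW3
```

Repeatedly calling `train_step` on the same data point drives the squared error down, which is exactly the weight-update loop described above.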

**Add on — What if we consider a batch gradient descent for backpropagation?**

Assume our dataset has 10,000 data points and we divide it into 100 batches, each batch consisting of 100 data points. So 100 batches make 1 epoch (*when all the training data has passed through the network once*).

**Feed-forward propagation**: each batch is sent into the network as **input**, and the **loss** is computed for that whole batch. Each data point in the batch is passed through the network (as in the example illustrated above), and *the network computes the loss function* once all the data points in that batch have been fed forward; this is often called **loss averaging** in *batch gradient descent*.

**Backpropagation**: the **gradient** is computed only once, for the average loss (which is computed batch-wise). These gradients are plugged into the *update equation* to get the updated weights.

*The **gradient** will be computed only once for the average loss (which is computed batch-wise)*. **How is this done?** The total gradient is the sum of the gradients of each sample in the batch; dividing that sum by the batch size gives the gradient of the averaged loss.
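This claim can be checked numerically on a toy linear model (the model and data below are assumptions for illustration): the gradient of the batch-averaged loss equals the average of the per-sample gradients.

```python
import numpy as np

# Toy linear model y_hat = w·x with squared-error loss, to illustrate that
# the gradient of the batch-averaged loss equals the average of the
# per-sample gradients.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))          # one batch of 100 data points
y = rng.standard_normal(100)               # targets
w = rng.standard_normal(3)                 # current weights

# Per-sample gradient of 0.5 * (w·x - t)^2 with respect to w is (w·x - t) * x
per_sample = [(w @ x - t) * x for x, t in zip(X, y)]
avg_of_grads = np.mean(per_sample, axis=0)

# Gradient computed directly from the batch-averaged loss
grad_of_avg = X.T @ (X @ w - y) / len(X)
```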

## Add on — What is the Vanishing gradient problem?

This is the biggest problem, which troubled deep learning research for a decade.

**This represents the generalized view of it.**

**Let’s understand it with the above-illustrated example**

Assume the *activation function* in hidden layer-1 is also **Sigmoid**, like that of hidden layer-2. Then the gradient would be very, very small, so the *old weights* and *updated weights* end up *too close* to each other, which shouldn't be the case, because we need updates large enough to move toward the weights that fit our model with minimum error.

Because of the *sigmoid* and *tanh* **activation functions**, the gradients become *very small*, which doesn't help the model find the **optimal W**. This is called the **vanishing gradient problem**.

The example above has just 3 layers with very few activation units in each layer, but *if there are 10 layers and more than 10 activation units in the network*, then training becomes like an *infinite loop in programming*, **as it never converges to the optimal W**.
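The shrinkage can be made concrete: the sigmoid derivative never exceeds 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. A hypothetical 10-layer sigmoid stack, in the *best* case, looks like this:

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Backpropagation multiplies one sigmoid-derivative factor per layer.
# Even at the derivative's maximum (0.25, reached at z = 0), ten layers
# shrink the gradient geometrically toward zero.
grad = 1.0
for _ in range(10):                # a hypothetical 10-layer sigmoid network
    grad *= sigmoid_deriv(0.0)     # 0.25, the largest value it can take
print(grad)                        # 0.25 ** 10 ≈ 9.5e-07
```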

The **exploding gradient problem** occurs **when the gradients are too large.** Here, you can find a good explanation of it.

**ReLU** and the *variants of ReLU* (**Leaky ReLU**, **Noisy ReLU**, etc.) helped in overcoming the vanishing gradient problem (exploding gradients are typically addressed separately, for example with gradient clipping). These activation functions are described well here.

## Add on — Overfitting exists but not underfitting.

The name Deep Learning suggests that networks are often trained with deeper layers to find the best model. But as the *number of layers increases*, there is a *higher probability that our model will overfit*. For instance, if our dataset consists of 500 data points and we train it on a 20-layer deep network with more than 300 weights, the probability of our model overfitting is very high.

Overfitting means high variance, which denotes that the model **performs with high accuracy on the training set but poorly on the test or cross-validation set**. I believe such models are good for nothing, and we need to find the **best generalized model**. A model is said to generalize well when it performs well on both seen data (training data) and unseen data (test and cross-validation data).

Underfitting exists only when we train a network on irrelevant features or when the data is too complex for the model to learn. It is a rare scenario in deep learning.

## Add on — Is Dropout a solution for Overfitting?

As discussed above, *overfitting* is often seen in neural networks because of their deeper layers.

Dropout is dropping out a few units at random (from the input layer through the hidden layers), letting the network introduce randomization to overcome overfitting.

This is one example of an MLP, as seen above. Suppose *Dropout* is applied to the above MLP.

A few units in the **input layer** are *disabled*. Similarly, a few **activation units** in the **hidden layers** are also *disabled* (no input to those units and no output from them), but *only for one iteration*; in the next iteration, a different set of units is disabled.

If we use Batch SGD for backpropagation, then

- A certain percentage (p) of the neurons and their connections is randomly selected and retained in the network.
- Inactive neurons (activation units) and their associated weights are not updated in that batch.

This holds true for Standard Gradient Descent and Stochastic Gradient Descent too.

*“p”* in **Dropout** denotes **the probability of retaining a unit**. “p” lies between 0 and 1 (inclusive). If “p” = 1, there is **no dropout**; if “p” is *low*, there are **more dropouts**.

*“1 - p”* is considered the **dropout rate**, and it also lies between 0 and 1 (inclusive). If the dropout rate is 0.8 (i.e., 1 - p = 0.8), then *80% of the input/activation units are inactive*.

“p” is a hyper-parameter of Dropout, which needs to be found using cross-validation techniques.

**What happens during Train time?**

During **training**, the network performs forward and backward propagation with the dropped connections (weights) **discarded**, which means *each weight is present in the network with probability “p”*.

**What happens during Test time?**

During **testing** with a query data point, **the weights are always present**, BUT *the weights should be multiplied by “p”* while finding the *target for the query data point*.
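Both behaviors can be sketched in a few lines; the activations and retention probability below are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                            # probability of RETAINING a unit
a = rng.standard_normal(6)         # some layer's activations (made-up)

# Train time: drop each unit independently with probability 1 - p.
mask = rng.random(a.shape) < p     # True for retained units
a_train = a * mask                 # dropped units output 0, so their
                                   # weights receive no update this iteration

# Test time: keep every unit, but scale by p so the expected
# activation matches what the network saw during training.
a_test = a * p
```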

*Overfitting* in *neural networks* usually happens when the **network** is **too deep** and when there are **more trainable parameters** than *input data points*. **Dropout** can help us in such scenarios, but that doesn't mean its scope is limited to overfitting.

## Excluded while explaining the above concepts

*The bias term was not considered. Also, the initialization of weights and the selection of the learning rate are hyper-parameters that need to be found using cross-validation techniques.*