Zero weight initialization on a layer with ReLU activation fails miserably

Why does zero weight initialization fail? What is the better way?

Chamanth mvs
Nov 27, 2022

The end goal of any machine learning or deep learning algorithm is to find the best weights, and that search depends on how the weights are initialized: initialization is the starting step of finding the weights.

What happens with Zero initialization?

Let's initialize a simple network with one hidden layer, with all of its weights set to zero.

Case 1: Using the ReLU activation function in Hidden layer-1.

Initialize all the weights to zero (the first layer still receives the actual inputs).

(Figure: Initialization of weights)

Forward propagation: As all weights are initialized to 0s, the input to every node in the next layer is the same (zero). The outputs of those nodes are also 0, because we are using the ReLU activation function and ReLU(0) = 0.

All the inputs (except for the first layer, which takes the actual inputs) will be 0, and all the outputs will be the same (0 for the ReLU activation unit, and likewise for the tanh activation unit).

(Figure: Forward propagation with ReLU)

Compute the loss: we use the squared error, (y − ŷ)², as the loss function.

Backward propagation and Updating weights

(Figure: Backpropagation and updating weights)

From the above, it is clear that no learning happens during backpropagation: every gradient in the network is zero, so zero initialization fails miserably with the ReLU activation unit.
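Here is a minimal sketch of that behaviour, using a toy 3-input, 4-hidden-node, 1-output network with hypothetical input values and shapes (not the exact network from the figures): with zero weights and a ReLU hidden layer, every gradient comes out as zero, so no weight can ever be updated.

```python
# Minimal sketch: zero-initialized weights + ReLU hidden layer -> all gradients are zero.
import numpy as np

x = np.array([[0.5, -1.2, 3.0]])          # one input sample, 3 features (hypothetical)
y = np.array([[1.0]])                      # target (hypothetical)

W1 = np.zeros((3, 4))                      # hidden-layer weights, all zero
W2 = np.zeros((4, 1))                      # output-layer weights, all zero

# Forward pass
z1 = x @ W1                                # all zeros
a1 = np.maximum(0, z1)                     # ReLU(0) = 0, so every activation is 0
y_hat = a1 @ W2                            # 0 as well
loss = 0.5 * np.sum((y - y_hat) ** 2)      # squared-error loss

# Backward pass
d_yhat = y_hat - y                         # dL/d(y_hat)
dW2 = a1.T @ d_yhat                        # a1 is all zeros -> dW2 is all zeros
d_a1 = d_yhat @ W2.T                       # W2 is all zeros -> d_a1 is all zeros
d_z1 = d_a1 * (z1 > 0)                     # ReLU'(0) treated as 0 here
dW1 = x.T @ d_z1                           # all zeros

print(dW1, dW2)                            # every gradient is 0 -> the weights never update
```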

Case 2: Using the Sigmoid activation function in Hidden layer-1.

Forward propagation: As all weights are initialized to 0s, the input to every node in the next layer is the same (zero). The outputs of those nodes are 0.5, because we are using the Sigmoid activation function and sigmoid(0) = 0.5.

(Figure: Forward propagation using Sigmoid)

All the inputs (except for the first layer, which takes the actual inputs) will be 0, and all the outputs will be the same (0.5 for the Sigmoid activation unit).

Backward propagation and Updating weights

(Figure: Backward propagation using the Sigmoid activation unit)

The weights in the deeper (hidden) layers receive no update at all, while the weights at the outermost (output) layer receive a slight update. And if the network is deep, the vanishing gradient problem combined with zero weight initialization makes the model even worse.
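Below is a minimal sketch of this case, using the same toy 3-input, 4-hidden-node, 1-output network with hypothetical values: the output-layer weights get a non-zero but identical gradient, while the hidden-layer weights get exactly zero gradient.

```python
# Minimal sketch: zero-initialized weights + Sigmoid hidden layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -1.2, 3.0]])           # hypothetical input sample
y = np.array([[1.0]])                       # hypothetical target

W1 = np.zeros((3, 4))
W2 = np.zeros((4, 1))

# Forward pass
z1 = x @ W1                                 # all zeros
a1 = sigmoid(z1)                            # sigmoid(0) = 0.5 for every node
y_hat = a1 @ W2                             # still 0, since W2 is zero
loss = 0.5 * np.sum((y - y_hat) ** 2)

# Backward pass
d_yhat = y_hat - y                          # non-zero
dW2 = a1.T @ d_yhat                         # non-zero, but identical for every entry
d_a1 = d_yhat @ W2.T                        # W2 is zero -> 0
d_z1 = d_a1 * a1 * (1 - a1)                 # 0
dW1 = x.T @ d_z1                            # hidden weights get exactly zero gradient

print(dW2.ravel())                          # same value repeated -> symmetry
print(dW1)                                  # all zeros -> no update for the hidden layer
```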

From both cases, it is clear that all the nodes end up with the same values, and even the same gradients, which is called the symmetry problem. Even if we have multiple nodes in a layer, it is the same as having just one node in the hidden layer, because the network ends up learning just one function in the case of Sigmoid activation and learning nothing in the case of ReLU activation.

When all initial values are identical, for example when every weight is initialized to 0, then during backpropagation all weights get the same gradient, and hence the same update. This is what is referred to as symmetry.

The same problem occurs with constant-value weight initialization.

Breaking symmetry: what and how? With zero weight initialization (and constant-value weight initialization), all the nodes in a layer end up with the same value, in both forward and backward propagation. That should not happen, because the network should learn a different feature at each node. Symmetry can be broken with random initialization: the gradients then differ across nodes, and each node is updated to become more distinct from the others, which is what enables diverse feature extraction. A small sketch of this follows below.
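This sketch reuses the same toy network shape with small random weights (the 0.01 scale and the random seed are assumptions chosen purely for illustration): each hidden node now receives a different gradient, so the nodes can specialize.

```python
# Minimal sketch: small random initialization breaks the symmetry.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -1.2, 3.0]])
y = np.array([[1.0]])

W1 = rng.normal(0.0, 0.01, size=(3, 4))    # small random values, not zeros
W2 = rng.normal(0.0, 0.01, size=(4, 1))

# Forward and backward pass (same maths as before)
z1 = x @ W1
a1 = sigmoid(z1)
y_hat = a1 @ W2
d_yhat = y_hat - y
dW2 = a1.T @ d_yhat
d_z1 = (d_yhat @ W2.T) * a1 * (1 - a1)
dW1 = x.T @ d_z1

print(dW1)   # a different gradient for every node -> symmetry is broken
```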

Understanding the problem in a logical way: assume a child is learning to play cricket, a sport that requires a lot of physical effort to learn and mental strength to play well. When he bowls a ball with his right hand, the weights associated with the neurons controlling it should obviously be higher than the weights associated with the left hand. But if all the motor neurons in his brain had equal weights, how could learning happen? He wants to become an expert right-arm fast bowler, but if the weights are distributed equally, he might be able to bowl with his right hand, yet he can never be trained to add variations to his bowling, which is good for nothing.

Is random initialization the solution?

Yes, random initialization can break the symmetry. But how random should it be?

What if the weights are initialized to large values?

With large negative values and the ReLU activation function, a similar problem to the one above arises and no learning happens, because the pre-activations become large negative numbers, ReLU outputs 0 for them, and its gradient is 0 as well, so those units are effectively dead. The sketch below illustrates this.
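A tiny sketch of this "dead ReLU" effect, with hypothetical inputs and an exaggerated weight value of -100 chosen purely for illustration:

```python
# Minimal sketch: large negative weights push ReLU units into the zero region.
import numpy as np

x = np.array([[0.5, 1.2, 3.0]])            # positive inputs (hypothetical)
W1 = np.full((3, 4), -100.0)               # large negative weights

z1 = x @ W1                                # strongly negative pre-activations
a1 = np.maximum(0, z1)                     # all zeros
relu_grad = (z1 > 0).astype(float)         # all zeros -> no gradient flows back

print(a1, relu_grad)                       # dead units: no output, no learning
```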

So, it clearly conveys that:

→ Weights should be small (but not too small)

→ Weights shouldn't all be zero (a few individual weights in the matrix may happen to be zero)

→ Weights should be picked with good variance

So, initializing the weights from a Gaussian (normal) distribution is one of the better ways of choosing them.
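For instance, a plain Gaussian initialization might look like the following sketch (the 0.01 scale and the layer sizes are assumptions for illustration, not values prescribed here):

```python
# Minimal sketch: small weights drawn from a Gaussian distribution.
import numpy as np

fan_in, fan_out = 256, 128                          # hypothetical layer sizes
W = np.random.randn(fan_in, fan_out) * 0.01         # values from N(0, 0.01^2)
```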

Researchers have run many experiments and found better weight initialization strategies.

Uniform Initialization

This weight initialization strategy works well for Sigmoid activation units.
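The original formula image is not reproduced here, but a commonly used form of this strategy draws the weights from a uniform range of plus or minus 1/sqrt(fan_in); a sketch under that assumption:

```python
# Minimal sketch: uniform initialization with limits +/- 1/sqrt(fan_in) (assumed form).
import numpy as np

fan_in, fan_out = 256, 128                          # hypothetical layer sizes
limit = 1.0 / np.sqrt(fan_in)
W = np.random.uniform(-limit, limit, size=(fan_in, fan_out))
```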

Xavier/Glorot Initialization

This weight initialization strategy also works well for Sigmoid activation units. There are two variants, both sketched below the list:

Initializing weights from a Normal distribution

Initializing weights from a Uniform distribution
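The formula images are not reproduced here; the standard Xavier/Glorot formulas are a normal distribution with standard deviation sqrt(2 / (fan_in + fan_out)) and a uniform distribution with limits plus or minus sqrt(6 / (fan_in + fan_out)). A sketch of both, with hypothetical layer sizes:

```python
# Minimal sketch: the two Xavier/Glorot variants.
import numpy as np

fan_in, fan_out = 256, 128                          # hypothetical layer sizes

# Xavier normal: std = sqrt(2 / (fan_in + fan_out))
std = np.sqrt(2.0 / (fan_in + fan_out))
W_normal = np.random.randn(fan_in, fan_out) * std

# Xavier uniform: limit = sqrt(6 / (fan_in + fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_uniform = np.random.uniform(-limit, limit, size=(fan_in, fan_out))
```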

He Initialization

This weight initialization strategy works well for ReLU activation units. There are two variants, both sketched below the list:

Initializing weights from a Normal distribution

Initializing weights from a Uniform distribution
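Likewise, the standard He formulas are a normal distribution with standard deviation sqrt(2 / fan_in) and a uniform distribution with limits plus or minus sqrt(6 / fan_in). A sketch of both, with the same hypothetical layer sizes:

```python
# Minimal sketch: the two He variants.
import numpy as np

fan_in, fan_out = 256, 128                          # hypothetical layer sizes

# He normal: std = sqrt(2 / fan_in)
std = np.sqrt(2.0 / fan_in)
W_normal = np.random.randn(fan_in, fan_out) * std

# He uniform: limit = sqrt(6 / fan_in)
limit = np.sqrt(6.0 / fan_in)
W_uniform = np.random.uniform(-limit, limit, size=(fan_in, fan_out))
```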

These are the popular weight initialization strategies, and as discussed above, weight initialization is the foundational step in building a neural network.
