Deep Learning — a foundation

Rishi Kumar
Published in Nerd For Tech · Jun 18, 2021

The whole idea behind deep learning is to have computers artificially mimic biological intelligence, so let's build a general understanding of how biological neurons work.

Figure 1: Biological Neuron
  • This is a biological neuron. The main thing to note here is that the dendrites connect to the nucleus (the cell body), and the axon carries the output away from the nucleus.

Figure 2: Nucleus
  • This illustration shows how the dendrites bring in the input signals, the nucleus performs the mathematical calculation on them, and the axon carries the output to the dendrites of another neuron.
  • This is the basic idea of the biological neuron model, and it is converted into a mathematical neuron, which is where the perceptron comes into play.
  • The perceptron, an early form of neural network, was introduced in 1958 by Frank Rosenblatt. He saw its huge potential at the time, stating that the ‘perceptron may eventually be able to learn, make decisions, and translate languages’. A great example today is Google Translate, which uses neural networks to translate languages.
  • In 1969, Marvin Minsky and Seymour Papert published their book Perceptrons. The major drawback they highlighted was computational power: in the 1970s compute was very limited, so it was tough to build multi-layer neural networks, and it took time for the field to become popular again.
  • This raises the next question: why do we need DL when we already have ML?
Figure 3: Comparison of ML and DL.
  • Machine learning performance saturates after a certain amount of data, whereas deep learning performance keeps improving as the amount of data grows.
  • The performance is similar for small training sets. This era is full of data, so DL plays a vital role in extracting insight from it.
  • Now we can translate the biological neuron into a mathematical unit.
Figure 4: Single Perceptron
  • This is a single neuron with two input features and one output.

Figure 5: Working of Single perceptron.
  • Each input feature is multiplied by a weight, and it is these weight values that we optimise to reduce the loss.
  • A bias is added to the weighted sum to shift the output. The optimised values of the weights and bias are found by gradient descent, and they can be either negative or positive.
  • The neuron performs the linear calculation followed by an activation function, and that output is passed as an input to the next neuron (see the sketch after Figure 6).
Figure 6: Generalised formula for Linear function.
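As a minimal sketch of that linear-plus-activation step (in NumPy, with made-up input, weight, and bias values), a single perceptron computes z = w·x + b and passes z through an activation function:

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Forward pass of a single perceptron: linear step, then activation."""
    z = np.dot(w, x) + b          # linear calculation: z = w*x + b
    a = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation squashes z into (0, 1)
    return a

# Two input features, as in Figure 4 (values are hypothetical)
x = np.array([0.5, -1.2])   # input features
w = np.array([0.8, 0.3])    # weights
b = 0.1                     # bias
print(perceptron_forward(x, w, b))
```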
  • A single neuron won’t be enough to learn the complicated systems. Fortunately, we can expand on the idea of a single perceptron, to create a multi-layer perceptron model.
  • A neural network is a type of machine learning architecture modelled after biological neural networks.
  • Deep learning is the process of training a neural network with more than one hidden layer.
Figure 7: Multi Layer Perceptron
  • To build a network of perceptrons, we connect layers of perceptrons into a multi-layer perceptron model.
  • The first layer is the input layer, which directly accepts the real data values.
  • The last layer is the output layer, and it can contain more than one neuron depending on the number of labels to predict.
  • Layers between the input and output layers are the hidden layers. Hidden layers are difficult to interpret, due to their high interconnectivity and their distance from the known input or output values. Neural networks become “deep neural networks” when they contain 2 or more hidden layers (see the Keras sketch after Figure 8).
Figure 8: Non-deep neural network vs deep neural network.
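As a minimal sketch of such a multi-layer model, assuming TensorFlow/Keras is available (the layer sizes and the binary output here are illustrative, not taken from the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(4,)),              # input layer: 4 real-valued features
    layers.Dense(8, activation="relu"),    # hidden layer 1
    layers.Dense(8, activation="relu"),    # hidden layer 2 (2+ hidden layers = "deep")
    layers.Dense(1, activation="sigmoid"), # output layer for binary classification
])
model.summary()
```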
  • We saw that the perceptron itself contains a very simple summation function f(x). For most use cases, however, that is not enough; we want to constrain the output values, especially in classification tasks.
  • It would be useful to have all outputs fall between 0 and 1. That's where activation functions come into play.

Activation Function:

Figure 9: Sigmoid Function
  • The sigmoid (logistic) function gives a value between 0 and 1, with a cut-off point at 0.5.
  • Outputs above 0.5 are treated as class 1 and outputs below 0.5 as class 0.
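A minimal NumPy sketch of the sigmoid and the 0.5 cut-off (the sample z values are made up):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
probs = sigmoid(z)                  # roughly [0.047, 0.5, 0.953]
labels = (probs > 0.5).astype(int)  # threshold at the 0.5 cut-off -> [0, 0, 1]
print(probs, labels)
```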
Figure 10: Tanh function
  • The hyperbolic tangent function, or tanh, gives an output between -1 and 1 instead of 0 to 1.
  • The major drawback of the tanh and sigmoid functions is that they saturate for large positive or negative inputs.
  • During backpropagation the derivative in those saturated regions becomes very small, which slows down gradient descent optimisation. To overcome this we have other activation functions such as ReLU and Leaky ReLU.
Figure 11: ReLU
  • The Rectified Linear Unit (ReLU) is a relatively simple function.
  • It gives the output

0 if the value z ≤ 0

z if the value z > 0

that is, ReLU(z) = max(0, z).

Figure 12: Leaky-ReLU

  • Leaky ReLU is similar to ReLU, except that for values less than or equal to zero it outputs 0.01*z instead of 0, which helps speed up optimisation.
  • ReLU and Leaky ReLU have been found to perform very well, especially against the vanishing-gradient issue, which occurs because the small derivatives of sigmoid and tanh slow down optimisation.
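A minimal NumPy sketch of the tanh, ReLU, and Leaky ReLU activations described above (the sample z values are made up):

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent: output in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """ReLU: 0 for z <= 0, z for z > 0."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: alpha*z for z <= 0, z for z > 0."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z), relu(z), leaky_relu(z))
```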

Cost Function:

  • Neural networks take in inputs, multiply them by weights, and add biases to them. This result is then passed through an activation function, which at the end of all the layers leads to some output.
  • We need to take the estimated outputs of the network and then compare them to the real values of the label.
  • The cost function must be an average of the per-example losses so that it outputs a single value; this value is computed for each pass of the data (one forward plus one backward propagation).

→ y to represent the true value

→ a to represent neuron’s prediction

→ w*x + b = z

→ Pass z into activation function σ(z) = a

  • We simply compare the real values y(x) against our predicted values a(x) by taking their difference (a short sketch follows Figure 13).
Figure 13: Quadratic Cost Function.
  • Squaring this difference does two useful things for us: it keeps everything positive and it punishes large errors!
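As a minimal sketch (with made-up labels and predictions), the quadratic cost averages the squared differences between y and a:

```python
import numpy as np

def quadratic_cost(y_true, a_pred):
    """Quadratic (mean squared error) cost: average of squared differences."""
    return np.mean((y_true - a_pred) ** 2)

y = np.array([1.0, 0.0, 1.0])   # true labels (made-up)
a = np.array([0.9, 0.2, 0.4])   # network predictions (made-up)
print(quadratic_cost(y, a))     # a single value summarising the error
```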
Figure 14: Multi-layer network with parameters.
  • In a real case, this means we have some cost function C that depends on lots of weights: C(w1, w2, w3, …, wn).
  • How do we figure out which weights lead us to the lowest cost?
  • Gradient descent is used to find the optimised values of the weights and bias that give a low cost.
  • The learning rate shown in our illustrations was constant (each step size was equal), but we can be clever and adapt the step size as we go along.
  • We could start with larger steps, then take smaller ones as the slope gets closer to zero. This is known as adaptive gradient descent (a simple gradient-descent sketch follows Figure 15).
Figure 15: Gradient Descent Performance.
  • Figure 15 compares Adam with other gradient descent optimisation algorithms.
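As a minimal sketch of plain gradient descent on a single weight (using a made-up cost C(w) = (w − 3)² and a constant learning rate), each step moves the weight against the gradient:

```python
def dC_dw(w):
    """Derivative of a made-up cost C(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

w = 0.0              # starting weight
learning_rate = 0.1  # constant step size
for step in range(50):
    w -= learning_rate * dC_dw(w)  # move against the gradient
print(w)             # approaches 3.0, the minimum of C
```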
  • For binary classification we use binary cross entropy.
Figure 16: Binary Cross Entropy.
  • For more than two classes we use categorical cross entropy (a sketch of both losses follows Figure 17).
Figure 17: Categorical Cross Entropy.
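A minimal NumPy sketch of both cross-entropy losses (labels and predictions are made up):

```python
import numpy as np

def binary_cross_entropy(y, a, eps=1e-12):
    """Binary cross entropy averaged over examples; eps avoids log(0)."""
    a = np.clip(a, eps, 1.0 - eps)
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def categorical_cross_entropy(y_onehot, a, eps=1e-12):
    """Categorical cross entropy for one-hot labels, averaged over examples."""
    a = np.clip(a, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(a), axis=1))

y = np.array([1.0, 0.0])           # binary labels
a = np.array([0.8, 0.3])           # predicted probabilities
print(binary_cross_entropy(y, a))

y_oh = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # one-hot labels
a_sm = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])  # softmax outputs
print(categorical_cross_entropy(y_oh, a_sm))
```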
  • Once we get our cost/loss value, how do we actually go back and adjust our weights and biases? This is backpropagation.

Intuition about Derivatives:

Figure 18: Derivative Calculation
  • When we calculate the derivative of a function, it tells us how a change in the input affects the output.
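A minimal sketch of that intuition, using a made-up function f(x) = 3x² and a finite-difference approximation of its derivative:

```python
def f(x):
    """A made-up function to illustrate the idea of a derivative."""
    return 3.0 * x ** 2

def numerical_derivative(f, x, h=1e-5):
    """Approximate df/dx by nudging the input and watching the output change."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

print(numerical_derivative(f, 2.0))  # about 12.0, matching df/dx = 6x at x = 2
```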

Backpropagation:

  • We want to know how the cost function changes with respect to the weights in the network, so that we can update the weights to minimise the cost function.
Figure 19: Steps in Backpropagation
Figure 20: Final Derivatives.

Computational graph for Logistic Regression.

Figure 21: Back Propagation Derivative
Figure 22: Final Derivative
  • Logistic regression first computes the loss, then backpropagates to move the parameters towards their optimised values.
Figure 23: Computational Graph for Neural Network.
  • Each neuron has its own weight and bias values.
  • The weights and biases are initialised randomly, which helps speed up optimisation (a small logistic-regression backpropagation sketch follows).
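A minimal sketch of this computational graph for logistic regression on one made-up example, assuming a sigmoid activation and binary cross-entropy loss (the simple dz = a − y derivative comes from combining those two):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])       # one training example with two features (made-up)
y = 1.0                         # its true label

w = np.random.randn(2) * 0.01   # small random initialisation of the weights
b = 0.0
lr = 0.1

for step in range(100):
    # Forward pass through the computational graph: z -> a -> loss
    z = np.dot(w, x) + b
    a = sigmoid(z)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))  # binary cross entropy

    # Backward pass: derivatives flow back through the graph
    dz = a - y       # dLoss/dz for sigmoid + cross entropy
    dw = dz * x      # dLoss/dw
    db = dz          # dLoss/db

    # Gradient-descent update of the parameters
    w -= lr * dw
    b -= lr * db
```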

Refer to the notebook below to see how to construct a basic ANN model:

https://github.com/Rishikumar04/Deep-Learning/blob/main/Basic%20ANN.ipynb

Thanks for reading. I hope you got a basic intuition about how neural networks work.
