Understanding Convolutional Neural Networks (CNNs)

Rishi Kumar
Published in Nerd For Tech
6 min read · Jul 3, 2021

The whole idea behind CNNs began with our brain. The human brain processes images easily: light passing through the retina is converted into electrical signals that travel to the primary visual cortex, which is packed with large, dense layers of cells. The cortex extracts various kinds of information from the image, such as edges and parts of objects; analogously, a CNN uses various filters to extract information from the input image.

Convolutional Neural Network vs Human Brain

Image Kernels/ Filters:

If you’ve ever used photo-editing software, you have probably seen filters, such as a blur filter. But how do they work?

Filters are essentially image kernels: small matrices applied across an entire image. Filters let us transform images by extracting information such as edges; in neural networks, multiple kernels are used so that different filters can identify different edges. A kernel is applied element-wise in a sliding-window fashion.

Filters

We slide the kernel over the input matrix (the image): each patch of the image is multiplied element-wise by the filter weights, and the sum of those products becomes one value of the output feature. Notice how the resolution decreases: for a 3 × 3 kernel, nine input values produce a single output value. Refer to this website to get an idea of how kernels work: deeplizard. These filters are referred to as convolution kernels, and the process of passing them over an image is known as convolution. The step by which the kernel moves is called the stride, and it is generally one; if the stride increases, the output size decreases. The prime use of convolution is to find features in the image with a feature detector and put them in a feature map, which still preserves the important information of the original image.

Stride
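The sliding-window process above can be sketched in a few lines of plain Python (the function name and example values are illustrative, not from any library):

```python
# Minimal sketch of 2D convolution ('valid' padding), sliding a kernel
# over an image, multiplying element-wise, and summing each patch.

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` in steps of `stride`; each patch
    is multiplied element-wise by the kernel and summed to one value."""
    n, f = len(image), len(kernel)
    output = []
    for i in range(0, n - f + 1, stride):
        row = []
        for j in range(0, n - f + 1, stride):
            # One output value per kernel position: 9 inputs -> 1 output for a 3x3 kernel.
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(f) for b in range(f))
            row.append(total)
        output.append(row)
    return output

# A 4x4 input and a 3x3 vertical-edge-style kernel: the output shrinks
# to 2x2, matching N - F + 1 = 4 - 3 + 1 = 2.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
print(convolve2d(image, kernel))  # [[-6, 12], [-4, 7]]
```

Increasing the stride shrinks the output further: `convolve2d(image, kernel, stride=2)` yields a single value.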

During convolution, we would lose information along the borders, so we can pad the image with extra values. The most common practice is zero padding, which lets us keep the information along the border: adding zeros around the image so the edges are captured properly is called zero padding.

→ The general formula for the output size of a convolution without padding is (N × N) * (F × F) = (N − F + 1) × (N − F + 1). This can be achieved by passing ‘valid’ for the padding parameter.

→ The formula for the output size with padding is (N + 2p − F + 1) × (N + 2p − F + 1). For a 3 × 3 filter, a padding size of one makes the output the same size as the input. This can be achieved by passing ‘same’ for the padding parameter.

→ To find the correct padding size, the formula p = (F − 1)/2 is used.

Without padding vs with padding.
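The three formulas above can be checked with a small sketch (function names are illustrative; N, F, and p follow the text’s symbols):

```python
# Output-size formulas for convolution, following the text:
#   without padding: N - F + 1
#   with padding:    N + 2p - F + 1
#   'same' padding:  p = (F - 1) / 2

def conv_output_size(n, f, p=0):
    """Output side length for an NxN input, FxF filter, padding p, stride 1."""
    return n + 2 * p - f + 1

def same_padding(f):
    """p = (F - 1) / 2 keeps the output the same size as the input (odd F)."""
    return (f - 1) // 2

# 'valid' (no padding): a 3x3 filter on a 6x6 image gives a 4x4 output.
print(conv_output_size(6, 3))                     # 6 - 3 + 1 = 4
# 'same': p = 1 for F = 3, so the 6x6 input stays 6x6.
print(conv_output_size(6, 3, p=same_padding(3)))  # 6 + 2 - 3 + 1 = 6
```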

Drawbacks in using ANN for images:

An ANN leads to a parameter explosion: it needs a huge number of parameters to train on images. An ANN also loses spatial information when the image is flattened, and it works well only for very similar images. An ANN effectively captures features only at fixed positions (such as the centre of the image), whereas a CNN captures features regardless of their position.

A CNN uses convolutional layers to help alleviate these issues. A convolutional layer is created when we apply multiple image filters to the input images; the layer is then trained to figure out the best filter weight values. A CNN also reduces parameters by focusing on local connectivity: in convolutional layers, not all neurons are fully connected. Instead, each neuron is connected only to a subset of local neurons in the next layer, and these local connections end up being the filters.
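A rough count makes the parameter explosion concrete (the hidden-layer size and filter count here are assumed for illustration, not prescribed by the text):

```python
# Comparing trainable parameters: a fully connected first layer on a
# flattened 1280x720 colour image vs a small convolutional layer.

height, width, channels = 1280, 720, 3
hidden_units = 128  # assumed size for the dense layer

# Dense layer: every input pixel connects to every hidden unit, plus biases.
dense_params = height * width * channels * hidden_units + hidden_units

# Conv layer: 32 filters of size 3x3x3, each with one bias; the same small
# set of weights is shared across every spatial position (local connectivity).
num_filters, f = 32, 3
conv_params = num_filters * (f * f * channels + 1)

print(dense_params)  # 353894528  (~354 million)
print(conv_params)   # 896
```

The convolutional layer needs roughly 400,000× fewer parameters here, because filter weights are shared across the whole image.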

Convolution focuses on local filters, where different filters begin to identify different parts of the image. Stacking filters together results in a convolutional layer. Colour images have intensity values for RGB and are represented as (1280, 720, 3) (height, width, colour channels). With colour images we end up with 3D filters, and convolutional layers are often fed into further convolutional layers; this allows the network to discover patterns within patterns, usually with more complexity in the later convolutional layers.

Pooling Layers:

Why do we need pooling?

Suppose we take images of a cheetah whose face is in a different position in each image. Pooling extracts the important features from the image, which helps the network identify them regardless of position.

Even with local connectivity, when dealing with colour images and possibly tens or hundreds of filters, we still have a large number of parameters. We can use pooling layers to reduce this. A pooling layer accepts a convolutional layer as input; its neurons have no weights or biases. A pooling layer simply applies an aggregation function to its inputs.

CNN Architecture

There are several types of pooling available, such as max pooling, average pooling, and sum pooling.

Max pooling takes the maximum value inside a box that moves over the matrix, typically with a filter size of (2 × 2) and a stride of 2.

Max Pooling.

You can see from the above image that after applying a pooling layer the important information is still preserved, while 16 elements are reduced to 4. This helps the neural network recognise features independent of their location (location invariance). Average pooling simply takes the average of the values in the box.

Pooling greatly reduces the number of parameters. A pooling layer does remove a lot of information: even a small pooling ‘kernel’ of (2 × 2) with a stride of 2 removes 75% of the input data. However, the general trends in the data survive the pooling layer, which yields a more generalised model and mitigates overfitting.
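Max pooling as described above can be sketched in plain Python (the function name and example feature map are illustrative):

```python
# Minimal sketch of 2x2 max pooling with stride 2: each non-overlapping
# window keeps only its maximum value.

def max_pool(feature_map, size=2, stride=2):
    n = len(feature_map)
    output = []
    for i in range(0, n - size + 1, stride):
        row = []
        for j in range(0, n - size + 1, stride):
            # Keep only the largest activation inside each window.
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        output.append(row)
    return output

fmap = [[1, 3, 2, 1],
        [4, 6, 5, 0],
        [7, 2, 9, 8],
        [3, 1, 4, 2]]
# 16 values reduce to 4 (75% discarded), but the strongest responses survive.
print(max_pool(fmap))  # [[6, 5], [7, 9]]
```

Swapping `max` for an average over the window would give average pooling instead.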

Flattening:

The pooled feature map is flattened into a column vector before being fed to the densely connected artificial neural network.
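Flattening is just a reshape; a tiny sketch with assumed values:

```python
# Flattening sketch: a 2x2 pooled feature map becomes a length-4 vector,
# ready to feed into the fully connected layers.
pooled = [[6, 5],
          [7, 9]]
flattened = [value for row in pooled for value in row]
print(flattened)  # [6, 5, 7, 9]
```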

CNN Architecture

The input image is processed by a convolutional layer, which comprises kernels plus an activation layer that makes the output non-linear; it then passes through a pooling layer to reduce the size of the image, and is flattened before being transferred to the fully connected layer.

CNNs can take many forms of architecture. There is no rule of thumb for designing one; it all depends on the error metrics and the use case.

Regardless of the CNN architecture, the final feature maps are flattened and fed to the fully connected layer to merge all the extracted features.

Each group of convolutions is followed by an activation layer, which also outputs an image. Successive outputs become smaller and smaller (due to the pooling layers) as well as deeper and deeper (due to the growing number of feature maps in the convolutional layers). This entire set of layers is fed into a regular feed-forward neural network, and finally into a softmax prediction layer.
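The conv → pool → flatten → dense → softmax stack described above can be sketched in Keras (this assumes TensorFlow is installed; the input size, filter counts, and layer widths are illustrative choices, not prescribed):

```python
# A minimal Keras sketch of the pipeline: outputs get spatially smaller
# (pooling) and deeper (more feature maps), then flatten -> dense -> softmax.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                    # small colour image
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),                          # 32x32 -> 16x16
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),                          # 16x16 -> 8x8
    layers.Flatten(),                                     # 8 * 8 * 64 = 4096
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),               # class probabilities
])
print(model.output_shape)  # (None, 10)
```

`padding='same'` and `padding='valid'` here correspond directly to the two output-size formulas discussed earlier.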

Refer to this website to understand how CNN architectures work: ryerson

I hope you now have a basic understanding of how CNNs work; refer to GitHub to build a CNN using Python.
