CNN Architectures

Have you ever wondered what CNN architectures are actually doing? How are the layers stacked, and how are such large architectures designed? Before moving on to a case study, we will look at some well-known CNN architectures and, to get a sense of what different neural networks learn, discuss several types of networks. We will cover computer vision model architectures and types of neural networks, and then look at how these are used in applications that improve our daily lives.

Model architectures
Many CNN architectures have been proposed. The research papers describe them in detail, and model zoos provide ready-made implementations. For example, Keras has a zoo of models where pretrained model weights can be found. Let’s study some of the main architectures: AlexNet, GoogLeNet (Inception), VGGNet, and ResNet.
What are the various popular model architectures available?

  1. AlexNet
  2. GoogLeNet
  3. VGGNet
  4. ResNet


AlexNet (2012)

  • AlexNet was created by the SuperVision group, which consisted of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever from the University of Toronto.
  • As the winner of the ImageNet challenge in 2012, AlexNet showed that deep learning could achieve much lower error rates than earlier approaches.


AlexNet Convolutional Neural Network


What is the architectural structure of AlexNet? 

  • A notable feature of AlexNet is overlapping pooling: the pooling window is larger than the stride, so neighboring windows overlap while the feature maps are still downsampled.
  • With five convolutional layers and three fully connected layers, and ReLU applied after every convolutional and fully connected layer except the final output layer, AlexNet achieved state-of-the-art results in image classification at the time.
  • It uses ReLU as its activation function, which speeds up training and improves accuracy. The regularization technique it uses is dropout.
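As a rough illustration of the overlapping pooling mentioned above, the following sketch pools a one-dimensional list in plain Python. The function names, the sample signal, and the window and stride values are assumptions chosen for illustration, not AlexNet's actual implementation. The final size check uses 55, the commonly quoted width of AlexNet's first feature map.

```python
def max_pool_1d(values, window, stride):
    """Max-pool a 1-D list; windows overlap whenever stride < window."""
    outputs = []
    # Slide the window over every position where it fits completely.
    for start in range(0, len(values) - window + 1, stride):
        outputs.append(max(values[start:start + window]))
    return outputs


def pooled_length(size, window, stride):
    """Output length of pooling without padding."""
    return (size - window) // stride + 1


# Hypothetical input signal, used only for illustration.
signal = [1, 5, 2, 8, 3, 7, 4, 6, 9]

# Window 2, stride 2: each value is seen by exactly one window.
non_overlapping = max_pool_1d(signal, window=2, stride=2)

# Window 3, stride 2: neighboring windows share one value.
overlapping = max_pool_1d(signal, window=3, stride=2)

print(non_overlapping)
print(overlapping)
print(pooled_length(55, window=3, stride=2))
```

Both settings roughly halve the length of the signal. The overlapping version simply lets each output look at a slightly wider neighborhood.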

How is GoogLeNet / Inception (2014) designed?

  • GoogLeNet is also known as the Inception model because of the inception module at the core of the architecture (the name is a reference to the movie Inception). It won the ImageNet challenge in 2014.
  • The name GoogLeNet is a nod to LeNet, an early CNN. The architecture uses 1×1 convolutions in the middle of the network to reduce the number of channels before the more expensive convolutions. It uses global average pooling in place of large fully connected layers.
  • Later versions of Inception use techniques like batch normalization, image distortions, and RMSprop to improve accuracy.
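To see why 1×1 convolutions and global average pooling help keep a network small, here is a minimal sketch in plain Python. The channel counts (192, 16, 32) and the tiny feature maps are assumptions picked for illustration, and biases are ignored. It is a back-of-the-envelope calculation, not the GoogLeNet implementation.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def global_average_pool(feature_maps):
    """Collapse each 2-D feature map to a single number: its mean."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Hypothetical channel counts, chosen only for illustration.
in_ch, reduced_ch, out_ch = 192, 16, 32

# A 5x5 convolution applied directly to all input channels.
direct = conv_weights(5, in_ch, out_ch)

# A 1x1 convolution first shrinks the channels, then the 5x5 runs on fewer.
bottleneck = conv_weights(1, in_ch, reduced_ch) + conv_weights(5, reduced_ch, out_ch)

# Two hypothetical 2x2 feature maps reduced to one value each.
maps = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
pooled = global_average_pool(maps)

print(direct, bottleneck)
print(pooled)
```

Under these assumed sizes, the bottleneck version needs roughly a tenth of the weights. Global average pooling adds no weights at all, which is why it is attractive as a replacement for large fully connected layers.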

VGGNet (2014)

  • VGGNet was created by the Visual Geometry Group (VGG) at the University of Oxford. It was the runner-up in the ImageNet 2014 challenge.
  • It is mostly used as a feature extractor. With its many filters, it acts as a base model for Single Shot Detectors (SSD) used for object detection.


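ResNet (2015)

  • ResNet won ILSVRC 2015. It was among the first models reported to surpass human-level accuracy on this benchmark. With 152 layers, it was also the deepest network in the competition at the time.
  • The novelty of the model is that it introduces skip connections, which add the input of a block directly to its output. It also makes heavy use of batch normalization.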
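The effect of a skip connection on gradients can be sketched with a toy calculation. The example below assumes every block behaves like a simple scaling with the same local slope, which is a strong simplification. The function name, the slope of 0.1, and the depth of 20 are all hypothetical. It only shows the chain-rule arithmetic and is not how ResNet is implemented.

```python
def input_gradient(depth, local_slope, use_skip):
    """Chain rule through `depth` blocks, each with slope `local_slope`.

    A plain block computes y = f(x), so its slope is f'(x).
    A residual block computes y = x + f(x), so its slope is 1 + f'(x).
    """
    grad = 1.0
    for _ in range(depth):
        block_slope = local_slope + (1.0 if use_skip else 0.0)
        grad *= block_slope
    return grad


# Hypothetical values: 20 blocks, each with a small local slope of 0.1.
plain = input_gradient(depth=20, local_slope=0.1, use_skip=False)
residual = input_gradient(depth=20, local_slope=0.1, use_skip=True)

print(plain)     # shrinks toward zero as depth grows
print(residual)  # stays at a usable magnitude
```

In the plain stack the gradient is a product of many small numbers. With the skip connection, each factor is at least 1 in this toy setting, so the signal reaching the early layers does not vanish.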

What is the advantage of ResNet?

  • ResNet greatly reduces the vanishing gradient problem. Vanishing gradients occur when the gradient shrinks as it is backpropagated through the large number of layers in a deep model, so the weight updates in the early layers become too small for those layers to learn. Skip connections give the gradient a more direct path back to the early layers.

Different types of Neural Networks

Now that we have looked at various model architectures, we suggest going through the original papers to get a better understanding of them. Let us move on to the different types of neural networks. We will cover recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), and then look at the applications.

What are the various types of neural networks?

  1. Convolutional neural networks
  2. Recurrent neural networks
  3. Long short-term memory networks (LSTMs)
  4. Gated recurrent units (GRUs)

Why use Recurrent neural networks (RNN)?

  • CNNs are not designed to model sequential information, where the order of the data points matters. The solution is a network that models sequential patterns.
  • RNNs address this problem by introducing a feedback element: the state computed for the previous data point in a series is fed back in together with the next input.
  • This lets the network learn the correlation between the current data point and the previous data point.
  • By applying the recurrent unit over many steps, we can learn the correlation between many data points in sequential data.
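The feedback idea described above can be sketched with a scalar recurrent unit in plain Python. The weights `w_x`, `w_h`, and `b`, the tanh activation, and the sample sequences are assumptions for illustration. A real RNN uses weight matrices and learns them from data.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: mix the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Run the same unit over a sequence, feeding each state back in."""
    h = 0.0  # initial state
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# Two hypothetical sequences that differ only in their first element.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])

print(states_a)
print(states_b)
```

Even though the last two inputs are identical in both sequences, the final states differ, because the first input is carried forward through the feedback connection.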

What are the disadvantages of RNNs?

  • Although this should work in theory, RNNs suffer from the vanishing gradient problem: as the number of recurrent steps increases, information from the early parts of the sequence is often lost. LSTMs were proposed as a solution to this.

Long short-term memory (LSTM)
Long short-term memory (LSTM) networks can retain information over longer periods, and thus they work well at capturing long-term dependencies.
LSTMs have several gates. What are they? 

  1. Forget gate
  2. Input gate
  3. Output gate

How do LSTM neurons work?
The forget gate decides how much of the information in the previous cell state is kept. The input gate decides how much new information from the current input is written into the cell state. The output gate decides how much of the cell state is passed on to the next step. By selectively forgetting and keeping patterns in this way, an LSTM can keep information over a longer period.
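The three gates can be sketched with a scalar LSTM step in plain Python. The equations follow the commonly used LSTM formulation. The parameter names, the `make_params` helper, and the bias values are hypothetical and chosen only to make the gate behavior visible. Real LSTMs use learned weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; `p` holds hypothetical weights and biases."""
    forget = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])  # keep old state?
    write = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])   # admit new info?
    expose = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])  # reveal state?
    candidate = math.tanh(p["w_g"] * x_t + p["u_g"] * h_prev + p["b_g"])
    c_t = forget * c_prev + write * candidate  # updated cell state
    h_t = expose * math.tanh(c_t)              # output passed to the next step
    return h_t, c_t


def make_params(b_f, b_i):
    """Hypothetical parameters: all weights zero, only two gate biases set."""
    names = ("w_f", "u_f", "w_i", "u_i", "w_o", "u_o", "w_g", "u_g", "b_o", "b_g")
    params = {name: 0.0 for name in names}
    params["b_f"] = b_f
    params["b_i"] = b_i
    return params


# Forget gate wide open: the previous cell state is preserved.
keep = make_params(b_f=10.0, b_i=-10.0)
# Forget gate closed: the previous cell state is discarded.
drop = make_params(b_f=-10.0, b_i=-10.0)

h_keep, c_keep = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, p=keep)
h_drop, c_drop = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, p=drop)

print(c_keep, c_drop)
```

With the forget gate open, the cell state passes through almost unchanged, which is how an LSTM can carry information across many steps. With it closed, the old state is wiped.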
Let’s move on to the most exciting aspect of deep learning: its applications. The applications of key interest to us here are classification, detection, localization, segmentation, and image captioning. Computer vision has enabled smarter homes, smarter supermarkets, and smarter shopping, and is turning the smartphone into a major platform, helped by smaller and more efficient models that can run on the phone itself.
Where do we get data for commercial or real-time applications?
We can capture data for these applications from several sources:

  1. CCTV footage
  2. Drones
  3. Phones
  4. Sensors

Computer vision applications

These capturing devices act as sensors that capture information. The information is either sent to the model via the cloud or a wireless network, or processed on a local processor. Let’s look at the applications in greater detail.

Can you name some applications of computer vision?

  1. Classification: Image classification identifies which class an image belongs to. The result is more useful if it also reports the confidence with which the image was assigned to that class. In the next tutorial, we will study an image classification model implemented on the Food-101 dataset. There are many applications, and essentially our imagination is the limit. An offbeat example is identifying whether a bag left unattended in a public space is harmful or not. We can also use image classification models for image retrieval, which search engines rely on heavily: given a keyword, identify the set of most relevant images and then retrieve similar images to show on the search page.
  2. Object Detection/Localization: Classification assigns a label to an image but does not say which part of the image belongs to the class. Object detection finds where the object of interest is in an image and marks it with a bounding box. This is useful in applications that need to know the relative distances and orientations of objects. For example, in self-driving cars, knowing where pedestrians and other objects are helps make safer decisions. A common metric for measuring the performance of object detection algorithms is IoU (Intersection over Union).
  3. Segmentation: Detection deals with generating bounding boxes. Segmentation approaches the problem differently by classifying every pixel. This gives information about finer details such as the boundaries of objects, and the finer the information, the better the output. It is useful for processing medical images and satellite imagery.
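The IoU metric mentioned under object detection can be written in a few lines. This is a minimal sketch that assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples. The function name and the sample boxes are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes.
predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)

print(iou(predicted, ground_truth))
```

An IoU of 1 means the predicted box matches the ground truth exactly, and 0 means the boxes do not overlap at all. Detection benchmarks typically count a prediction as correct when its IoU exceeds a chosen threshold.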


There are many more applications like similarity learning, image captioning, generative models, video analysis, etc. 

  1. Similarity learning: Similarity learning is about learning how similar images are to each other. The similarity can be based on context, semantic understanding, or just the number of patterns that overlap. So the next time you search for images on Google, similarity learning is working in the background to help you find similar images quickly.
  2. Image captioning: Image captioning is used to generate captions for an image. It generates captions with the help of object classification models and LSTMs. The LSTMs help model the sequential information in the caption associated with the image. It is an application that deals with both computer vision and natural language processing. One use of image captioning is to describe the relative positioning of one subject with respect to another in a given image. Here we show an example from DenseCap.
  3. Generative models: Generative models are networks trained to generate images based on what they have learned. For instance, if the training dataset consists of faces, the generative model learns the features that make up a face and can then generate new faces. An interesting example is the generation of celebrity faces. One practical use is in training deep learning models: since these models are data-hungry and gathering huge amounts of data is tedious, generating data according to our needs can be a better option.
  4. Video analysis: Suppose we enter a shopping mall and, based on our activity in the mall, a personalized assistant recommends products that might interest us. Video analysis deals with delivering real-time results and finding patterns and conclusions that enhance the application. For example, the analysis of sports events like cricket and football can be automated. The positions of players, real-time statistics, and the best approaches to defeating the strongest team are a few examples.
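For the similarity learning item above, one common way to compare images is to compare their feature vectors (embeddings) with cosine similarity. The sketch below assumes the embeddings have already been produced by some model. The three-dimensional vectors and the image names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, gallery):
    """Return the name of the gallery item whose embedding is closest."""
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))


# Hypothetical embeddings; in practice a trained network would produce these.
gallery = {
    "cat_a": [0.9, 0.1, 0.8],
    "car": [0.0, 1.0, 0.0],
    "cat_b": [0.2, 0.9, 0.1],
}
query = [1.0, 0.0, 1.0]

best = most_similar(query, gallery)
print(best)
```

Image retrieval systems of this kind rank every gallery image by its similarity to the query and return the top results.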

In this article, we have covered a lot of topics, including model architectures, types of neural networks, and applications in the domain of computer vision. The major industries that will be impacted by advances in this field are manufacturing, automobiles, health care, and agriculture. Efficient implementations of these algorithms and better applications are needed to solve the most challenging problems at hand. Most of the resources are available online for free, so we encourage you to build applications and products that truly bring about a change in the world.


