Deep Learning in Computer Vision
Welcome to the second article in the computer vision series. This article introduces the basics of deep learning for computer vision. To build a thorough understanding of the topic, it approaches each concept from a logical, visual and theoretical angle. Deep learning, the most talked-about field of machine learning, is what drives computer vision, which has numerous real-world applications and is poised to disrupt industries.
Deep learning is a subset of machine learning that deals with large neural network architectures. We will discuss the basic concepts of deep learning and the types of neural networks and architectures, along with a case study, in this article.
Our journey into deep learning begins with the simplest computational unit, called the perceptron.
What is a Perceptron?
A perceptron, also known as an artificial neuron, is a computational node that takes many inputs and performs a weighted summation to produce an output. A simple perceptron is a linear mapping between the input and the output.
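The weighted summation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the input values and the optional bias term are assumptions made for the example and do not come from the article.

```python
def perceptron(inputs, weights, bias=0.0):
    """Return the weighted sum of the inputs, plus an optional bias.

    The bias term is an assumption added for illustration; the text
    above only describes the weighted summation.
    """
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical inputs and weights, chosen only to show the arithmetic:
# 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 = 4.5
output = perceptron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
print(output)
```

Because the output is just a weighted sum, doubling every input doubles the output, which is exactly the linearity discussed next.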
Several neurons stacked together result in a neural network. A training operation, discussed later in this article, is used to find the “right” set of weights for the neural network. Because a simple perceptron is linear, the range of functions it can model is limited: not all relationships in the real world are linear, so a purely linear model cannot capture them. Stacking linear perceptrons does not help either, since a composition of linear mappings is still linear. The next logical step is to add non-linearity to the perceptron. We achieve this through the use of activation functions.
Activation functions are mathematical functions applied to the output of a perceptron. Many of them, such as sigmoid and tanh, also limit the range of its output values.
Why do we need non-linear activation functions?
Non-linearity is achieved through the use of activation functions, many of which limit or squash the range of values a neuron can express. The activation function decides whether, and how strongly, the perceptron fires. The training process includes two passes over the data, one forward and one backward. Activation functions help in modelling non-linearities and, because they are differentiable, allow errors to be propagated efficiently through the network by the back-propagation algorithm.
Examples of activation functions
For instance, tanh limits the range of values a perceptron can take to (-1, 1), whereas a sigmoid function limits it to (0, 1). Usually, activation functions are continuous and differentiable over their entire domain. Apart from these, there are also piecewise-defined activation functions such as ReLU, which is continuous but not differentiable at zero.
Various types of Activation Functions:
- Sigmoid: Sigmoid is a smoothed step function and is thus differentiable. It is useful in binary classification and in situations where a value needs to be converted to a probability. It limits the output of a perceptron to the range (0, 1), which isn’t symmetric about zero.
- tanh: The hyperbolic tangent function, also called the tanh function, limits the output to the range (-1, 1), so symmetry about zero is preserved. An important point to note here is that this symmetry is a desirable property when gradients are propagated to update the weights. You can find the graph for the same below.
- The Rectified Linear Unit (ReLU): ReLU passes a positive input through unchanged (y = x) and maps a negative input to 0. We therefore define it as max(0, x), where x is the output of the perceptron.
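The three activation functions listed above can be written directly from their definitions. The sketch below uses only Python's standard math module; the sample inputs are arbitrary values picked for illustration.

```python
import math


def sigmoid(x):
    """Squash x into the open interval (0, 1). Fine for moderate x;
    very large negative x would overflow math.exp in this naive form."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash x into the open interval (-1, 1), symmetric about zero."""
    return math.tanh(x)


def relu(x):
    """Pass positive values through unchanged, map negatives to 0."""
    return max(0.0, x)


# Arbitrary sample inputs, only to compare the three functions.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```

Note how sigmoid(0) is 0.5 while tanh(0) is 0, which is the asymmetry and symmetry mentioned in the list above.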
Artificial Neural Network (ANN)
As mentioned earlier, ANNs are perceptrons and activation functions stacked together. The perceptrons are connected to form hidden layers, which provide the non-linear basis for the mapping between the input and the output. The number of hidden layers, and the number of neurons in them, determines how complex a mapping the network can represent: the more layers, the more complex the function it can model. The ANN learns this function through training. A particular arrangement of stacked neurons is known as an architecture. We shall cover a few architectures in the next article. The model learns from the data through the forward pass and the backward pass, as mentioned earlier.
What happens during the forward pass?
- Before training starts, the weights are initialized randomly. During the forward pass, the neural network computes a predicted output for an input, and a loss function measures the error between the actual output and the predicted output.
- The loss function signifies how far the predicted output is from the actual output. Once the forward pass has been calculated, the network is ready for the backward pass.
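A forward pass can be sketched with a tiny network in plain Python. Everything specific here is an assumption made for illustration: the layer sizes (2 inputs, 3 hidden tanh units, 1 linear output), the squared-error loss, the uniform initialization range and the sample input and target.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable


def init_layer(n_in, n_out):
    """Randomly initialize one row of weights per output neuron."""
    return [[random.uniform(-1.0, 1.0) for _ in range(n_in)]
            for _ in range(n_out)]


def layer_forward(inputs, weights, activation):
    """Weighted sum per neuron, followed by the activation function."""
    return [activation(sum(x * w for x, w in zip(inputs, row)))
            for row in weights]


def forward(inputs, hidden_w, output_w):
    """Forward pass: input -> hidden layer (tanh) -> single linear output."""
    hidden = layer_forward(inputs, hidden_w, math.tanh)
    return layer_forward(hidden, output_w, lambda z: z)[0]


def squared_error(predicted, actual):
    """Loss: how far the prediction is from the actual output."""
    return (predicted - actual) ** 2


hidden_w = init_layer(2, 3)   # hypothetical sizes
output_w = init_layer(3, 1)
prediction = forward([0.5, -1.0], hidden_w, output_w)
loss = squared_error(prediction, actual=1.0)
print(prediction, loss)
```

With random weights the prediction is essentially arbitrary, which is why the loss is needed: it gives the backward pass something to reduce.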
What happens during the backward pass?
- The backward pass aims to move the weights towards a minimum of the loss function, ideally the global minimum, to minimize the error. The objective here is to minimize the difference between the reality and the modelled reality.
- The error calculated in the forward pass is back-propagated through the network to find out how much each weight contributed to it.
- Thus we update all the weights in the network such that this difference is smaller during the next forward pass. Repeating this trial-and-error process many times leads to accurate learning specific to a dataset.
These are the Various Concepts related to Neural Networks
- Softmax: The softmax function helps in defining outputs from a probabilistic perspective. Let’s say we have a ternary classifier which classifies an image into the classes rat, cat, and dog. The final layer of the neural network will have three nodes, one for each class. If the raw outputs turn out to be something like 0.001, 0.01 and 0.02, we cannot say with much confidence that the image is that of a dog. If instead we normalize the outputs so that they sum to 1, we get a probabilistic interpretation of the results. Softmax converts the outputs to probabilities by exponentiating each output and dividing it by the sum of the exponentials of all the outputs.
- Cross-entropy: Cross-entropy measures the distance between the outputs of softmax and the one-hot encoding of the actual class. It is used as the loss function, which models the error between the predicted and actual outputs. With the help of the softmax function, the network outputs the probability of the input belonging to each class, and the probability of the right class needs to be maximized. We define cross-entropy as the negative sum, over all classes, of the actual label multiplied by the logarithm of the predicted probability. Working with logarithms of probabilities also helps numerical stability.
- Regularization: When a student learns only what is in the notes, it is rote learning. Rote learning is of little use, because it is memory, not intelligence, that determines the output. If the input comes from a source other than the training set (the notes, in this analogy), the student will fail. Hence, we need to ensure that the model is not over-fitted to the training data and is capable of recognizing unseen images from the test set. This is achieved with the help of various regularization techniques. These techniques have evolved over time as newer concepts were introduced. For example, Dropout is a relatively new technique used in the field of deep learning.
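To make softmax and cross-entropy concrete, here is a small sketch in plain Python. It uses the standard definition of softmax, which exponentiates each output before normalizing, applied to the rat, cat and dog outputs from the example above. The function names and the choice of "dog" as the true class are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1.

    Subtracting the maximum first does not change the result but
    keeps math.exp from overflowing on large scores.
    """
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot):
    """Negative sum of (label * log(predicted probability))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot, probabilities) if t > 0)


scores = [0.001, 0.01, 0.02]           # raw outputs for rat, cat, dog
probs = softmax(scores)
loss = cross_entropy(probs, [0, 0, 1])  # assume the true class is dog
print(probs, sum(probs), loss)
```

Because the three raw scores are so close together, the resulting probabilities are all near one third, and the loss is close to log(3): the network is barely more confident about "dog" than about the other classes.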
What are the various regularization techniques used commonly?
- Batch Normalization
- L1 and L2 Regularization
- Dropout: Dropout is an efficient way of regularizing networks to avoid over-fitting in ANNs. During training, a dropout layer randomly chooses a fraction of the units (say x percent), sets their outputs to zero for that step, and proceeds with training. Hence, stochastically, the dropout layer cripples the neural network by removing hidden units. Dropout can also be seen as a way of combining several neural networks: for each training case, we randomly select a few hidden units, so we end up with a different architecture for every case. It is not to be used during the testing process. The Keras implementation takes care of this.
- Batch normalization: Batch normalization, or batch-norm, increases the efficiency of neural network training. It normalizes the output of a layer over each mini-batch to zero mean and a standard deviation of 1, which makes the network train faster and can reduce over-fitting. It has shown remarkable results in deep networks.
- L1 and L2 regularization: Both add a penalty on the weights to the loss function. L1 penalizes the sum of the absolute values of the weights, whereas L2 penalizes the sum of the squared values of the weights. As a result, L1 tends to push some weights to exactly zero, while L2 penalizes large weights much more strongly than small ones.
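The three techniques above can be sketched in plain Python. This is a simplified illustration, not how a library implements them: the function names, the sample values and the penalty strength are assumptions, the dropout shown is the "inverted" variant that rescales the surviving units during training, and the batch normalization omits the learned scale and shift that real layers add.

```python
import math
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each unit with probability `rate` during training (rate < 1).

    Surviving units are scaled by 1 / (1 - rate) so the expected sum
    is unchanged. At test time the input is returned untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def batch_normalize(values, epsilon=1e-5):
    """Shift and scale one batch of values to zero mean, unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


def l1_penalty(weights, strength):
    """Penalty proportional to the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Penalty proportional to the sum of squared weight values."""
    return strength * sum(w * w for w in weights)


hidden = [0.2, -0.5, 1.0, 0.7]                 # hypothetical activations
print(dropout(hidden, rate=0.5, rng=random.Random(0)))
print(dropout(hidden, rate=0.5, training=False))
print(batch_normalize([1.0, 2.0, 3.0, 4.0]))

weights = [0.5, -2.0, 0.0]                     # hypothetical weights
print(l1_penalty(weights, 0.01), l2_penalty(weights, 0.01))
```

In the last line the weight of -2.0 contributes four times as much as its absolute value to the L2 penalty, which shows why L2 is harsher on large weights.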
How do we train neural networks?
Let’s go through training. We should keep the number of parameters to optimize in mind while deciding on the model. The higher the number of parameters, the larger the dataset required and the longer the training time. Thus, the model architecture should be chosen carefully. The weights are updated via a process called backpropagation.
Backpropagation (calculus knowledge is required to understand this): It is an algorithm that deals with updating the weights in a neural network to minimize the error/loss function. The errors are propagated backwards through the network, using the chain rule, to work out how each weight should change.
What is the amount by which the weights need to be changed?
The answer lies in the error. Once we know the error, we compute its gradient with respect to each weight and use gradient descent to update the weights.
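For a single neuron, the whole idea fits in a few lines. The sketch below assumes one input, a sigmoid activation and a squared-error loss (all choices made for illustration, as are the numeric values), applies the chain rule by hand, and then checks the result against a numerical estimate of the gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_gradients(x, target, w, b):
    """Forward pass, then the chain rule backwards through each step."""
    # Forward pass
    z = w * x + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2

    # Backward pass
    dloss_dy = y - target          # derivative of 0.5 * (y - target)^2
    dy_dz = y * (1.0 - y)          # derivative of the sigmoid
    dloss_dz = dloss_dy * dy_dz
    grad_w = dloss_dz * x          # dz/dw = x
    grad_b = dloss_dz              # dz/db = 1
    return loss, grad_w, grad_b


# Hypothetical values, chosen only for the demonstration.
x, target, w, b = 1.5, 1.0, 0.4, -0.2
loss, grad_w, grad_b = loss_and_gradients(x, target, w, b)

# Numerical check: nudge w slightly and see how the loss changes.
eps = 1e-6
numeric_grad_w = (loss_and_gradients(x, target, w + eps, b)[0]
                  - loss_and_gradients(x, target, w - eps, b)[0]) / (2 * eps)
print(grad_w, numeric_grad_w)
```

The gradient with respect to w is negative here, which says that increasing w would reduce the loss. That sign and size are exactly what gradient descent uses for the update.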
Gradient descent: what does it do?
The gradient descent algorithm is responsible for multidimensional optimization, intending to reach the global minimum of the loss function. It does so by repeatedly moving the weights a small step in the direction opposite to the gradient. It is a sought-after optimization technique used in most machine-learning models. A variant of gradient descent, called stochastic gradient descent (SGD), is often used.
Learning Rate: The learning rate determines the size of each step. Note that an ANN with nonlinear activations will have local minima, so gradient descent may settle in one of them instead of the global minimum. The choice of learning rate plays a significant role, as it determines the fate of the learning process. If the learning rate is too high, the network may not converge at all and may end up diverging; if it is too low, training becomes very slow. There are various techniques to find a good learning rate. We will delve deep into learning rate schedules in the coming blog.
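The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes the one-dimensional function loss(w) = (w - 3)^2, which is an assumption chosen for illustration because its minimum (w = 3) and gradient (2 * (w - 3)) are known exactly. The two learning rates are likewise picked only to show the contrast.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


def gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


good = gradient_descent(gradient, start=0.0, learning_rate=0.1, steps=100)
too_high = gradient_descent(gradient, start=0.0, learning_rate=1.1, steps=100)
print(good)      # very close to 3
print(too_high)  # far away from 3: each step overshoots further
```

With the smaller rate, each step shrinks the distance to the minimum; with the larger one, each step overshoots by more than it corrects, so the distance grows.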
What functions does SGD optimize?
- SGD works well for optimizing non-convex functions. It differs from gradient descent in that each update uses only a part of the data rather than the whole dataset, which also makes it suitable for real-time streaming data.
- The size of that part of the data is the mini-batch size. It decides how frequently the weights are updated, since in reality the data can arrive in real time rather than from memory. Using a single data point per update is also possible in theory. It is better to experiment.
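A minimal sketch of mini-batch SGD is shown below. The task is an assumption chosen for illustration: fitting the line y = 2x + 1 to noise-free synthetic points with a mean-squared-error loss. The function name, batch size, learning rate and number of epochs are likewise hypothetical.

```python
import random


def sgd_linear_fit(data, batch_size, learning_rate, epochs, seed=0):
    """Fit y = w * x + b, updating after every mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the data in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the mean squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic points on the line y = 2x + 1, for x from -1.0 to 1.0.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_fit(data, batch_size=4, learning_rate=0.1, epochs=200)
print(w, b)
```

Setting batch_size to 1 gives the one-point-at-a-time case mentioned above, while setting it to len(data) turns the loop back into ordinary gradient descent.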
What is the importance of batch-size?
- Let us understand the role of batch-size. The batch-size determines how many data points the network sees at once.
- If the value is very high, the network sees all the data together, and computation becomes expensive in both memory and time.
- If it sees too few images at once, each update is based on a noisy sample, and the network does not capture the correlation present between the images.
After discussing the basic concepts, we are now ready to understand how deep learning for computer vision works.
Convolutional neural networks (CNNs):
What is the convolutional operation exactly?
It is a mathematical operation derived from the domain of signal processing. Convolution is used to get an output given the model and the input. The model is represented as a transfer function. The input convolved with the transfer function results in the output. Simple multiplication won’t do the trick here.
Considering all the concepts mentioned above, how are we going to use them in CNNs? What are the key elements in a CNN?
- CNNs are deep neural networks that have specific operations to get the spatial information present in images.
- With two sets of layers, one being the convolutional layer, and the other fully connected layers, CNNs are better at capturing spatial information.
- In traditional computer vision, we deal with feature extraction as a major area of concern. In deep learning, the convolutional layers take care of this for us. We thus have to ensure that enough convolutional layers exist to capture a range of features, from the lowest level to the highest level.
Why can’t we use Artificial neural networks in computer vision?
- ANNs deal with fully connected layers. Used with images, these tend to overfit, because every neuron has its own weight for every pixel and no weights are shared between neurons in the same layer. This results in a very large model because of the huge number of weights.
- The only way to handle larger images with such layers is to increase the model size further, as they require a huge number of neurons, and this does not scale. We can look at an image as a volume with the dimensions of height, width, and depth. Depth is the number of channels in an image (RGB).
A convolutional neural network learns filters similar to how an ANN learns weights. Various transformations encode these filters. We shall understand these transformations shortly. The filters learn to detect patterns in the images. The deeper the layer, the more abstract the patterns it detects; the shallower the layer, the more basic the features. Thus the initial layers detect edges, corners, and other low-level patterns.
An interesting question to think about here would be: What if we change the filters learned by random amounts, then would overfitting occur? Also, what is the behavior of the filters given the model has learned the classification well, and how would these filters behave when the model has learned it wrong?
Various components of CNN that enable us to learn the spatial information
- Convolutional layers use the kernel to perform convolution on the image. The kernel works with two parameters called size and stride.
- Stride is the number of pixels moved across the image every time we perform the convolution operation.
- The size is the dimension of the kernel, which is a measure of the receptive field of the CNN. Stride, together with the kernel size and padding, controls the size of the output image.
- For instance, with suitable padding, a stride of one produces an output of the same size as the input, and a stride of two produces an output of roughly half the size. Without padding, the output is smaller than the input even with a stride of one, so we pad the image to get an output of the same size as the input.
- Pooling layers reduce the size of the image across layers by a process called sampling. Sampling is carried out by mathematical operations such as minimum, maximum or averaging over a window; for example, selecting the maximum value in a window or taking the average of all values in the window.
- We place them between convolution layers.
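How kernel size, stride and padding interact can be checked with the commonly used output-size formula, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride and p the padding on each side. The helper below is a small sketch with a hypothetical name; the sizes are chosen to match the 5*5 image and 3*3 kernel used in the example that follows.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(5, 3))                       # 3: no padding, output shrinks
print(conv_output_size(5, 3, padding=1))            # 5: padding keeps the size
print(conv_output_size(5, 3, stride=2, padding=1))  # 3: stride 2 roughly halves it
```

The same formula applies to pooling layers, with the pooling window taking the place of the kernel.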
Consider the kernel and the pooling operation. In the following example, the image is the blue square of dimensions 5*5. The kernel is the 3*3 matrix shown in dark blue. The kernel slides over the image according to the stride, and the convolution operation is performed at each position. The dark green image is the output. To obtain each output value, multiply the values in the image window and the kernel element-wise and add up the products.
For example: 3*0 + 3*1 + 2*2 + 0*2 + 0*2 + 1*0 + 3*0 + 1*1 + 2*2 = 12
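The sliding-window computation can be sketched in plain Python. The kernel and the top-left 3*3 window of the image are taken from the worked sum above; the rest of the 5*5 image is not given in the text, so the remaining values are hypothetical and serve only to make the example runnable. As in most CNN libraries, the kernel is applied without flipping it.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image, with no padding.

    At each position, multiply the window and the kernel element-wise
    and add up the products.
    """
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Top-left 3*3 window matches the worked sum; other values are hypothetical.
image = [
    [3, 3, 2, 1, 0],
    [0, 0, 1, 3, 1],
    [3, 1, 2, 2, 3],
    [2, 0, 0, 2, 2],
    [2, 0, 0, 0, 1],
]
kernel = [
    [0, 1, 2],
    [2, 2, 0],
    [0, 1, 2],
]

result = convolve2d(image, kernel)
print(result[0][0])  # 12, the value computed by hand above
print(len(result), len(result[0]))  # a 3*3 output for a 5*5 image
```

Passing stride=2 to the same function yields a 2*2 output, since the kernel then visits only every second position.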
What are the advantages of pooling?
- Pooling acts as a regularization technique to prevent over-fitting. Pooling is performed on all the feature channels and can be performed with various strides.
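Pooling can be sketched in the same style. The function below applies a reducing operation (maximum by default) to each window of a single feature channel; the function names, the 4*4 feature map and the 2*2 window with stride 2 are assumptions chosen for illustration.

```python
def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Reduce each size*size window of one channel to a single value."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            window = [feature_map[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Hypothetical 4*4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool2d(feature_map)                   # [[6, 4], [7, 9]]
avg_pooled = pool2d(feature_map, reducer=average)  # [[3.75, 2.25], [3.5, 5.0]]
print(max_pooled)
print(avg_pooled)
```

Either way the 4*4 map shrinks to 2*2, so the layers that follow have a quarter as many values to process.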
CNN is the single most important aspect of deep learning models for computer vision. Now that we have learned the basic operations carried out in a CNN, we are ready for the case-study. The best approach to learning these concepts is through visualizations available on YouTube. That shall contribute to a better understanding of the basics.