Sentiment Analysis and Invoice Management System with Cloud Computing


Here is a cloud computing project on sentiment analysis and invoice management by Sunil Vadapalli, Virender Yadav, and Rambabu Donthi Boyina.



For every company, whether a start-up or a large enterprise, social media is a premier platform to promote its business. Marketing new products and gaining more customers through social media has grown over the last decade and has now become indispensable. Businesses also get product reviews, service reviews, or user reviews through social platforms. LinkedIn, Facebook, Instagram, and Twitter have been the leading platforms to build brand awareness. Capturing the data from these platforms helps businesses get real-time analysis and understand the sentiments of their customers.



The Capstone Project  



Problem statement:

The Invoice management system of our client works by sending invoices through emails to the accounts team, or by uploading them through their legacy application. Their application is not very robust and has many loopholes that create major problems for the teams at the end of the year for analysis. Accounts teams end up feeding all the client invoices manually into other systems after reviewing all the data from the legacy application. There are nearly 100 invoice formats that need to be loaded every month. The accounts team spends more time in analyzing these different formats and working on the different requirements for budget planning or reporting. All this leads to data inconsistency, and data availability has also become a challenge for the client.

Along with the above legacy application, the client also has internal applications to capture the sentiments of internal stakeholders through different surveys, but it doesn’t have any means to capture real-time sentiments/statistics from different social media channels for external stakeholders like shareholders, customers, and others.


For managing the invoices, we have tried implementing Amazon S3, SNS and Glue services. We used the Kinesis streams and Kinesis data analytics to capture the real-time data for Sentiment analytics. S3 and Glue are also used here for managing the streaming data after analysis. We used RDS to store the data and Athena for generating data sets to visualize through Amazon QuickSight. 

Note: For testing purposes, the real-time data generation can be done using AWS Lambda and CloudWatch event triggers. However, CloudWatch event triggers can be scheduled at a minimum interval of only 1 minute. In order to increase the flow of the stream, AWS Step Functions has been used. The step function triggers an iterator Lambda function, which in turn triggers the actual Lambda function that generates the data for the stream.

Role of cloud services:

S3 – Used to gather invoice data from the client and to gather streaming data from Kinesis.

SNS – To send an acknowledgment after receiving the invoice using different endpoints.

Glue Crawlers – To pull the data from S3 buckets into the Glue tables.

RDS – The above invoice data is loaded into RDS, which can be accessed from the application.

Athena – For querying the data and for preparing data sets.

QuickSight – The data sets that are prepared will be visualized for top executives and the PMO.

Kinesis Streams – To capture real-time data for analysis.

Kinesis Data Analytics – To analyse the streaming data generated by stakeholders and company employees on social media handles on Twitter, LinkedIn, and Facebook regarding revenue growth, newly deployed functions, achievements, and concerns.

Kinesis Firehose – To push the data from Streams to S3.

Lambda – To create streaming data internally.

CloudWatch Events – To trigger the Lambda function every minute (the shortest interval that can be achieved using the Events).

Step Functions – To invoke the data-generating Lambda function several times within a minute, so that a decent volume of streaming data is generated each minute.


Business flow

Invoice management system:

All the invoices received are stored in the S3 bucket, and an acknowledgment is sent to the user once the invoice is received. In one flow, the invoices are pulled into Glue tables using Crawlers, and this data is then pushed to RDS using ETL jobs in Glue. The other flow makes the data available to Athena to generate the data set, which can be visualized through Amazon QuickSight.

Sentiment analysis:

In the above scenario, a product review is considered for analysis. Two flows were created. One flow pushes the complete data into the database. The other flow goes through Kinesis Analytics, wherein the data is filtered based on the requirement. For example, all the negative reviews such as “Poor”, “Not satisfied”, etc. are selected, and this report is sent to Athena to generate the data set, which can be visualized through Amazon QuickSight. It helps in taking important decisions on the business flow.
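The filtering step can be sketched in plain Python. This is only an illustration of the logic, not the project's Kinesis Analytics application; the record layout, the field name `review`, and the keyword list are assumptions made for the example.

```python
# Hypothetical keyword list; the article names "Poor" and "Not satisfied" as examples.
NEGATIVE_PHRASES = ("poor", "not satisfied")


def is_negative(review_text):
    """Return True if the review text contains any negative phrase."""
    text = review_text.lower()
    return any(phrase in text for phrase in NEGATIVE_PHRASES)


def filter_negative(records):
    """Keep only the records whose 'review' field looks negative."""
    return [record for record in records if is_negative(record["review"])]


# Hypothetical stream records, standing in for data read from the stream.
sample_stream = [
    {"product": "A", "review": "Excellent quality"},
    {"product": "B", "review": "Poor packaging"},
    {"product": "C", "review": "Not satisfied with the delivery"},
]
negative_reviews = filter_negative(sample_stream)
```

Here `negative_reviews` keeps the records for products B and C, which is the subset that would be reported on.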

This same business flow can also be used in many scenarios like polling during sports, political elections, etc.

Invoice application:

A custom invoice management application is used to read the data from the database. This application is deployed on EC2 instances along with load balancers. It will be used by the accounts teams around the globe to generate the reports required for their respective regions.


Sample screenshots


[Figures: sample screenshots, including Kinesis Analytics]


While working on this project, cloud computing helped us understand many functional flows easily. The serverless concept helped us implement many services and experience functional flows practically. It is necessary to monitor costs and to keep an eye on them when transferring data from one region to another. This platform can help in understanding and building resilient architectures through practice.

This capstone project is a part of the PG program in cloud computing. Reach out to us if you are interested in pursuing this course with Great Learning. Read more about cloud computing technologies, applications, and career prospects in our blog section

About the Authors:

Sunil Vadapalli – Sunil started working as a Cloud Engineer and developed an interest in Cloud Computing Architecture and Networking. He has 8 years of experience in QA and is now shifting his profile completely towards cloud computing.

Virender Yadav – A.G.M (Server and Database Technologies) at Hinduja Global Solutions with 14+ years of experience in Server and Database technologies (Oracle/SQL Server/MySQL/Informix/PostgreSQL), Linux SUSE/Red Hat/Open Source, IT Infra Solutioning and Implementation, Production Support and Performance Optimization, and Data Warehouse Design and Implementation.

Rambabu Donthi Boyina – Rambabu has 15.3 years of experience in Windows, VMware, Active Directory, Security group policy, Citrix and scripting. He is working for FIS as Sr. Systems Engineer.

Top 50 Machine Learning Interview Questions


A Machine Learning interview is a rigorous process in which candidates are judged on various aspects such as technical and programming skills, knowledge of methods, and clarity of basic concepts. Here, we have compiled a list of machine learning interview questions that you might face during an interview.


1. What are the different types of Learning/ Training models in ML?

ML algorithms can be primarily classified depending on the presence/absence of target variables.
A. Supervised learning: [Target is present]
The machine learns using labelled data. The model is trained on an existing data set before it starts making decisions with the new data.

The target variable is continuous: Linear Regression, polynomial Regression, quadratic Regression
The target variable is categorical: Logistic regression, Naive Bayes, KNN, SVM, Decision Tree, Gradient Boosting, ADA boosting, Bagging, Random forest etc.

B. Unsupervised learning: [Target is absent]

The machine is trained on unlabelled data and without any guidance. It automatically infers patterns and relationships in the data, for example by creating clusters or by finding a lower-dimensional structure. The model learns through observations and deduced structures in the data.

Examples: Principal Component Analysis, Factor Analysis, Singular Value Decomposition, etc.

C. Reinforcement Learning:

The model learns through a trial and error method. This kind of learning involves an agent that interacts with the environment by taking actions and then discovers the errors or rewards that result from those actions.

2. What is the difference between deep learning and machine learning?

Machine Learning involves algorithms that learn from patterns in data and then apply what they have learnt to decision making. Deep Learning, on the other hand, is able to learn by processing data on its own and is loosely similar to the human brain in that it identifies something, analyses it, and makes a decision.

The key differences are as follows-
a. The manner in which data is presented to the system.
b. Machine learning algorithms typically require structured data, whereas deep learning networks rely on layers of artificial neural networks.



3. How do you select important variables while working on a data set? 

There are various means to select important variables from a data set that include the following:

a. Identify and discard correlated variables before finalizing on important variables

b. The variables could be selected based on ‘p’ values from Linear Regression

c. Forward, Backward, and Stepwise selection

d. Lasso Regression

e. Random Forest, followed by a variable importance plot

Top features can be selected based on information gain for the available set of features.


4. How are covariance and correlation different from one another?

Covariance measures how two variables are related to each other and how one would vary with respect to changes in the other variable. If the value is positive, it means there is a direct relationship between the variables: one would increase or decrease with an increase or decrease in the base variable respectively, given that all other conditions remain constant. Its magnitude depends on the units of the variables.

Correlation quantifies the strength and direction of the linear relationship between two random variables on a standardised scale. It is the covariance divided by the product of the two standard deviations, so it can take any value from -1 to 1. 1 denotes a perfect positive relationship, -1 denotes a perfect negative relationship, and 0 denotes that there is no linear relationship between the two variables (which by itself does not guarantee that they are independent).
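As a minimal sketch of the difference, the snippet below computes sample covariance and Pearson correlation by hand. The height and weight values are made up for illustration; the point is that rescaling a variable changes the covariance but not the correlation.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical measurements.
heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [50.0, 60.0, 63.0, 75.0]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)   # depends on the unit of height
cov_m = covariance(heights_m, weights_kg)     # 100 times smaller
corr_cm = correlation(heights_cm, weights_kg)  # unit-free, between -1 and 1
corr_m = correlation(heights_m, weights_kg)    # same value as corr_cm
```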


5. When does regularization come into play in Machine Learning?

Regularization becomes necessary when the model begins to overfit. It is a technique that shrinks, or regularizes, the coefficient estimates towards zero. It reduces the flexibility of the model and discourages it from learning overly complex patterns, so as to avoid the risk of overfitting. The model complexity is reduced and it becomes better at predicting unseen data.


6. What is Bias, Variance and what do you mean by Bias-Variance Tradeoff?

Both are sources of error in Machine Learning algorithms. Bias results when the algorithm has limited flexibility to learn the true relationship from the dataset. Variance, on the other hand, occurs when the model is extremely sensitive to small fluctuations in the training data.

If one adds more features while building a model, it will add more complexity and we will lose bias but gain some variance. In order to keep the overall error low, we perform a tradeoff between bias and variance based on the needs of a business.


[Figure: Bias and Variance. Source: Understanding the Bias-Variance Tradeoff, Scott Fortmann-Roe]

7. How can we relate standard deviation and variance? 

Standard deviation refers to the spread of your data from the mean. Variance is the average squared deviation of each point from the mean, i.e. from the average of all data points. The two are related because standard deviation is the square root of variance.
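A short check of this relationship, using Python's standard `statistics` module on a made-up data set:

```python
import math
import statistics

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # population variance: mean squared deviation
std_dev = statistics.pstdev(data)      # population standard deviation

# Standard deviation is the square root of variance.
relation_holds = math.isclose(std_dev, math.sqrt(variance))
```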


8. Is high variance in data good or bad? Explain with a reason. 

Higher variance directly means that the data spread is big and the feature has a wide variety of values. Usually, high variance in a feature is seen as a sign of lower data quality.


9. What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?

Gradient Descent and Stochastic Gradient Descent are algorithms that find the set of parameters that minimizes a loss function.

The difference is that in Gradient Descent, all training samples are evaluated for each update of the parameters, while in Stochastic Gradient Descent only one training sample is evaluated for each update.
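The contrast can be sketched on a toy problem: fitting a single weight w so that w*x matches y. The data, learning rate, and step count below are assumptions chosen only to keep the example small.

```python
import random

# Hypothetical noiseless data generated from y = 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]


def gradient(w, x, y):
    """Derivative of the squared error (w*x - y)^2 with respect to w."""
    return 2.0 * (w * x - y) * x


def batch_gradient_descent(steps=200, lr=0.01):
    """Each update averages the gradient over ALL training samples."""
    w = 0.0
    for _ in range(steps):
        grad = sum(gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


def stochastic_gradient_descent(steps=200, lr=0.01, seed=0):
    """Each update uses the gradient of ONE randomly chosen sample."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        w -= lr * gradient(w, xs[i], ys[i])
    return w


w_gd = batch_gradient_descent()
w_sgd = stochastic_gradient_descent()
```

Both estimates approach the true weight of 3 here; on noisy data the stochastic path would fluctuate more, while each of its updates stays much cheaper.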


10. Can you mention some advantages and disadvantages of decision trees?

The advantages of decision trees are that they are easier to interpret, are nonparametric and hence robust to outliers, and have relatively few parameters to tune.

On the other hand, the disadvantage is that they are prone to overfitting.


11. What is Principal Component Analysis?

The idea here is to reduce the dimensionality of the data set by reducing the number of variables that are correlated with each other, while retaining the variation to the maximum extent.

The variables are transformed into a new set of variables that are known as ‘Principal Components’. These PCs are the eigenvectors of a covariance matrix and therefore are orthogonal.


12. What are outliers? Mention three methods to deal with outliers.

A data point that is considerably distant from the other similar data points is known as an outlier. They may occur due to experimental errors or variability in measurement. They are problematic and can mislead a training process, which eventually results in longer training time, inaccurate models, and poor results.

The three methods to deal with outliers are:

Univariate method – looks for data points having extreme values on a single variable

Multivariate method – looks for unusual combinations on all the variables

Minkowski error – reduces the contribution of potential outliers in the training process
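As one concrete instance of the univariate method, the sketch below flags values that fall outside 1.5 times the interquartile range. The 1.5 multiplier is a common convention rather than something the text prescribes, and the readings are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = iqr_outliers(readings)
```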


13. What is the difference between regularization and normalisation? 

Normalisation adjusts the data; regularisation adjusts the prediction function. If your data is on very different scales (especially low to high), you would want to normalise the data by altering each column to have compatible basic statistics. This can be helpful to make sure there is no loss of accuracy. One of the goals of model training is to identify the signal and ignore the noise. If the model is given free rein to minimize error, there is a possibility of suffering from overfitting. Regularization imposes some control on this by favouring simpler fitting functions over complex ones.


14. List the most popular distribution curves along with scenarios where you will use them in an algorithm.

The most popular distribution curves are as follows- Bernoulli Distribution, Uniform Distribution, Binomial Distribution, Normal Distribution, Poisson Distribution, and Exponential Distribution.

Each of these distribution curves is used in various scenarios.

Bernoulli Distribution can be used to check if a team will win a championship or not, a new born child is either male or female, you either pass an exam or not, etc.
Uniform distribution is a probability distribution in which every outcome has the same probability. Rolling a single die is one example because each of its fixed number of outcomes is equally likely.
Binomial distribution describes the number of successes in a fixed number of independent trials, each with only two possible outcomes; the prefix ‘bi’ means two or twice. An example of this would be the number of heads in ten coin tosses, where each toss will either be heads or tails.
Normal distribution describes how the values of a variable are distributed. It is typically a symmetric distribution where most of the observations cluster around the central peak. The values further away from the mean taper off equally in both directions. An example would be the height of students in a classroom.
Poisson distribution helps predict the probability of certain events happening when you know how often that event has occurred. It can be used by businessmen to make forecasts about the number of customers on certain days and allows them to adjust supply according to the demand.
Exponential distribution is concerned with the amount of time until a specific event occurs. For example, how long a car battery would last, in months.
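To make two of these distributions concrete, the sketch below writes out the standard binomial and Poisson probability mass functions. The example parameters (ten coin tosses, an average of four customers) are assumptions for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)


# A single trial (n = 1) is the Bernoulli case.
p_one_head = binomial_pmf(1, 1, 0.5)

# Probability of exactly 5 heads in 10 fair coin tosses.
p_five_heads = binomial_pmf(5, 10, 0.5)

# Probability of exactly 6 customers when 4 arrive on average.
p_six_customers = poisson_pmf(6, 4.0)
```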


15. How do we check the normality of a data set or a feature? 

Visually, we can check it using plots. Several statistical normality tests are also available:
a. Shapiro-Wilk W Test
b. Anderson-Darling Test
c. Martinez-Iglewicz Test
d. Kolmogorov-Smirnov Test
e. D’Agostino Skewness Test


16. What is Linear Regression?

A linear function can be defined as a mathematical function on a 2D plane as Y = MX + C, where Y is the dependent variable, X is the independent variable, C is the intercept and M is the slope. The same can be expressed as Y is a function of X, or Y = F(X).

At any given value of X, one can compute the value of Y using the equation of the line. Modelling this relation between Y and X with a polynomial of degree 1 is called Linear Regression.

In Predictive Modeling, LR is represented as Y = B0 + B1x1 + B2x2

The values of B1 and B2 determine the strength and direction of the relationship between each feature and the dependent variable.

Example: Stock Value in $ = Intercept + (+/-B1)*(Opening value of Stock) + (+/-B2)*(Previous Day Highest value of Stock)
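For a single feature, the intercept and slope can be estimated with the closed-form least-squares formulas. The sketch below assumes one independent variable and made-up data generated from Y = 2X + 1.

```python
def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_line(xs, ys)

# Predict Y at a new value of X using the fitted line.
prediction = intercept + slope * 6.0
```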


17. Differentiate between regression and classification.

Regression and classification are categorized under the same umbrella of supervised machine learning. The main difference between them is that the output variable in the regression is numerical (or continuous) while that for classification is categorical (or discrete).

Example: Predicting the exact temperature of a place is a regression problem, whereas predicting whether the day will be sunny, cloudy, or rainy is a case of classification.


18. What is target imbalance? How do we fix it? Consider a scenario where you have handled target imbalance in the data: which metrics and algorithms would you find suitable for this data?

Target imbalance occurs when the target is categorical and a frequency count shows that certain categories are far more numerous than others. Example: Target column – 0,0,0,1,0,2,0,0,1,1 [0: 60%, 1: 30%, 2: 10%], where 0 is in the majority. To fix this, we can perform up-sampling or down-sampling. Suppose the performance metric used before fixing this problem was the confusion matrix. After fixing this problem, we can shift the metric to ROC AUC. Since we added/deleted data [up-sampling or down-sampling], we can go ahead with a stricter algorithm like SVM, Gradient Boosting or AdaBoost.


19. List all assumptions for a data to be met before starting with linear regression.

Before starting linear regression, the assumptions to be met are as follows-
a. Linear relationship
b. Multivariate normality
c. No or little multicollinearity
d. No auto-correlation
e. Homoscedasticity


20. When does the linear regression line stop rotating and find an optimal spot where it is fitted on the data?

The line comes to rest at the place where the highest R-squared value is found, which for ordinary least squares is the same as the place where the sum of squared residuals is smallest. R-squared represents the proportion of the total variance in the target that is captured by the fitted regression line.
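R-squared can be computed directly from the actual and predicted values as one minus the ratio of residual to total sum of squares. The values below are hypothetical.

```python
def r_squared(actual, predicted):
    """Proportion of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
close_fit = [3.1, 4.9, 7.2, 8.8]   # predictions from a well-placed line
mean_only = [6.0, 6.0, 6.0, 6.0]   # a flat line at the mean of `actual`

r2_close = r_squared(actual, close_fit)  # near 1
r2_mean = r_squared(actual, mean_only)   # exactly 0: explains no variance
```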


21. Why is logistic regression a type of classification technique and not a regression? Name the function it is derived from? 

Since the target column is categorical, logistic regression uses a linear function to model the log of the odds, and the result is mapped to a probability that is then thresholded to assign a class. Hence, it is a type of classification technique and not a regression. It is derived from the logistic (sigmoid) function.
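The link between the linear score, the log-odds, and the predicted probability can be sketched as follows. The weight and bias values are arbitrary assumptions, not fitted coefficients.

```python
import math


def sigmoid(z):
    """Map a linear score z (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of the sigmoid: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def predict_probability(features, weights, bias):
    """Linear combination of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical one-feature model: the score is 0 at x = 2, so p = 0.5 there.
p_boundary = predict_probability([2.0], [1.5], -3.0)
p_positive = predict_probability([4.0], [1.5], -3.0)
predicted_class = 1 if p_positive >= 0.5 else 0
```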


22. Which machine learning algorithm is known as the lazy learner and why is it called so?

KNN is a Machine Learning algorithm known as a lazy learner. K-NN is a lazy learner because it doesn’t learn any model parameters from the training data; instead, it memorises the training dataset and dynamically calculates distances every time it wants to classify.
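A minimal sketch of this behaviour: there is no training step at all, and every prediction measures the distance from the query to each stored point. The points and labels are made up.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2D training data: the "model" is just this stored data.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
train_labels = ["small", "small", "small", "large", "large", "large"]

label_a = knn_predict(train_points, train_labels, (2.0, 2.0))
label_b = knn_predict(train_points, train_labels, (8.0, 9.0))
```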


23. Is it possible to use KNN for image processing? 

Yes, it is possible to use KNN for image processing. It can be done by converting the 3-dimensional image into a single-dimensional vector and using the same as input to KNN. 



24. Differentiate between K-Means and KNN algorithms?

KNN is Supervised Learning, whereas K-Means is Unsupervised Learning. With KNN, we predict the label of the unidentified element based on its nearest neighbours and further extend this approach for solving classification/regression-based problems.

With K-Means, we don’t have any labels present, in other words, no target variables, and thus we try to cluster the data based upon their coordinates and try to establish the nature of each cluster based on the elements assigned to that cluster.


25. How does the SVM algorithm deal with self-learning? 

SVM has a learning rate and expansion rate which takes care of this. The learning rate compensates or penalises the hyperplanes for making all the wrong moves and expansion rate deals with finding the maximum separation area between classes. 


26. What are Kernels in SVM? List popular kernels used in SVM along with a scenario of their applications.

The function of kernel is to take data as input and transform it into the required form. A few popular Kernels used in SVM are as follows: RBF, Linear, Sigmoid, Polynomial, Hyperbolic, Laplace, etc. 


27. What is Kernel Trick in an SVM algorithm?

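The Kernel Trick is a technique in which a kernel function is applied to pairs of data points to compute their similarity as if they had been mapped into a higher-dimensional space, without computing that mapping explicitly. This makes it possible to find the region of classification between two different classes. Based on the choice of function, be it linear or radial, which depends upon the distribution of data, one can build a classifier.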


28. Explain how ensemble techniques yield better learning as compared to traditional classification ML algorithms? 

Ensemble is a group of models that are used together for prediction both in classification and regression class. Ensemble learning helps improve ML results because it combines several models. By doing so, it allows a better predictive performance compared to a single model. 

They are superior to individual models as they reduce variance, average out biases, and have lesser chances of overfitting.


29. What are overfitting and underfitting? Why does the decision tree algorithm suffer often with overfitting problem?

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Underfitting occurs when a model or machine learning algorithm does not fit the data well enough; such a model shows low variance but high bias.
In decision trees, overfitting occurs when the tree is designed to perfectly fit all samples in the training data set. This results in branches with strict rules or sparse data and affects the accuracy when predicting samples that aren’t part of the training set.


30. What is OOB error and how does it occur? 

For each bootstrap sample, roughly one-third of the data (about 37% on average) is not used in the creation of the tree, i.e., it is out of the sample. This data is referred to as out of bag data. In order to get an unbiased measure of the accuracy of the model over test data, out of bag error is used. The out of bag data for each tree is passed through that tree, and the outputs are aggregated to give the out of bag error. This percentage error is quite effective in estimating the error in the testing set and does not require further cross-validation.
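The "roughly one-third" figure can be checked with a small simulation: draw a bootstrap sample with replacement and count the rows that were never picked. The row count and seed are arbitrary choices for the sketch.

```python
import random


def out_of_bag_fraction(n_rows, seed=0):
    """Fraction of rows never drawn in one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(in_bag) / n_rows


# For large n this tends to (1 - 1/n)^n, which approaches 1/e (about 0.368).
oob_share = out_of_bag_fraction(10_000)
```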


31. Why is boosting a more stable algorithm as compared to other ensemble algorithms?

Boosting focuses on the errors found in previous iterations until they become obsolete, whereas in bagging there is no corrective loop. This is why boosting is a more stable algorithm compared to other ensemble algorithms.


32. List popular cross validation techniques.

Six techniques are commonly listed (the last two are hyperparameter search procedures that are built on top of cross validation). They are as follows-
a. K fold
b. stratified k fold
c. leave one out
d. bootstrapping
e. random search cv
f. grid search cv


33. Is it possible to test for the probability of improving model accuracy without cross validation techniques? If yes, please explain.

Yes, it is possible to test for the probability of improving model accuracy without cross validation techniques. We can do so by running the ML model for, say, n iterations and recording the accuracy each time. Plot all the accuracies and remove the 5% of low-probability values. Measure the left [low] cut-off and right [high] cut-off. With the remaining 95% confidence, we can say that the model accuracy can go as low or as high as the cut-off points.


34. Name a popular dimensionality reduction algorithm.

Popular dimensionality reduction algorithms are Principal Component Analysis and Factor Analysis.
Principal Component Analysis creates one or more index variables from a larger set of measured variables. Factor Analysis is a model of the measurement of a latent variable. This latent variable cannot be measured with a single variable, and is seen through a relationship it causes in a set of y variables.


35. How can we use a dataset without the target variable into supervised learning algorithms? 

Input the data set into a clustering algorithm, generate optimal clusters, label the cluster numbers as the new target variable. Now, the dataset has independent and target variables present. This ensures that the dataset is ready to be used in supervised learning algorithms. 


36. List all types of popular recommendation systems? Name and explain two personalised recommendation systems along with their ease of implementation. 

Popularity based recommendation, content-based recommendation, user-based collaborative filter, and item-based recommendation are the popular types of recommendation systems.
Personalised Recommendation systems are- Content-based recommendation, user-based collaborative filter, and item-based recommendation. User-based collaborative filter and item-based recommendations are more personalised. Ease to maintain: Similarity matrix can be maintained easily with Item-based recommendation.


37. How do we deal with sparsity issues in recommendation systems? How do we measure its effectiveness? Explain. 

Singular value decomposition can be used to generate prediction matrix. RMSE is the measure that helps us understand how close the prediction matrix is to the original matrix.  


38. Name and define techniques used to find similarities in the recommendation system. 

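Pearson correlation and cosine similarity are techniques used to find similarities in recommendation systems. Pearson correlation measures the linear relationship between two rating vectors after centring each on its mean, while cosine similarity measures the cosine of the angle between the two vectors.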
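Cosine similarity is short enough to write out by hand. The rating vectors below are hypothetical; each position stands for one item rated by a user.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical ratings given by three users to the same three items.
user_a = [5.0, 3.0, 1.0]
user_b = [4.0, 3.0, 2.0]   # similar taste to user_a
user_c = [1.0, 2.0, 5.0]   # opposite preferences

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
```

In this example `sim_ab` is larger than `sim_ac`, so user_b would be treated as the closer neighbour of user_a.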


39. Keeping train and test split criteria in mind, is it good to perform scaling before the split or after the split? 

Scaling should be done post-train and test split ideally. If the data is closely packed, then scaling post or pre-split should not make much difference.
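One reason for scaling after the split is to keep test data from influencing the scaling statistics. A minimal sketch, assuming z-score scaling and made-up values: the mean and standard deviation are learnt from the training split only and then reused on the test split.

```python
import statistics


def fit_scaler(train_values):
    """Learn the scaling statistics from the training split only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def transform(values, mean, std):
    """Apply previously learnt statistics to any split."""
    return [(v - mean) / std for v in values]


# Hypothetical feature values after the train/test split.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # no statistics taken from the test split
```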


40. Define precision, recall and F1 Score?

The performance of a classification model is assessed using the Confusion Matrix. The Confusion Matrix can be further interpreted with the following terms:

True Positives (TP) – These are the correctly predicted positive values. It implies that the value of the actual class is yes and the value of the predicted class is also yes.

True Negatives (TN) – These are the correctly predicted negative values. It implies that the value of the actual class is no and the value of the predicted class is also no.

False Positives (FP) and False Negatives (FN) – These values occur when your actual class contradicts the predicted class.


Recall, also known as Sensitivity or the true positive rate, is the ratio of true positives (TP) to all observations in the actual class – yes.

Recall = TP/(TP+FN)

Precision, also known as positive predictive value, is the ratio of true positives to all the positives the model claims, i.e. it measures how many of the predicted positives are accurate.

Precision = TP/(TP+FP)

Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observations to the total observations.

Accuracy = (TP+TN)/(TP+FP+FN+TN)

F1 Score is the harmonic mean of Precision and Recall, F1 = 2*(Precision*Recall)/(Precision+Recall). Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have a similar cost. If the costs of false positives and false negatives are very different, it’s better to look at both Precision and Recall.
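These formulas translate directly into code. The confusion-matrix counts below are invented to show an imbalanced case in which accuracy looks high while recall and F1 are noticeably lower.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 12 actual positives out of 100 observations.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```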



41. What is Bayes’ Theorem? State at least 1 use case with respect to the machine learning context?

Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes’ theorem, a person’s age can be used to more accurately assess the probability that they have cancer than can be done without the knowledge of the person’s age.

Chain rule for Bayesian probability can be used to predict the likelihood of the next word in the sentence.
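A worked example helps here. The sketch below applies Bayes' Theorem to a diagnostic-test setting; the prior, sensitivity, and false positive rate are made-up numbers, not medical statistics.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(condition | positive result) via Bayes' Theorem.

    prior               = P(condition)
    likelihood          = P(positive | condition)
    false_positive_rate = P(positive | no condition)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false positives.
p_condition_given_positive = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```

With these assumed numbers the posterior is roughly 15%, far below the 90% sensitivity, because the condition is rare to begin with.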


42. Explain the difference between Lasso and Ridge?

Lasso(L1) and Ridge(L2) are the regularization techniques where we penalize the coefficients to find the optimum solution. In ridge, the penalty function is defined by the sum of the squares of the coefficients and for the Lasso, we penalize the sum of the absolute values of the coefficients. Another type of regularization method is ElasticNet, it is a hybrid penalizing function of both lasso and ridge. 
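The two penalty terms can be written out as below. The strength parameter `alpha` and the coefficient values are assumptions for the example; real libraries may scale these terms slightly differently.

```python
def lasso_penalty(coefficients, alpha=1.0):
    """L1 penalty: alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(c) for c in coefficients)


def ridge_penalty(coefficients, alpha=1.0):
    """L2 penalty: alpha times the sum of squared coefficient values."""
    return alpha * sum(c * c for c in coefficients)


# Hypothetical coefficient vector.
coefficients = [3.0, -4.0, 0.0]
l1 = lasso_penalty(coefficients)   # 3 + 4 + 0
l2 = ridge_penalty(coefficients)   # 9 + 16 + 0

# For a small coefficient the L1 penalty is relatively larger than the L2 penalty.
small_l1 = lasso_penalty([0.1])
small_l2 = ridge_penalty([0.1])
```

That last comparison is one way to see why Lasso tends to push small coefficients all the way to zero while Ridge only shrinks them.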


43. What’s the difference between probability and likelihood?

Probability is the measure of how likely it is that an event will occur, that is, how certain we are that a specific event will occur. A likelihood function, in contrast, is a function of parameters within the parameter space that describes the probability of obtaining the observed data.

So the fundamental difference is, Probability attaches to possible results; likelihood attaches to hypotheses. 


44. Why would you Prune your tree?

Decision Trees are prone to overfitting; pruning the tree helps to reduce its size and minimizes the chances of overfitting. It serves as a tool to perform the bias-variance tradeoff.


45. Model accuracy or Model performance? Which one will you prefer and why?

This is a trick question. One should first get a clear idea of what Model Performance means. If performance means speed, then it depends on the nature of the application; any application that runs in a real-time scenario will need high speed as an important feature. Example: the best search results lose their value if the query results do not appear fast.

If performance hints at why accuracy is not the most important virtue: for any imbalanced dataset, the F1 score explains the business case better than accuracy does, and precision and recall become more important than the rest.
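
The sketch below illustrates the point on a made-up imbalanced dataset (95 negatives, 5 positives) and a lazy model that always predicts the majority class. The helper names and the data are assumptions for this example.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [0] * 95 + [1] * 5   # hypothetical imbalanced labels
y_pred = [0] * 100            # a model that always predicts the majority class

print(accuracy(y_true, y_pred))             # 0.95
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```

The model looks 95% accurate while never finding a single positive case, which is exactly what precision, recall and F1 expose.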


46. How would you handle an imbalanced dataset?

Sampling techniques can help with an imbalanced dataset. There are two ways to perform sampling: undersampling and oversampling.

In undersampling, we reduce the size of the majority class to match the minority class. This improves storage and run-time performance, but it potentially discards useful information.

In oversampling, we upsample the minority class and thus avoid the information loss; however, the duplicated examples increase the risk of overfitting.

There are other techniques as well –

Cluster-Based Over Sampling – In this case, the K-means clustering algorithm is independently applied to minority and majority class instances. This is to identify clusters in the dataset. Subsequently, each cluster is oversampled such that all clusters of the same class have an equal number of instances and all classes have the same size.

Synthetic Minority Over-sampling Technique (SMOTE) – A subset of data is taken from the minority class, and for each of these points new synthetic instances are created by interpolating between the point and its nearest minority-class neighbours. The synthetic instances are then added to the original dataset. This technique works well for numerical data points.
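
The sampling ideas above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `smote_like` interpolates between two randomly chosen minority points rather than between nearest neighbours as the full SMOTE technique does, and all function names and data points are hypothetical.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class, the size of the minority."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Duplicate random minority rows until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote_like(minority, n_new, rng):
    """Create synthetic points on the segment between two minority points."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


majority = [(float(i), float(i % 3)) for i in range(20)]  # hypothetical rows
minority = [(100.0, 1.0), (102.0, 3.0), (104.0, 2.0)]

rng = random.Random(0)
maj_u, min_u = random_undersample(majority, minority, rng)
maj_o, min_o = random_oversample(majority, minority, rng)
new_points = smote_like(minority, 5, rng)
print(len(maj_u), len(min_u))  # 3 3
print(len(maj_o), len(min_o))  # 20 20
```

Undersampling throws away 17 of the 20 majority rows here, while oversampling repeats the same three minority rows many times; the synthetic points avoid exact duplicates, which is the motivation for SMOTE.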


47. Mention some of the EDA Techniques?

Exploratory Data Analysis (EDA) helps analysts to understand the data better and forms the foundation of better models. 


– Univariate visualization

– Bivariate visualization

– Multivariate visualization

Missing Value Treatment – Replace missing values with either the mean or the median

Outlier Detection – Use a boxplot to inspect the distribution and spot outliers, then apply the IQR rule to set the boundary beyond which points are treated as outliers

Transformation – Based on the distribution, apply a transformation on the features

Scaling the Dataset – Apply MinMax scaling or Standard Scaler (Z-score) scaling to bring the features onto a comparable scale

Feature Engineering – Domain needs and SME knowledge help the analyst derive new fields that reveal more about the nature of the data

Dimensionality reduction – Helps in reducing the number of features without losing much information
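
A few of these steps can be sketched with the Python standard library alone. The helper names, the sample data and the conventional 1.5 multiplier for the IQR rule are assumptions for this example.

```python
from statistics import mean, median, pstdev, quantiles


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Boundaries of the IQR rule: points outside them count as outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Centre values on 0 with unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [10, 12, 11, 13, 12, 11, 95]  # hypothetical feature with one extreme value
low, high = iqr_bounds(data)
outliers = [v for v in data if v < low or v > high]
print(low, high)   # 8.75 14.75
print(outliers)    # [95]
```

Whether an outlier like 95 should be removed, capped or kept is a judgment that depends on the domain; the rule only flags it.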


48. Differentiate between Statistical Modeling and Machine Learning?

Machine learning models are about making accurate predictions, such as footfall in restaurants or stock prices, whereas statistical models are designed for inference about the relationships between variables, such as what drives the sales in a restaurant: the food or the ambience.


49. Differentiate between Boosting and Bagging?

Bagging and Boosting are variants of ensemble techniques. In Bagging, the goal is to reduce the variance of a decision tree classifier. Several subsets of data are created from the training sample, chosen randomly with replacement, and each subset is used to train its own decision tree. The predictions of the trees are then combined by voting or averaging.

Example – Random Forest Classifier

In Boosting, learners are trained sequentially, with early learners fitting simple models to the data and then analyzing the data for errors. Consecutive trees are fit, and at every step the goal is to improve on the accuracy of the prior tree. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into a better-performing model.

Example – AdaBoost, Gradient Boosting, XGBoost
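
The two mechanics described above, sampling with replacement for Bagging and re-weighting misclassified inputs for Boosting, can be sketched as follows. The re-weighting follows the usual AdaBoost update and assumes the weighted error lies strictly between 0 and 1; the function names and the five-sample example are hypothetical.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def adaboost_reweight(weights, misclassified):
    """One boosting round: return the learner's weight and the new sample weights."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)  # say of this weak learner
    updated = [
        w * math.exp(alpha if wrong else -alpha)  # raise weights of mistakes
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


rows = list(range(10))
sample = bootstrap_sample(rows, random.Random(1))  # may contain repeats

weights = [0.2] * 5                                # five equally weighted samples
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print([round(w, 3) for w in new_weights])  # [0.125, 0.125, 0.5, 0.125, 0.125]
```

After one round the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right.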


50. What is the significance of Gamma and Regularization in SVM?

Gamma defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting. If gamma is very small, the model is too constrained and cannot capture the complexity of the data.

The regularization parameter (C in most SVM libraries; some texts use lambda, which is inversely related) sets the degree of importance given to misclassifications. A larger C penalizes misclassifications more heavily, so it can be used to trade off a close fit on the training data against overfitting.
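
To see what ‘far’ and ‘close’ mean in numbers, the sketch below evaluates an RBF kernel, assuming the common form exp(-gamma * squared distance), for one pair of points and three gamma values. The choice of kernel, the points and the gamma values are assumptions for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """Similarity of two points under an RBF kernel with the given gamma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x, y = (0.0, 0.0), (1.0, 1.0)  # two hypothetical points, squared distance 2
for gamma in (0.01, 1.0, 100.0):
    print(gamma, rbf_kernel(x, y, gamma))
```

With a small gamma the two points still look almost identical to the kernel (a far-reaching influence), while with a large gamma the similarity drops to practically zero, so each support vector only influences its immediate surroundings.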

Stay tuned to this page for more such information on interview questions and career assistance. You can check our other blogs about Machine Learning for more information.

If you want to prepare more to land your dream job in ML, you can upskill with Great Learning’s PG program in Machine Learning. 

Pattern Recognition in Machine Learning: An Introduction

Reading Time: 6 minutes

Patterns are everywhere. They belong to every aspect of our daily lives. Starting from the design and colour of our clothes to using intelligent voice assistants, everything involves some kind of pattern. When we say that everything consists of a pattern or everything has a pattern, the common question that comes to mind is, what is a pattern? How can we say that it constitutes almost everything and anything surrounding us? How can it be implemented in the technologies that we use every day?

Well, the answer to all these questions is one of the simplest things that all of us have been doing probably since childhood. When we were in school, we were often given the task of identifying the missing letters, predicting which number would come next in a sequence, or joining the dots to complete a figure. Predicting the missing number or letter involved analyzing the trend followed by the given numbers or letters. This is exactly what pattern recognition in Machine Learning means.

What is meant by pattern recognition?

Pattern Recognition is defined as the process of identifying the trends (global or local) in the given data. A pattern can be defined as anything that follows a trend and exhibits some kind of regularity. The recognition of patterns can be done physically, mathematically or by the use of algorithms. When we talk about pattern recognition in machine learning, it indicates the use of powerful algorithms for identifying the regularities in the given data. Pattern recognition is widely used in new-age technical domains like computer vision, speech recognition, face recognition, etc.

Types of pattern recognition algorithms in machine learning 


Supervised Algorithms-

The pattern recognition done using a supervised approach is called classification. These algorithms use a two-stage methodology for identifying the patterns. The first stage involves the development/construction of the model and the second stage involves the prediction for new or unseen objects. The key features of this concept are listed below.

  • Partition the given data into two sets: a training set and a test set.
  • Train the model using a suitable machine learning algorithm such as SVM (Support Vector Machines), decision trees, random forest, etc.
  • Training is the process through which the model learns or recognizes the patterns in the given data for making suitable predictions.
  • The test set contains examples whose correct labels are already known.
  • It is used for validating the predictions made by the model that was trained on the training set.
  • The model is trained on the training set and tested on the test set.
  • The performance of the model is evaluated based on the correct predictions made.
  • The trained and tested model developed for recognizing patterns using machine learning algorithms is called a classifier.
  • This classifier is used to make predictions for unseen data/objects.
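
The workflow in the list above can be sketched end to end with a deliberately simple classifier. The nearest-centroid model used here is a stand-in chosen for brevity, not one of the algorithms named above, and the synthetic two-class data, the 80/20 split and all function names are assumptions.

```python
import random


def train_test_split(rows, test_fraction, rng):
    """Shuffle the rows and split them into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Training: learn the mean feature vector of every class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Prediction: return the class whose centroid is closest."""
    def sq_dist(centre):
        return sum((a - b) ** 2 for a, b in zip(features, centre))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


rng = random.Random(42)
# Hypothetical labelled data: two well-separated groups of points.
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(50)]
data += [((rng.gauss(8, 1), rng.gauss(8, 1)), "b") for _ in range(50)]

train, test = train_test_split(data, 0.2, rng)
classifier = fit_centroids(train)
correct = sum(1 for features, label in test if predict(classifier, features) == label)
accuracy = correct / len(test)
print(len(train), len(test), accuracy)
```

The test labels are never shown to `fit_centroids`; they are only used afterwards to count correct predictions, which is what makes the accuracy figure an honest check.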

Unsupervised Algorithms- 

In contrast to the supervised algorithms for pattern recognition, which make use of training and testing sets, these algorithms use a group-by approach. They observe the patterns in the data and group items based on the similarity of their features, such as their dimensions, to make a prediction. Let’s say that we have a basket of different kinds of fruits such as apples, oranges, pears, and cherries. We assume that we do not know the names of the fruits. We keep the data as unlabeled. Now, suppose we encounter a situation where someone comes and tells us to identify a new fruit that was added to the basket. In such a case we make use of a concept called clustering.

  • Clustering combines or groups items having the same features.
  • No previous knowledge is available for identifying a new item.
  • It uses machine learning algorithms like hierarchical and k-means clustering.
  • Based on the features or properties of the new object, it is assigned to a group to make a prediction.
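
The fruit-basket idea can be sketched with a minimal k-means. The fruit measurements (weight in grams, diameter in centimetres), the choice of two clusters and the function names are assumptions made up for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(centroids, point):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], point))


def k_means(points, k, rng, iterations=20):
    """Plain k-means: assign points to centroids, then move the centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centroids


# Unlabelled fruits as (weight, diameter); the names are never used.
fruits = [(7, 2.0), (8, 2.1), (9, 2.2), (170, 7.5), (180, 8.0), (190, 8.3)]
centroids = k_means(fruits, 2, random.Random(0))

new_fruit = (175, 7.8)
group = nearest(centroids, new_fruit)
print(group == nearest(centroids, (180, 8.0)))  # True: grouped with the large fruits
```

The algorithm never learns that one group is "cherries" and the other "apples"; it only places the new fruit with the items it most resembles.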


Different tools used for pattern recognition in machine learning

  • Amazon Lex – A managed service provided by Amazon Web Services for building intelligent conversation agents such as chatbots by using text and speech recognition.
  • Google Cloud AutoML – This technology is used for building high-quality machine learning models with minimal machine learning expertise. It uses neural networks (RNN – recurrent neural networks) and reinforcement learning as a base for model construction.
  • RStudio – An integrated development environment for the R programming language, used for developing and testing pattern recognition models.
  • IBM Watson Studio – A tool provided by IBM for data analysis and machine learning. It is used for building and deploying machine learning models, including on a desktop.
  • Microsoft Azure Machine Learning Studio – Provided by Microsoft, this tool uses a drag-and-drop concept for building and deploying machine learning models. It offers a GUI (Graphical User Interface) based environment for model construction and usage.

Scope of pattern recognition in machine learning

  • Data Mining – It refers to the extraction of useful information from large amounts of data from heterogeneous sources. The meaningful data obtained from data mining techniques are used for making predictions and for data analysis.
  • Recommender Systems – Most of the websites dedicated to online shopping make use of recommender systems. These systems collect data related to each customer purchase and make suggestions using machine learning algorithms by identifying the trends in the pattern of customer purchases.
  • Image processing – Image processing is basically of two types: digital image processing and analog image processing. Digital image processing uses intelligent machine learning algorithms for enhancing the quality of images obtained from distant sources such as satellites.
  • Bioinformatics – It is a field of science that uses computational tools and software to make predictions relating to biological data. For example, suppose someone discovered a new protein in the lab but the sequence of the protein is not known. Using bioinformatics tools, the unknown protein is compared with a huge number of proteins stored in the database to predict a sequence based on similar patterns.
  • Analysis – Pattern recognition is used for identifying important data trends. These trends can be used for future predictions. Analysis is required in almost every domain, be it technical or non-technical. For example, the tweets made by a person on Twitter help in sentiment analysis by identifying the patterns in the posts using natural language processing.

Advantages of pattern recognition 

Using pattern recognition techniques provides a large number of benefits to an individual. It not only helps in the analysis of trends but also helps in making predictions.

  • It helps in the identification of objects at varying distances and angles.
  • Easy and highly automated.
  • It is not rocket science and does not require out-of-the-box thinking ability.
  • Highly useful in the finance industry to make valuable predictions regarding sales.
  • Efficient solutions to real-time problems.
  • Useful in the medical fields for forensic analysis and DNA (Deoxyribonucleic acid) sequencing.

Importance of pattern recognition in machine learning

  • Pattern recognition can pick up even small, hidden regularities in the data that are hard to trace by hand.
  • It helps in the classification of unseen data.
  • It makes suitable predictions using learning techniques.
  • It recognizes and identifies an object at varying distances.
  • It not only helps in the prediction of the unseen data but also helps in making useful suggestions.


Applications of pattern recognition 

  • Trend analysis – Pattern recognition helps in identifying the trend in the given data on which appropriate analysis can be done. For example, looking at the recent trends in the sales made by a particular company or organization, future sales can be predicted.
  • Assistance – Patterns are an integral part of our daily lives, and recognizing them provides immense help in our day-to-day activities. A large number of applications on the market today use machine learning algorithms to predict the presence of obstacles and alert the user to avoid mishaps.
  • E-Commerce – Visual search engines recognize the desired item based on its specifications and provide appropriate results. Most of the websites dedicated to online shopping make use of recommender systems. These systems collect data related to each customer purchase and make suggestions. All these tasks are accomplished by analyzing previous trends to make successful predictions.
  • Computer vision – The user interacts with the system by giving an image or video as the input. The machine compares it with thousands or maybe millions of images stored in its database, to find similar patterns. The extraction of the essential features is done by an algorithm that is mainly directed at grouping similar-looking objects and patterns. This is termed computer vision. Example: cancer detection.
  • Biometric devices – These devices provide authentication and security by making use of face recognition and fingerprint detection technologies. Behind the scenes, the base that enables technologies like face and fingerprint recognition is machine learning algorithms.

Machine learning is one of the buzzwords of the 21st century. It is highly in demand due to its applications and advantages, and it has changed the way many industries work. Machine learning has different fields and scopes, some of which include pattern recognition, data mining, analysis, etc. Pattern recognition in machine learning is widely used in almost every industry today, be it technical or non-technical. It has helped in the analysis and visualization of various trends. It has not only increased the efficiency and ease of analysis and prediction but has also increased the job opportunities in the field. Top-notch companies such as Microsoft, Google, and Amazon are looking for individuals skilled in the art of pattern recognition and data analysis for making useful predictions. Thus, we can conclude by saying that pattern recognition is one of the fastest-advancing fields in machine learning.

How AI can Improve the Healthcare Industry

Reading Time: 6 minutes

What has the data done?

The world today is driven by data. With every passing day, data is coming closer to us and getting involved with our lives. You cannot separate data and the economy. Even your country’s economic indicators are generated with the help of huge data and its processing. Data has become essential in all sectors of the economy, be it consumer goods, real estate, healthcare, manufacturing or any other industry. You need data to drive your business and more essentially drive growth. With more and more applications of data, the world is moving towards or to be more precise undergoing the biggest economic and technological transformations of all time.

What is AI?

AI stands for artificial intelligence. AI is a game-changer. It is the next big thing, and with its application some human tasks are being taken over by machines, robots, and similar innovations. Artificial intelligence uses a huge amount of data to process a task and helps a company do what is accurate and required. With the advent of artificial intelligence in the economy, everything has become data-oriented. From your food habits to your sleep pattern, everything is being tracked and translated into consumer behavior somewhere by someone. Artificial intelligence has caused a sudden disruption in the world of technology and it couldn’t have come at a better time. Today, corporations have become more consumer-oriented than ever, and to serve their consumers better, they need to know their consumers better. So, the behavior and consumption patterns of consumers are tracked regularly, and the products that best suit their needs are displayed on their screens in the form of ads, posts, etc. All of this involves a huge amount of data, and this data is put to good use by means of artificial intelligence.

The Healthcare Industry

The healthcare industry is huge. It is made up of multiple elements:

  1. Hospitals and clinics: These are the infrastructural components of the health care industry. They act as a starting point for all kinds of medical needs. Once you visit a doctor, you pay the hospital a fee. The doctor then recommends a certain number of tests and medicines, paving your way towards all the other elements of the healthcare industry. People who are frequent hospital visitors or are health-oriented are also driven towards the medical insurance industry. It all begins with the hospital.
  2. The Pharmaceutical giants: They are one of the most important elements of the health care industry. These companies produce drugs and medicines and decide a price point of the drug depending on various factors like demand, rarity, etc. These companies have huge revenues and the economy cannot do without these pharmaceutical companies.
  3. The Medical Insurance Agencies: The human body is always exposed to medical uncertainties. The fear of spending more when a crisis creeps in acts as bait for these companies to bring in customers. Individuals invest in medical insurance by paying a premium in installments and hence save a huge amount when a medical emergency kicks in. These days private companies also provide medical insurance to their employees and family members, thus driving huge revenues for companies working in the medical insurance domain.
  4. Pathology and Lab tests: The medical test market is huge. From blood tests to large CT scans, people visit such labs and get their tests done regularly thus making it a need of the hour. Most people get their tests done to keep up with their general well being or if their doctor has recommended them to get those tests done. Whatever may be the reason, these labs are an important element of the healthcare industry.

The Healthcare Industry and AI

Artificial Intelligence is the new butter to the bread of all the economic sectors. The healthcare industry, one of the major sections of the economy, has not been left untouched by it. With every passing day, the healthcare industry is growing. This growth translates to more people concerned about their health, which translates into more people entering the healthcare chain, i.e. going to the hospital and ending up spending money on the associated costs. With more and more people, more and more data is coming into the picture. And with the involvement of data, artificial intelligence is put to its best use for doing things better and just right. The application of artificial intelligence has proved to be symbiotic for both parties: the ones who are driving it and the ones who are availing the benefits of it. The healthcare industry forms almost 12% of many developed and developing economies and is making the best use of the resources available to it, which are stored in the form of data. The health data of billions is stored in the form of hospital records, medical prescriptions, pathology reports, etc. Artificial intelligence makes the best use of this data. Artificial Intelligence is driving the healthcare industry in the following ways:

Artificial Intelligence applications in healthcare

  1. Optimizing cost plans and structures: With the huge amount of data available, various hospitals can plan better on where to spend and when to curtail. Also, these units can analyze what their consumers want and for what they can charge a premium. With big data, analysis of various costs and formulating strategies for better functioning of these medical units has become a cakewalk. With the growing number of people, economies of scale can be achieved and the benefits of it can be delivered to the ones seeking services and also to the ones who are driving such plans with the help of huge data present with them. 
  2. Supply chain efficiencies for Pharma companies: With the availability of a huge amount of clinical and patient data, the drug manufacturers, with the help of big data, can easily analyze the medical conditions and histories of a certain group of individuals and what medicines they would need regularly. With AI, these giants segregate zones based on geopolitical boundaries, medical similarities, disease rarity, etc. and devise methods to deliver their medicines based on such parameters. Also, with the help of the huge data available to them, they use artificial intelligence to finalize price points of various drugs based on what diseases they cure. The pharma sector is making the best possible use of artificial intelligence, which in turn is making medicines accessible to patients and also helping the pharma companies to expand their market share.
  3. Healthcare Insurance market: A patient’s data is readily available these days. Based on the medical conditions of a particular person, a well-curated insurance plan is chalked out and then presented to them. People are generally lured by these policies, as at the end of the day we all want to save money. The medical insurance market has a huge database of people, their medical conditions, details of their family members, etc. This gives insurers an edge, and artificial intelligence makes the best use of the data available to them.
  4. The Fitness Industry: The fitness industry is a multi-billion-dollar industry. People keep tabs on their health and well-being all the time. Applications like Google Fit are installed on phones, and these apps track all your information, be it your heart rate, your sleep cycle or your food habits. They generate a huge amount of data every second. This data can be both generic and individual-specific. Corporations track consumer needs through such applications and then use these insights to provide tailor-made solutions, earning a truckload of money in the process. With the help of artificial intelligence, the healthcare industry is operating digitally, and successfully at that.
  5. The Government: The government makes use of this health data to come up with various schemes. With the help of artificial intelligence, the government can also track the performance of the healthcare industry, which translates into an economic indicator of growth and progress. The Government also uses this huge amount of data to keep constant tabs on the number of hospitals in a particular area, the medical supplies needed on a regular basis, the availability of doctors and nurses in a particular area, the outbreak of various calamities and communicable diseases, the eradication of serious diseases like polio, malaria, dengue, etc. With the help of this data, the government educates a lot of people and helps them in the upkeep of their well-being.

The healthcare industry and artificial intelligence

With artificial intelligence coming into play, the healthcare industry has drastically evolved. The whole industry has been swept up by the wave of digitization, and this big wave has helped companies to track information and bring in data points every minute. The healthcare industry has billions and even trillions of dollars involved. This industry comes with huge chunks of data. One can track real-time information with the help of this evolution. Hospitals are getting more efficient, planning out their staff better, becoming more available for people so that they can reap the best out of it, and a lot more. The pharma companies are generating huge revenues by supplying medicines to countries where they have a demand and hence channelizing their supplies and revenue streams. The insurance market is becoming bigger and better. The availability of data is being translated into sales and revenues every hour. The healthcare industry and the amount of data involved in it take the whole world by surprise. Also, the money that this application of data and artificial intelligence is driving is enormous. Whether artificial intelligence will prove to be a boon or a bane is the debate of the future, but right now it is making the world fast and the health space small and close.

A* Search Algorithm in Artificial Intelligence

Reading Time: 6 minutes

Intelligence is the strength of the human species; we have used it to improve our lives. Then, we created the concept of artificial intelligence, to amplify human intelligence and to develop and flourish civilizations like never before. AI helps us solve problems of various complexities. Computational problems like path search problems, where you need to find a path from one point to another, say, point A to point B, can be solved using AI. Sometimes you need to solve them by mapping those problems to graphs, where all the possible outcomes are represented by nodes. The A* algorithm comes up as an answer to these problems.

Created as part of the Shakey project, which aimed to build a mobile robot with the artificial intelligence to plan its own actions, A* was initially designed as a general graph traversal algorithm. It is widely used in solving pathfinding problems in video games. Because of its flexibility and versatility, it can be used in a wide range of contexts. A* is formulated on weighted graphs, which means it can find the best path, the one with the smallest cost in terms of distance or time. This makes A* an informed, best-first search algorithm. Let us have a detailed look into the various aspects of A*.

What is A-Star (A*) Algorithm in Artificial Intelligence?

The most important advantage of the A* search algorithm, which separates it from other traversal techniques, is that it has a brain: a heuristic that estimates how far each node still is from the goal. This makes A* very smart and pushes it much ahead of other conventional algorithms.

Consider the diagram below:

A* Algorithm

Let’s try to understand basic AI concepts and to comprehend how the A* algorithm works. Imagine a huge maze, one so big that it takes hours to reach the endpoint manually. Once you complete it on foot, you need to go for another one. This implies that you would end up investing a lot of time and effort to find the possible paths in this maze. Now, you want to make it less time-consuming. To make it easier, we will consider this maze as a search problem and will try to apply the same approach to other mazes we might encounter in due course, provided they follow the same structure and rules.

As the first step to convert this maze into a search problem, we need to define these six things. 

  1. A set of prospective states we might be in
  2. A beginning and end state
  3. A way to decide if we’ve reached the endpoint
  4. A set of actions in case of possible direction/path changes
  5. A function that advises us about the result of an action 
  6. A set of costs incurring in different states/paths of movement

To solve the problem, we need to map the intersections to nodes (denoted by the red dots) and all the possible movements between them to edges (denoted by the blue lines).

A denotes the starting point and B denotes the endpoint. We define the starting and endpoint at the nodes A and B respectively.

If we use an uninformed search algorithm, it would be like searching for a path blindly, while an informed algorithm for a search problem would take the path that brings you closer to your destination. For instance, consider Rubik’s cube; it has many prospective states that you can be in and this makes the solution very difficult. This calls for the use of a guided search algorithm to find a solution. This explains the importance of A*.

Unlike uninformed algorithms, A* takes a step only if its cost functions rate that step as the most promising one available. With a suitable heuristic, this means it avoids exploring many non-optimal steps. This is why A* is a popular choice for AI systems that replicate the real world – like video games and machine learning.

Method Steps to use A* Algorithm

  1. Firstly, add the beginning node to the open list.
  2. Then repeat the following steps:

– In the open list, find the square with the lowest F cost. This becomes the current square.

– Move the current square to the closed list.

– For each of the 8 squares adjacent to the current square:

  • Ignore it if it is on the closed list, or if it is not walkable. Otherwise, do the following.
  • If it is not on the open list, add it. Make the current square the parent of this square, and record the F, G and H costs of the square.
  • If it is already on the open list, use the G cost to check whether the path through the current square is better. The lower the G cost, the better the path. If this path is better, change the parent of the square to the current square and recalculate its G and F scores.

You stop when:

  • The target square is added to the closed list, in which case the path has been found, or
  • The open list is empty and the target square was not found, in which case there is no path.

  3. Now you can save the path: work backwards from the target square, going from each square to its parent, until you reach the starting square. You’ve found your path now.
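
The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not in the text: the grid is a list of strings where "#" marks an unwalkable square, straight moves cost 1 and diagonal moves cost the square root of 2, diagonal moves may cut past the corner of a wall, and the H cost is the octile distance. A heap plays the role of the open list.

```python
import heapq
import math


def octile(a, b):
    """H cost for 8-way movement: straight steps cost 1, diagonal steps sqrt(2)."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)


def a_star(grid, start, goal):
    """Return the list of squares from start to goal, or None if there is no path."""
    rows, cols = len(grid), len(grid[0])
    open_heap = [(octile(start, goal), start)]  # open list of (F cost, square)
    g_cost = {start: 0.0}
    parent = {start: None}
    closed = set()

    while open_heap:
        _, current = heapq.heappop(open_heap)  # square with the lowest F cost
        if current in closed:
            continue  # stale entry left behind after a cheaper path was found
        closed.add(current)
        if current == goal:
            path = []
            while current is not None:  # walk the parents back to the start
                path.append(current)
                current = parent[current]
            return path[::-1]

        r, c = current
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nxt = (r + dr, c + dc)
                if nxt == current or nxt in closed:
                    continue
                if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                    continue
                if grid[nxt[0]][nxt[1]] == "#":  # not walkable
                    continue
                new_g = g_cost[current] + math.hypot(dr, dc)
                if new_g < g_cost.get(nxt, math.inf):  # new square or better path
                    g_cost[nxt] = new_g
                    parent[nxt] = current
                    heapq.heappush(open_heap, (new_g + octile(nxt, goal), nxt))
    return None  # the open list is empty and the target was never reached


grid = [
    "...#....",
    "...#....",
    "...#....",
    "........",
]
path = a_star(grid, (0, 0), (2, 6))
print(path)

blocked = ["..#..", "..#..", "..#.."]
print(a_star(blocked, (0, 0), (0, 4)))  # None
```

Instead of updating a square that is already on the open list in place, this sketch pushes a second, cheaper entry and skips the stale one when it is popped, which is a common simplification when a heap is used.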

Why is A* Search Algorithm Preferred? 

It’s easy to give movement to objects. But pathfinding is not simple. It is a complex exercise. The following situation explains it. 

Preferred A* Algorithm

The task is to take the unit you see at the bottom of the diagram to the top of it. You can see that there is nothing to indicate that the object should not take the path denoted with pink lines. So it chooses to move that way. When it reaches the top, it has to change its direction because of the ‘U’ shaped obstacle. It then changes direction and goes around the obstacle to reach the top. In contrast to this, A* would have scanned the area above the object and found a short path (denoted with blue lines). Thus, pathfinder algorithms like A* help you plan things rather than waiting until you discover the problem. They act proactively rather than reacting to a situation. The disadvantage is that A* is a bit slower than the other algorithms. You can use a combination of both to achieve better results: pathfinding algorithms for the bigger picture and long paths with obstacles that change slowly, and movement algorithms for the local picture and short paths with obstacles that change faster.

A* and Its Basic Concepts

The A* algorithm works based on heuristic methods, and this helps it achieve optimality. A* is a variant of the best-first algorithm. Optimality means that an algorithm finds the best possible solution to a problem. Such algorithms also offer completeness: if any solution to the problem exists, the algorithm will definitely find it.

When A* enters into a problem, it first calculates the cost of travelling to the neighbouring nodes and chooses the node with the lowest cost. If f(n) denotes this cost, A* chooses the node with the lowest f(n) value. Here ‘n’ denotes a neighbouring node. The calculation of the value can be done as shown below:

f(n) = g(n) + h(n)

g(n) = the cost of the shortest path found so far from the starting node to node n

h(n) = the heuristic estimate of the cost from node n to the goal

The heuristic value plays an important role in the efficiency of the A* algorithm. To find the best solution, you might have to use a different heuristic function for each type of problem. However, creating these functions is a difficult task, and this is a basic problem we face in AI. 
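To make the rule f(n) = g(n) + h(n) concrete, here is a minimal sketch of A* on a small grid. Several things are assumptions made for illustration and do not come from the article: the grid layout (a ‘U’ shaped wall loosely modelled on the diagram), movement in four directions at a cost of 1 per step, and the Manhattan distance as h(n).

```python
import heapq

def manhattan(a, b):
    """h(n): grid distance that ignores walls (assumed heuristic)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, goal):
    """Return a shortest path on a grid of 0 (free) and 1 (wall), or None."""
    rows, cols = len(grid), len(grid[0])
    g = {start: 0}                                  # g(n): best known cost from start
    parents = {}                                    # parent link for each reached square
    open_heap = [(manhattan(start, goal), start)]   # ordered by f(n) = g(n) + h(n)
    closed = set()
    while open_heap:
        _, node = heapq.heappop(open_heap)          # node with the lowest f(n)
        if node == goal:
            path = [node]
            while path[-1] != start:                # walk parents back to the start
                path.append(parents[path[-1]])
            return path[::-1]
        if node in closed:
            continue                                # stale heap entry, already expanded
        closed.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g[node] + 1
                if new_g < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = new_g
                    parents[(nr, nc)] = node
                    f = new_g + manhattan((nr, nc), goal)
                    heapq.heappush(open_heap, (f, (nr, nc)))
    return None

# A 'U' shaped wall opening towards the start at the bottom.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
path = a_star(grid, start=(3, 2), goal=(0, 2))
print(path)
```

The pocket inside the ‘U’ is a dead end, so the returned path goes around the wall. A priority queue keyed on f(n) is one common way to pick the cheapest node; other data structures would work as well.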

What is a Heuristic Function?

A heuristic function, or simply a heuristic, is a function that helps rank the alternatives in a search algorithm at each of its steps. It can either produce a result on its own or work in conjunction with a given algorithm to create a result. Essentially, a heuristic function helps algorithms make the best decision faster and more efficiently. The ranking is based on the best available information and helps the algorithm decide which branch to follow. Admissibility and consistency are the two fundamental properties of a heuristic function.

Heuristic Function

Admissibility of the Heuristic Function

A heuristic function is admissible if it never overestimates the real distance between a node ‘n’ and the end node. If it does overestimate, but by no more than some amount ‘d’, then ‘d’ also bounds the accuracy of the solution: the path A* finds is at most ‘d’ longer than the best one. 

Consistency of the Heuristic Function

A heuristic function is consistent if its estimate for a node (n) is equal to or less than the cost of reaching a neighbour plus the heuristic estimate from that neighbour to the goal.
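The consistency condition can be checked mechanically on a small example. The sketch below assumes an open 4×4 grid with four-directional moves of cost 1; the Manhattan distance and the deliberately inflated variant are hypothetical heuristics chosen for illustration.

```python
def manhattan(a, b):
    """Grid distance between two squares."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def inflated(a, b):
    """A deliberately exaggerated heuristic: three times the grid distance."""
    return 3 * manhattan(a, b)

def is_consistent(h, nodes, neighbours, step_cost, goal):
    """Check h(goal) == 0 and h(n) <= step_cost + h(m) for every edge n -> m."""
    if h(goal, goal) != 0:
        return False
    return all(
        h(n, goal) <= step_cost + h(m, goal)
        for n in nodes
        for m in neighbours(n)
    )

SIZE = 4
nodes = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def neighbours(node):
    """Squares reachable in one step on the open grid."""
    r, c = node
    candidates = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
    return [(nr, nc) for nr, nc in candidates if 0 <= nr < SIZE and 0 <= nc < SIZE]

goal = (0, 0)
print(is_consistent(manhattan, nodes, neighbours, 1, goal))  # True
print(is_consistent(inflated, nodes, neighbours, 1, goal))   # False
```

The inflated heuristic fails because its estimate drops by 3 across a step that costs only 1; it also overestimates the true distance, so it is not admissible either.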

A* is indeed a very powerful algorithm used to increase the performance of artificial intelligence. It is one of the most popular search algorithms in AI, and the sky is the limit when it comes to its potential. However, the efficiency of the A* algorithm depends heavily on the quality of its heuristic function. That combination of power and flexibility is why it is preferred and used in many software systems. From search optimization to games, robotics and machine learning, the A* algorithm is an integral part of many smart programs. 

Reinforcement Machine Learning

Reading Time: 6 minutes

You might have seen robots doing mundane tasks like cleaning a room or serving beer to people. However, these actions are usually remote-controlled by a human. These robots are physically capable of following a set of instructions given to them, but they lack the basic intelligence to decide and do things by themselves. Embedding intelligence is a software challenge, and reinforcement learning, a subfield of machine learning, provides a promising direction towards developing intelligent robots. 

Reinforcement learning is concerned with how an agent uses feedback to evaluate its actions and plan future actions in a given environment to maximize the results. In reinforcement learning, the agent is empowered to decide how to perform a task, which makes it different from other machine learning models where the agent blindly follows a set of instructions given to it. The machine acts on its own, not according to a set of pre-written commands. Thus, reinforcement learning denotes those algorithms that work from the feedback on their actions and decide how to accomplish a complex task. 

These algorithms are rewarded when they make the right decision and are punished when they make the wrong decision. Under favourable conditions, they can achieve superhuman performance. Here is an introduction to reinforcement machine learning and its applications. 

Importance of Reinforcement Learning

We need technological assistance to simplify life, improve productivity and make better business decisions. To achieve this goal, we need intelligent machines. While it is easy to write programs for simple tasks, we need a way to build machines that carry out complex tasks. The way to achieve this is to create machines that are capable of learning things by themselves. Reinforcement learning does this.

Reinforcement Learning Basics

Basics of reinforcement machine learning include:

  • An Input, an initial state, from which the model starts an action
  • Outputs – there could be many possible solutions to a given problem, which means there could be many outputs
  • Training – the model is trained on the input, and the user can decide to either reward or punish the model depending on the output. The model decides the best solution based on the maximum reward.
  • The model considers the rewards and punishments and continues to learn through them.

Reinforcement Learning: Types 

Reinforcement is of two different types: positive and negative

  • Positive Reinforcement

A reinforcement is considered positive when a given event has a positive effect such as an increase in the frequency and strength of the behaviour. 

Positive reinforcement has the following advantages:

  • It gives the maximum possible performance
  • It sustains the change for a long time

Positive reinforcement has a disadvantage as well – if the reinforcement is too much, it could cause overload and weaken the result.

  • Negative Reinforcement

A reinforcement is considered negative when a behaviour is strengthened because a negative condition is stopped or dodged.

Deep Reinforcement Learning

Deep learning uses a training set to learn and then applies that learning to a new set of data. Reinforcement learning is a bit different: it is a dynamic process of learning through continuous feedback about its actions and adjusting future actions accordingly to acquire the maximum reward. Deep reinforcement learning combines the two by using a deep neural network inside a reinforcement learning agent.

Fields of Applications 

  • Gaming
  • Robotics
  • E-commerce
  • Self-driving cars
  • Industrial automation
  • Stock price forecasting
  • News
  • Design training systems
  • Web search engines like Google
  • Photo tagging applications
  • Spam detector applications
  • Weather forecasting application

Definitions in Reinforcement Learning

There are several concepts and definitions in reinforcement learning. Major ones are listed below:

Agent: Agent is the one that takes actions. For instance, Super Mario is an agent as it navigates a video game. 

Action (A): It is the collection of all possible moves any agent is capable of making.  It is self-explanatory, and the agents can choose from a set of possible actions. 

Discount factor: Future rewards are multiplied by a discount factor so that they count for less than immediate rewards. The lower the discount factor, the more the agent favours short-term gratification over delayed gratification. 

Environment: Just as the word implies, the ‘environment’ is the surroundings through which the agents move.  The environment considers the action and the current state of the agent as the input and grants a reward for the agent in the next state, and that is the output.

State: This refers to the current situation where the agent places itself – such as a specific place or action. A state relates the agent to other relevant things such as obstacles, rewards, enemies and tools. 

Reward: This denotes the feedback given for an action taken by the agent. The feedback is an evaluation of the agent’s action and decides if it is a success or failure. 

Policy: This denotes the agent’s strategy to decide the next course of action. Each policy is taken based on the current state. It aims to do those actions that bring in the highest reward. 

Value: Denotes expected long-term return to the current state, in contrast to the short-term rewards.  

Q-value or action-value: It is very similar to the concept of value, except that it considers the current action as well. The Q-value maps a state and an action to rewards.

Trajectory: This denotes several states lined up in a sequence and the actions that could influence them. 
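The interplay between reward, discount factor and value in the definitions above can be illustrated with a small calculation. This is a sketch under assumed numbers: the two reward sequences and the discount factors 0.9 and 0.5 are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma ** k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total   # fold in one step of discounting
    return total

patient = [0, 0, 10]   # hypothetical trajectory: large reward after two steps
greedy = [3, 0, 0]     # hypothetical trajectory: small reward right away

for gamma in (0.9, 0.5):
    print(gamma, discounted_return(patient, gamma), discounted_return(greedy, gamma))
```

With a discount factor of 0.9 the patient trajectory is worth more, while with 0.5 the immediate reward wins, which is the short-term preference the discount factor is meant to control.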

 *Image credit: Sutton & Barto 

From the feedback loop given above, an agent does a certain action based on the environment it is in, and this constitutes the state. The agent’s action and the environment are then considered and feedback is generated, which decides if that action is a success or failure. The goal could be different in different scenarios. 

  • The goal in a video game may be to finish the game with maximum points. Hence, each additional point gained in the game will affect the subsequent action of the agent.
  • The goal in the real world may be to travel between two points, say, A to B. Every small unit the robot moves closer towards point B could be counted as points.
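The feedback loop and the second goal above (moving from A to B, with points for every unit closer) can be sketched as a tiny program. Everything specific here is an assumption for illustration: the corridor of six positions, the reward of +1 for a step towards B and -1 for a step away, and the use of tabular Q-learning with the chosen learning rate, discount factor and exploration rate.

```python
import random

ACTIONS = (-1, +1)   # step left or step right along the corridor

def step(state, action, goal):
    """Environment: takes the state and action, returns the next state and a reward."""
    next_state = max(0, min(goal, state + action))
    reward = float(next_state - state)   # +1 closer to B, -1 further away, 0 at the wall
    return next_state, reward

def train(goal=5, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q-values for walking from position 0 (A) to position `goal` (B)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(goal + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                         # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])   # exploit
            next_state, reward = step(state, action, goal)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate towards reward + discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(5)]
print(policy)
```

After training, the greedy policy steps towards B from every position. The agent is never told which direction is correct; it only receives the reward for each move.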

Pros and Cons of Reinforcement Machine Learning


Pros

  • It helps to solve very complex problems that conventional techniques fail to solve.
  • It gives long-term results that are otherwise very difficult to accomplish.
  • The model learns in a way that resembles human learning and hence can get close to perfection in its actions.
  • The model is capable of learning from its errors and correcting them, so there is very little chance of repeating the same error.
  • It learns from experience, and hence a dataset is not needed to guide its actions.
  • It provides scope for an intelligent examination of the situation-action relation and creates the ideal behaviour within a given context, which leads to maximum performance.


Cons

  • Too much reinforcement may cause an overload, which could weaken the results.
  • Reinforcement learning is preferred for solving complex problems, not simple ones.
  • It requires plenty of data and involves a lot of computation.
  • The maintenance cost is high.

Challenges Faced by Reinforcement Learning

As mentioned earlier, reinforcement learning uses a feedback method to take the best possible actions. This makes it suitable for finding solutions to many complex problems, and it has found application in many domains. But it faces many challenges as well. The main one is creating the simulation environment, which depends a lot on the chosen task. In chess or Go, where the model has to perform superhuman tasks, the environment is simple. However, it is more complex when you consider a real-life application like designing an autonomous car model, where you need a highly realistic simulator. This is crucial because you are eventually going to drive the car on the street. The model must be capable of figuring out how and when to apply the brake or how to avoid a collision. That might not be a problem in a virtual world, but it becomes a hard problem to crack when you need to hit the real world. Things get tricky when you transfer the model from the safe training environment into the real world.

Another challenge lies in tweaking and scaling the neural network that controls the agent. It is complex because the only way to communicate with the network is through rewards and penalties. A major risk here is catastrophic forgetting: as the network acquires new knowledge, some old knowledge may get erased. 

Another challenge is that the agent sometimes settles for a way of doing the task that works but is not the optimal one. For example, an agent that is expected to walk may learn to just jump along like a kangaroo instead.  

Last but not least, a problem can arise where the agent just optimizes for the reward without actually doing the task. Consider the OpenAI video as an example of this. In this video, the agent learned to bag the rewards without completing the race. 

There is no doubt that reinforcement machine learning has huge potential to change the world. The biggest advantage of this cutting-edge technology is that it is capable of learning by itself through trial and error, just like human beings. It makes mistakes, corrects them, and learns from them to avoid making the same mistake in the future. It can be combined with other machine learning technologies for better performance. No wonder it is used in many real-world applications such as robotics and gaming, to mention a few. It is a good way to bring creativity and innovation to the performance of a task. Reinforcement learning surely has the potential to become a revolutionary technology in the future development of artificial intelligence. 


Top 5 Use Cases: How Artificial Intelligence and Big Data Drive Success

Reading Time: 6 minutes

There is one thing that you will agree with me on: the fourth industrial revolution is here, thanks to AI and Big Data. In fact, according to a report by CNN, there is no accurate figure for the amount of data that exists. It is estimated that more than 90% of all data existing right now was created last year, and it keeps increasing. 

Even though this is a lot of data, it is useless in its raw form. This is where AI comes in to make sense of the data. As a result, more companies are now embracing AI. According to a survey by NewVantage Partners, more than 95% of the Fortune 1000 executives surveyed, across sixty companies, were investing in AI and Big Data. Artificial intelligence and Big Data have revolutionized every industry, and companies are continually seeking out ways to make AI and Big Data work for them.  

AI, Big Data and soft drinks 

Every day, people consume 1.9 billion servings of Coca-Cola drinks. With such a huge share of the beverage market, Coca-Cola generates a lot of data that it uses to make strategic decisions. Coca-Cola was one of the earliest non-IT companies to adopt AI and Big Data. 

Coca-Cola is known for investing heavily in research and development. After the successful launch of its self-service soft drink fountains, Coca-Cola gathered the data they produced. It then combined that data with AI and, using the insights acquired, launched a new beverage brand: Cherry Sprite. Cherry Sprite was based on the data collected from customers mixing their own drinks to create the perfect beverage.

AI in softdrink industry

AI, Big Data and Education

Elsevier is one of the global leaders in the publishing of medical and scientific information. Over the years Elsevier has leveraged AI and Big Data to improve its operations. Due to the massive amount of data that the company has collected in its 140-year existence it has built advanced analytics systems using AI and Big Data. 

Traditionally, getting information from any of their publications meant that you had to get a paper-based publication or book. However, as Internet usage spread, Elsevier decided to leverage this and digitize its literature. After digitizing the company saw another challenge – information overload. It is estimated that data in the world doubles every two years. 

Not all of this data is useful, and Elsevier realized that it needed to find a way to cut through the noise and give its readers valuable and actionable data. This is where AI and Big Data came in. AI and Big Data were used to study the reading habits of its clients and analyze which kinds of data were being consumed the most. Using these insights, the company then deployed machine learning to present relevant information to its readers.

The case for AI, Big Data and Fake News

You have probably heard of fake news, where people post false information about others on the Internet. The news could be something as trivial as rumours about one’s dating life or something graver, like the death of the same person. Due to the many media outlets like social media, blogs and even traditional media, fake news can spread like wildfire, and the truth only comes out later, after the damage has been done. 

Fake news is a significant issue, and one Google-funded company is leveraging AI and Big Data to combat it. Veri-Flix, a Belgian company, is leveraging AI, with a focus on machine learning, to fight fake news. These days citizen journalism has become a force that even mainstream media institutions cannot ignore. Anyone can upload a video and make fake news about it. By employing machine learning, the company can scan user-uploaded videos and determine their credibility. This is done by collecting data like video content, time-stamps, and geolocation. Using this data, it tags several videos and compares them against each other. After garnering an award and funding from Google, the company’s technology is being put to the test at the country’s largest media station. 

AI in entertainment

AI, Big Data and food

McDonald’s, one of the biggest fast-food chains on the planet, is one company that needs no introduction. For a long time, the fast-food chain did not see the need to implement AI and Big Data, but upon seeing the success that its competitors were having, it revised its strategy. As the largest fast-food chain in the world, serving close to 70 million people daily, it clearly generates a lot of data. Moreover, it has leveraged that data in many ways, as shown below:

        Use of digital menus

Digital menus are a common phenomenon these days, but McDonald’s has taken it a notch higher by introducing digital menus that change based on real-time data analysis. The menus vary also based on parameters like weather and time of day. Due to this innovation, they recorded a 3% increase in sales in Canada. 

        Customer Experience

McDonald’s is leveraging its mobile application to make it a win-win situation for the company and its clients. Users of the app get various benefits like:

  • Exclusive deals on the app.
  • Avoiding long queues at the counter and drive-thru 

On the other hand, when customers pay through the application, McDonald’s gets not only money but also vital customer data that includes metrics like:

  • Where and when the client goes to the restaurant
  • The frequency of their visits
  • Preference between a drive-thru or a restaurant

By using this data, McDonald’s can make recommendations and promote deals to increase sales. As a result of this data, they have noticed a more than 30% increase in sales in Japan for clients that used the app. 

Autonomous ships, Big Data and AI

Most people know that the future lies in autonomous automobiles, but not many people know that there are also plans to launch autonomous ships. This is due to a collaboration between Google and Rolls-Royce to create autonomous and smart ships. Rolls-Royce will be using the Machine Learning Engine on Google Cloud in its applications to make its vision of smarter and autonomous ships come true. 

At first, AI algorithms will be trained using machine learning to identify objects that can be encountered at sea and classify them based on the hazard they may pose. The machine learning algorithms are the ones currently used by Google’s voice and image search applications. They will also be augmented by massive data sets produced by devices such as sensors and cameras on vessels. The cloud-based AI and Big Data application will enable data to be shared in real time with any ship and with on-shore control centres. 

Healthcare, Big Data and AI

One of the issues that many healthcare systems face is matching staffing volumes to patient numbers. At one time you can have very few patients and a huge staff roster. During other times, you have an overflow of patients and a strained workforce. So how do you solve this problem? Embrace AI and Big Data by following the example set by some hospitals in Paris.

Four hospitals in Paris have managed to leverage AI and Big Data to enable nurses, doctors, and hospital administrators to forecast admission and visit rates for two weeks. This enables them to draft in extra staff when they expect high patient volumes leading to reduced wait times and better-quality care. 

So how does the system work? Using an open-source AI Analytics platform, the hospitals compiled admission data for the last ten years and external data sets like flu patterns, weather and public holidays. The insights were then used to predict admission rates at different times. Apart from just being used to predict admission rates, such data can be used to reduce wastage and enhance healthcare delivery by forecasting the demand for services. 

Big Data, AI and Security

When it comes to security screening, most of us expect to find a security officer screening each person individually, face to face. Although customs and immigration officers are highly trained to detect someone who is lying about their intentions, mistakes do still happen. Also, humans get tired and can be distracted, leading to errors. So how do we avoid such human errors? Apply AI and Big Data to screen passengers. 

Homeland Security has developed a new system called AVATAR that screens people’s facial expressions and body gestures and picks up small variations that may raise suspicion. The system has a screen with a virtual face that asks the passenger questions. It monitors the person’s responses for changes in voice tone as well as differences in what was said. The data collected is compared against a database of factors that show someone might be lying. If the passenger is flagged as suspicious, they are highlighted for further inspection.

AI and Big Data are indeed the fourth industrial revolution. The potential for AI and Big Data is endless. No industry has been left undisrupted by AI and Big Data, and the future belongs to those that leverage them to their benefit. 

Top 5 Sources For Analytics and Machine Learning Datasets

Reading Time: 5 minutes

Machine learning becomes engaging when we face various challenges, and thus finding suitable datasets relevant to the use case is essential. A dataset is characterised by its flexibility and size. Flexibility refers to the number of tasks that it supports. For example, Microsoft’s COCO (Common Objects in Context) is used for object classification, detection, and segmentation. Add a bunch of captions for the same images, and we can use it as a dataset for an image caption generator as well. That’s the power of a robust dataset. When we are just starting out, we will be working with some of the small and standard machine learning datasets like CIFAR-10, MNIST, Iris, etc. These datasets are bundled with many libraries these days and can be loaded quickly. Keras and scikit-learn both provide options for this.

Machine Learning: Important Dataset Sources

Let us begin by finding machine learning datasets that are problem-specific, and hopefully cleaned and pre-processed.

It surely is a strenuous task to find specific datasets like MS-COCO for all varieties of problems. Therefore, we need to be intelligent about how we use datasets. For example, Wikipedia is probably the best dataset there is for NLP tasks. In this article, we discuss some of the various sources for machine learning datasets, and how we can proceed further with them. A word of caution: read the terms and conditions that each of these datasets imposes carefully, and follow them. This is in everyone’s best interest.

machine learning datasets

1. Google’s Datasets Search Engine:

Google has been the search engine giant, and they helped all the ML practitioners out there by doing what they are legends at, helping us find datasets. The search engine does a fabulous job at getting datasets related to the keywords from various sources, including government websites, Kaggle, and other open-source repositories.

2. .gov Datasets:

With the United States, China and many more countries becoming AI superpowers, data is being democratised. The rules and regulations related to these datasets are usually stringent as they are actual data collected from various sectors of a nation. Thus, cautious use is recommended. We list some of the countries that are openly sharing their datasets.

Indian Government Dataset

Australian Government Dataset

EU Open Data Portal

New Zealand’s Government Dataset

Singapore Government Dataset

3. Kaggle Datasets

Kaggle is known for hosting machine learning and deep learning challenges. The relevance of Kaggle in this context is that it provides datasets and, at the same time, a community of learners and ML practitioners whose work can help us with our progress. Each challenge has a specific dataset, and it is usually cleaned, so we don’t necessarily have to do the bland work of cleaning and can instead focus on refining the algorithm. The datasets are easily downloadable. Under the resources section, there are prerequisites and links to learning material, which help us whenever we are stuck with either the algorithm or the implementation. Kaggle is a fantastic website for beginners to venture into applications of machine learning and deep learning, and a detailed resource pool for intermediate practitioners of machine learning.

4. Amazon Datasets (Registry of Open Data on AWS)

Amazon has listed some of the datasets available on its servers as publicly accessible. Therefore, when using AWS resources for calibrating and tweaking models, using these locally available datasets will speed up the data loading process by tens of times. The registry contains several datasets classified according to the field of application, like satellite images, ecological resources, etc.

5. UCI Machine Learning Repository

UCI Machine Learning Repository provides easy to use and cleaned datasets. These have been the go-to datasets for a long time in academia.


6. Yahoo WebScope

An exciting feature of this website is that it lists the papers that used each dataset. Therefore, all research scientists and people from academia will find this resource handy. The datasets available cannot be used for commercial purposes. For more details, check the websites of the datasets provided.

7. Datasets subreddit

The subreddit can be used as a secondary guide when all other options lead nowhere. People usually discuss the various available datasets and how to use existing datasets for new tasks. A lot of insights regarding the necessary tweaking required for datasets to work in different environments can be obtained as well. Overall, this should be the last resource point for datasets.

Let’s focus on datasets specific to the major domains that have seen accelerated progress in the last two decades. Having domain-specific datasets available enhances the robustness of the model, and thus more realistic and accurate results are possible. The areas include computer vision, NLP, and data analytics. 

Datasets for other applications 

Computer Vision Datasets

There are several computer vision datasets available. The choice of dataset depends on the level of competence we are working with. The pre-loaded datasets in Keras and scikit-learn are sufficient for learning, experimenting and implementing new models. The downside of these datasets is that the chances of overfitting the model are high due to their low complexity. Therefore, intermediate ML practitioners and organisations solving specific problems can refer to various sources:

A variety of resources and datasets are available on the website. It lists most of the open-source datasets and redirects the user to the dataset’s webpage. The datasets available can be used for classification, detection, segmentation, image captioning and many more challenging tasks.

This website lists out almost all the available datasets. It makes it easy to find relevant datasets by providing the option of searching with the help of tags associated with each dataset. We highly recommend our readers to try this website out.

Natural Language Processing

NLP is growing at a phenomenal pace, and recently language modelling has had its ImageNet moment, wherein people can start building applications with state-of-the-art conversational NLP agents. When it comes to NLP, several scenarios require datasets catered to a specific task. NLP deals with sentiment analysis, audio processing, translation, and many more challenging tasks. Therefore, it is necessary to have a massive list of datasets:

The majority of the datasets in the domain are listed in the following GitHub repository.

The datasets on this website are cleaned and provide a vast database to choose from. The appealing and easy-to-use interface makes this a highly recommended choice.

Statistics and Data Science 

Data science covers a range of tasks including creating recommendation engines, predicting parameters from data such as time-series data, and doing exploratory and analytical research. Small organisations and individual practitioners don’t have what the big giants have, that is, the data, and hence open datasets such as these are a huge boon for creating actual models that reflect real data and not simulated data. There are various datasets available for specific tasks, and it’s a wonderful resource point. These are benchmark datasets and can be used for comparing the results of the model built with the benchmark results. 

This is an exhaustive list of datasets for machine learning, analytics, and other applications. We wish you the best of luck while implementing models. Also, we hope you come up with models that can match the benchmark results.


4 Reasons Why a US University Should Be Your Top Choice for Data Science & AI Programs

Reading Time: 4 minutes

The data revolution is well underway. As the demand for data scientists and AI professionals grows by leaps and bounds, we are witnessing a rising number of courses promising to “turn you into a data scientist” almost proportionately to the explosive job market demand. 

Choosing amongst this plethora of courses can prove to be difficult, especially if you are not clear on what you are looking for. While your needs may vary, here’s a piece of general advice: look out for courses that get you job-ready.

Job-readiness forms the difference between taking a course up casually to gain a few skills vs taking a course to achieve positive outcomes from the time you invest in it. Job-readiness could be reflected in how hands-on the program is, how it supports you in achieving a smooth transition, and even how it adds on to your employability in the industry. 

With that said, there are distinct advantages of choosing a course from a US University, not the least of which is enhanced job-readiness in the global market. 

So, let’s take a look at a few reasons why a US University could be your best bet for data science and artificial intelligence courses:   

  1. U.S universities hold an excellent international reputation 

It is a widely known fact that U.S universities maintain a definitive presence amongst the top-ranked universities in the world. In fact, on the QS list of the top 100 universities in the world, American universities made up the largest share of list-makers (a whopping 29%). 

Often considered the holy grail of higher education by international students, the U.S owes a large part of its reputation to high education standards, academic rigour, and advanced research capabilities. 

This strong brand name and reputation are likely to rub off on your resume as well, lending you credibility and recognition wherever you may be in the world. 

Did you know? The University of Texas at Austin has been ranked at #2 for Business Analytics and #8 for Artificial Intelligence world-wide. 

  2. The U.S is a front-runner in the AI, Machine Learning and Data Science space

Not just a global leader in Science and Technology, the United States also is currently leading the pack when it comes to pioneering technological domains of the 21st century, namely Artificial Intelligence, Machine Learning and Data Science. 

In a 2017 report, the U.S came out on top in the list of countries leading in AI research; and of the top 100 AI startups in 2019, 77 belonged to the United States. Couple this with the presence of 7 major tech hubs (including the ubiquitous Silicon Valley), and it should hardly come as a surprise that the U.S is a centre of innovation for advanced technologies. 

This tech dominance is reflected in the courses designed by U.S universities too, and you can rely on them to acquaint you with the latest techniques, concepts, and the most cutting-edge technologies through their programs. 

Did You Know? The University of Texas at Austin’s Data Science and AI and Machine Learning programs incorporate the latest and most widely-used tools and techniques in their curricula. 

  3. U.S universities lend you global mobility

As an extension of the positive reputation of U.S universities, having the name of one on your resume can provide you with the flexibility to work anywhere in the world. 

As the global job demand for Data Science and AI rises, companies across the world are on the lookout for competent professionals to meet it. They are less stringent about your nationality and more particular about what you bring to the table in terms of skills and expertise.

A look at the job market demand analysis for data scientists shows us that this demand is far from being centred in a single location, so if you want the flexibility to seek out opportunities outside your home country, a course from a U.S university should be at the very top of your list. 

Did You Know? UT Austin’s postgraduate programs offer you career development guidance to support your transition into a career in Data Science or AI/ML. 

  4. U.S universities are leaders in academic excellence

Since Data Science and AI are still relatively new fields, the exact recipe for success is yet to be established. However, what is not debatable is that you need to possess a mix of technical expertise, domain knowledge and business acumen to be able to thrive.  

Thus, whichever course you take up should be able to equip you with thorough (and hands-on) knowledge of the field, along with an understanding of how it applies to business. 

U.S universities are recognized throughout the world for their academic brilliance and for inculcating knowledge through diverse teaching methods, such as case studies, quizzes, problem walkthroughs, and projects. The end result is industry-ready candidates with tangible expertise to boast of.

Did You Know? UT Austin’s AI & Machine Learning and Data Science programs bring you the best of academia and the industry, through a learning model that combines recorded lectures from renowned global faculty with mentoring sessions led by current practitioners.

Summing Up 

It’s clear by now that U.S universities have the upper hand when it comes to data science, artificial intelligence and machine learning courses. Now it is up to you to decide which one can bring you the career outcomes you desire.

All the best!