Top Interview Questions For Cloud Computing You Should Know

Reading Time: 9 minutes

With 3.6 billion people actively using cloud services in 2018, Cloud Computing has become more popular than ever before. In this article, we will discuss the top Cloud Computing interview questions.

With an unfathomable volume of data, it becomes cumbersome for industries to manage their data. Cloud Computing is thus a lifeline for industries drowning in an ocean of data. Amazon, Microsoft, Deloitte, and Lockheed Martin are among the top recruiters for cloud computing professionals.

(Image source: Forbes)

According to a survey, the average salary of an entry-level cloud professional is around INR 8 lakh per annum, INR 12-15 lakh for professionals with under 3 years of experience, and a whopping INR 30 lakh or more for individuals with 10+ years of experience.

Check out our PGP program in Cloud Computing and get trained by industry professionals. Solve 15+ use cases and many more challenging projects. Enroll now!

For all aspiring cloud computing architects, here is a curated list of cloud computing interview questions.

So let’s begin:

1. How will you describe Cloud Computing as concisely and simply to a Layman?

Even though this might sound like a fundamental question, it has actually been asked in interviews (source: Quora).

Now, you must use simple words while answering this question. Use of technical terms is not advised.

In cloud computing, 'cloud' is a metaphor for the internet. Cloud computing is therefore a method of delivering computing services over the internet; you can simply call it internet-based computing.

2. Give the best example of open source Cloud Computing.

An open-source cloud is a cloud service or solution built using open-source software and technologies. This includes any public, private, or hybrid cloud model providing SaaS, IaaS, PaaS, or XaaS built and operated entirely on open-source technologies.

The best example of open source Cloud Computing is OpenStack.

3. What are system integrators in cloud computing?

System integrators came onto the scene around 2006. System integration is the practice of bringing together the components of a system into a whole and making sure the system performs smoothly.

A person or a company that specializes in system integration is called a system integrator.

4. List the platforms which are used for large-scale cloud computing.

The timely processing of massive digital collections demands the use of large-scale distributed computing resources and the flexibility to customize the processing performed on the collections.

The platforms used for large-scale cloud computing are listed below, followed by a toy sketch of the MapReduce idea:

– Apache Hadoop

– MapReduce
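
For intuition, here is a toy, single-machine sketch of the MapReduce idea in base R, using Map() and Reduce() to count words across a few documents. This only illustrates the programming model, not Hadoop itself; real MapReduce jobs distribute the map and reduce phases across a cluster, and the document vector here is made up.

# Toy word count illustrating the map and reduce phases (single machine only)
docs <- c("cloud computing in the cloud",
          "computing at scale in the cloud")

# Map phase: split each document into words, emitting one record per word
mapped <- unlist(Map(function(doc) strsplit(doc, " ")[[1]], docs))

# Reduce phase: aggregate the emitted words into per-word counts
word_counts <- Reduce(function(acc, word) {
  if (word %in% names(acc)) acc[word] <- acc[word] + 1 else acc[word] <- 1
  acc
}, mapped, integer(0))

word_counts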

5. Mention the different types of models used for deployment in cloud computing.

You need the right cloud deployment model to gain a competitive edge in the market. It gives you access to IT resources and services that can make your business flexible and agile, in terms of both volume and scale.

The different deployment models in cloud computing are:

– Private Cloud

– Public Cloud

– Community Cloud

– Hybrid Cloud

6. What do you mean by software as a service?

Software as a service (SaaS) is a software distribution model in which a third-party provider hosts applications and makes them available to their customers over the Internet. SaaS is one of three main categories of cloud computing, alongside infrastructure as a service (IaaS) and platform as a service (PaaS).

7. What is the platform as a service?

Platform as a service (PaaS) is a cloud computing model wherein a third-party provider delivers hardware and software tools, typically those needed for application development, to users over the internet. The provider hosts the hardware and software on its own infrastructure. As a result, PaaS gives users the flexibility to run applications without installing and maintaining the underlying hardware and software themselves.

8. What is a private cloud?

A private cloud delivers advantages similar to a public cloud, such as scalability and self-service, but does so through a proprietary architecture. Private clouds focus on the needs and demands of a single organization.

As a result, the private cloud is best for businesses with dynamic or unpredictable computing needs that require direct control over their environments, and for workloads with strict security, governance, and regulatory requirements.

Private clouds are used to keep strategic operations and sensitive data secure. A private cloud is a complete, fully functional platform that can be owned, operated, and restricted to a single organization or industry. Many organizations have moved to private clouds due to security concerns; some use a virtual private cloud operated by a hosting company.

9. What is the public cloud?

Whether public or private, the primary objective of a cloud is to deliver services over the internet. Unlike a private cloud, public cloud services are third-party offerings that can be used by anybody who wants access to them. The service may be free or sold on demand.

Public clouds are open to anyone for use and deployment; Google Cloud and Amazon Web Services are examples. Public clouds focus on a few layers, such as cloud applications, infrastructure provisioning, and platform markets.

10. What are Hybrid Clouds?

A hybrid cloud is a cloud computing environment in which we can use locally available services, third-party private services, and public services together to meet demand. By allowing workloads to move between private and public clouds as computing needs and costs change, the hybrid cloud gives businesses greater flexibility and more data deployment options.

Hybrid clouds are a combination of public and private clouds. They are often preferred over either alone because they take the most robust approach to implementing a cloud architecture, combining the functionality and features of both worlds. A hybrid cloud allows organizations to create their own cloud while handing control of parts of it over to a third party.

11. What is the difference between cloud computing and mobile computing?

Cloud computing means storing your files and folders in a "cloud" on the internet. This gives you the flexibility to access your files and folders from anywhere in the world, but you do need a physical device with internet access to reach them.

Mobile computing means taking a physical device, such as a laptop or mobile phone, with you. The two are related: mobile computing often builds on cloud computing. Cloud computing provides users with the data they require, while in mobile computing applications run on a remote server and the device gives the user access for storing and managing that data.

12. What is the difference between scalability and elasticity?

Scalability is a characteristic of cloud computing that handles an increasing workload by increasing resource capacity in proportion. Through scalability, the architecture provides resources on demand as traffic raises the requirement. Elasticity, on the other hand, is the ability to commission and decommission large amounts of resource capacity dynamically. It is measured by how quickly resources can be provisioned on demand and how closely resource usage matches actual demand.

13. What are the security benefits of cloud computing?

Protection against DDoS: Distributed Denial of Service attacks have become very common and target the cloud data of companies. Cloud computing security helps restrict traffic to the server, so traffic that could threaten the company and its data is averted.

Security of data: As data grows, data breaches become a significant issue and servers become soft targets. Cloud data security solutions help protect sensitive information and keep data secure against third parties.

Flexibility: Cloud offers flexibility, and this makes it popular. Users can scale up to avoid server crashes during traffic spikes, and once the high traffic is over, they can scale back down to reduce cost.

Identity management: cloud computing authorizes the application server and provides permissions to users so they can control the access of other users entering the cloud environment.

14. What is the usage of utility computing?

Utility computing, or "the computer utility", is a service provisioning model in which a service provider makes computing resources and infrastructure management available to the customer as needed and charges for specific usage rather than a flat rate.

Utility computing is a model managed by an organization that decides which services have to be deployed from the cloud. It facilitates users to pay only for what they use.

15. Explain Security management regarding Cloud Computing.

– Identity management provides authorization for application services.

– Access control grants permissions to users so they can control the access of other users entering the cloud environment.

– Authentication and authorization allow only authenticated and authorized users to access the data and applications.

16. How would you secure data for transport in the cloud?

This is a frequently asked question. Be prepared to go into more depth on this topic.

When transporting data in a cloud computing environment, keep two things in mind: Make sure that no one can intercept your data as it moves from point A to point B in the cloud, and make sure that no data leaks (malicious or otherwise) from any storage in the cloud.

A virtual private network (VPN) is one way to secure data while it is being transported in a cloud. A VPN effectively turns a public network into a private one. A well-designed VPN incorporates two things:

A firewall that acts as a barrier between the public internet and any private network.

Encryption protects your sensitive data from hackers; only the computer that you send it to should have the key to decode the data.

Also ensure that the encryption keys used with the data are managed safely, so that no data leaks while it moves from point A to point B in the cloud.
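
As a rough illustration of encrypting a payload before it leaves point A, here is a minimal sketch using the openssl R package (assuming it is installed). Key exchange and key management are deliberately out of scope, and the variable names are made up.

library(openssl)

# A shared 256-bit AES key (in practice, exchanged over a secure channel)
key <- rand_bytes(32)

# Sensitive payload to move from point A to point B
payload <- charToRaw("customer record 42: do not leak")

# Encrypt before transport; a random IV is generated and attached to the result
ciphertext <- aes_cbc_encrypt(payload, key = key)

# At point B, decrypt with the same shared key
rawToChar(aes_cbc_decrypt(ciphertext, key = key))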

17. What are some large cloud providers and databases?

Following are the most used large cloud providers and databases:

– Google BigTable

– Amazon SimpleDB

– Cloud-based SQL

18. List the open-source cloud computing platform databases.

Following are the open-source cloud computing platform databases:

– MongoDB

– CouchDB

– LucidDB

19. Explain the full form and usage of "EUCALYPTUS" in cloud computing.

“EUCALYPTUS” stands for Elastic Utility Computing Architecture for Linking Your Programs to Useful Systems.

Eucalyptus is an open-source software infrastructure for cloud computing, which enables us to implement clusters on a cloud computing platform. Its main application is building public, hybrid, and private clouds. Using Eucalyptus, you can turn your own data center into a private cloud, extend it to other organizations, and make the most of the functionalities it offers.

20. Explain public, static, and void class.

Public: This is an access modifier; it specifies who can access a particular method. A public method is accessible to any class.

Static: This Java keyword indicates that a member is class-based, meaning it can be accessed without creating an instance of the class.

Void: Void indicates that a method does not return any value; it describes the method's return type.

21. Explain the difference between cloud and traditional data centers.

In a traditional data center, the major drawback is the expenditure. A traditional data center is comparatively expensive due to hardware, software, and heating costs, so both the initial cost and the ongoing maintenance cost are high.

A cloud, by contrast, is scaled up when demand increases, and the provider bears the cost of maintaining the data centers, so these issues are not faced by users of cloud computing.

22. List the three basic functioning clouds in cloud computing.

– Professional cloud

– Personal cloud

– Performance cloud

23. What are the building blocks in cloud architecture?

– Reference architecture

– Technical architecture

– Deployment operation architecture

24. What do you mean by CaaS?

CaaS (Communication as a Service) is a term used in the telecom industry. CaaS offers enterprise users features such as desktop call control, unified messaging, and desktop faxing.

25. What are the advantages of cloud services?

Following are the main advantages of cloud services:

Cost-saving: It helps organizations utilize their IT investments more efficiently, which saves cost.

Scalable and robust: It helps in developing scalable and robust applications. Scaling that previously took months can now be done in far less time.

Time-saving: It saves time on deployment and maintenance.

26. How can a user gain from utility computing?

Utility computing allows users to pay only for what they use. It is managed by an organization that decides which services have to be deployed from the cloud.

Most organizations prefer a hybrid strategy.

27. Before opting for a cloud computing platform, what are the essential things a user should consider?

– Compliance

– Loss of data

– Data storage

– Business continuity

– Uptime

– Data integrity

28. Give a brief introduction to the Windows Azure operating system.

The Windows Azure operating system is used to run cloud services on the Windows Azure platform. It is preferred because it includes the essential features for hosting services in the cloud. You also get a runtime environment that consists of a web server, primary storage, management services, and load balancers, among others. Windows Azure also provides a fabric for developing and testing services before they are deployed to the cloud.

29. Mention some of the top cloud applications used nowadays.

Top cloud computing applications include Google Docs, which is fast and secure and also has a mobile version, so you can access your data from a smartphone. Pixlr, Phoenix, and JayCut are other applications that run in the cloud.

30. What are the different data types used in cloud computing?

There are different data types in cloud computing, such as emails, contacts, images, blogs, etc. Since data is increasing day by day, new data types are needed to store these new kinds of data. For example, if you want to store video, you need a new data type.

Now, if you want to know more, you can enroll in our cloud computing course, with training from industry professionals, use cases, and hands-on projects.

That wraps up our questions here. These questions will help you in your interview. All the best!

Check out the PGP-Cloud Computing program by Great Learning. With 3 million+ hours of learning delivered, 5,000+ alumni, 300+ industry experts, and 8 top-ranked programs, Great Learning is among the top-ranked institutions for analytics.

Get in touch with us for further details, and don't forget to mention your questions in the comments section; we will get back to you with the most industry-relevant answers.

 

15 Most Important Cybersecurity Interview Questions

Reading Time: 5 minutes

Cybersecurity is a vast domain and there are a wide variety of questions that could be asked during an interview. Recruiters mostly focus on the technical aspects and knowledge of tools and techniques to ensure a secure framework. Here are a few commonly asked cybersecurity interview questions that you might face while seeking jobs in the cybersecurity domain.

 

  1. What is data leakage and what causes it?

The unauthorized transmission of data from within an organization to an external entity or destination is known as data leakage.

The many factors that contribute to data leakage are: 

– Weak passwords

– Theft of company assets 

– The exploitation of vulnerabilities by Hackers

– Accidental e-mails 

– Malicious attacks

– Loss of paperwork

– Phishing

– System errors or misconfiguration

– Inadequate security features for shared drives and documents

– Unsecured back-up

 

  2. How can data be safeguarded?

     

    – Data Loss Prevention Software

    – Email Encryption

    – Training employees on password implementation

    – Two-Factor Authentication

    – Using Virtual Private Networks

    – Monitor and regulate the usage of physical devices

    – Periodic Reviews of IT Infrastructure

    – Regularly update cyber-security policies

    – Wipe old devices clean before disposing of them

The most common data loss prevention techniques are:

– Encryption

– Cryptographic hashing

– Encoding

– Data fingerprinting (read, hash and store)

 

  3. Explain threat, vulnerability, and risk.

Vulnerability is the gap or weakness in a security program that could be exploited to acquire unauthorized access to a company’s asset.

Threat is anything that can intentionally or accidentally exploit a vulnerability to damage or destroy an asset.

Risk is the potential of a threat to exploit a vulnerability and destroy or damage an asset. If a system is not secure enough and has the chances of data loss or damage, it’s under high risk.

 

  4. What are the different types of web server vulnerabilities?

Some of the web server vulnerabilities are:

– Misconfiguration

– Default Settings

– Bugs in Operating System or web server

 

5. What is SSL? Is it enough when it comes to encryption?

SSL by itself is not hard encryption of stored data. It is primarily an identity verification technique, used to confirm that the party one is conversing with is in fact who they say they are. SSL and TLS are used almost everywhere and by everyone, and because of this popularity they face the risk of being attacked via their implementations and their well-known methodology (e.g. the Heartbleed bug). Additional security is required for data-in-transit and data-at-rest, as SSL can be stripped in certain circumstances.

 

  6. Describe the three major first steps for securing your Linux server.

The three broad steps to secure a Linux Server are:

Auditing: A server audit is performed to find obscure issues that can challenge the server's security or stability. The system is scanned or audited for security issues using a tool called Lynis. Each category is scanned separately, and a hardening index is then provided to the auditor for further action.

Hardening: Once the audit is complete, the system needs to be hardened based on the security level it requires. This process mainly involves taking the right steps against the security issues identified while auditing.

Compliance: Sticking to the policy outline and the technical baseline is an important aspect of security to maintain a common standard for the same.

 

  7. What are the techniques used in preventing a brute force login attack?

There are three common techniques to prevent a brute force login attack:

Account Lockout Policy: After a set number of failed attempts the account is locked out until the administrator unlocks it.

Progressive Delays: After three failed login attempts, the account will be locked for a certain time period. With each failed login attempt after this, the lock-out period will keep increasing, hence making it impractical for the automated tools to attempt forced login.

Challenge-response test: This is primarily to prevent automated submissions on the login page. Tools like the free reCAPTCHA can be used to ask the user to manually input some text or solve a simple problem to ensure that the user is an actual person.

 

  8. What is phishing and how can it be prevented?

Phishing is a social engineering attack intended to steal data from users, usually login credentials, credit card numbers, and bank account details, with the intention of deceiving or scamming them. The social engineer impersonates genuine web pages and asks for login and other details.

Some of the ways to prevent phishing are: 

– Two-factor Authentication involving two identity confirmation methods

– Filters to flag high-risk e-mails

– Augmented password logins using identity cues

– Train your employees to beware of certain tell-tale e-mails, and on information-sharing tactics

– Have a guard against Spam

 

  9. What is the CIA triad?

It is a standard for implementing Information Security and is common across various types of systems and/or across organizations.


Confidentiality: Only the concerned audience can access the data.

Integrity: Ensures that data is kept intact without any foul play in the middle

Availability: Data and systems are available to authorized parties as and when needed.

 

  10. Explain SSL encryption.

Secure Sockets Layer (SSL) is the standard for establishing an encrypted link between a browser and a web server. It secures the data exchanged between the web server and the browser and keeps it private and intact. SSL is the industry standard for protecting online transactions between businesses and their customers and is used by millions of websites.

 

  11. What are salted hashes?

A password is protected in a system by creating a hash value of that password. A ‘salt’ is a random number which is added to this hash value and stored in the system. This helps against the dictionary attacks.
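
A minimal sketch of salting in R using the openssl package (assuming it is installed). A production system would use a slow, purpose-built password-hashing scheme such as bcrypt or Argon2 rather than a single SHA-256 round; the variable names here are illustrative.

library(openssl)

password <- "s3cret-passw0rd"

# Random salt, stored alongside the hash for later verification
salt <- paste(as.character(rand_bytes(16)), collapse = "")

# Store the hash of salt + password instead of the password itself
stored_hash <- sha256(paste0(salt, password))

# Verification: recompute with the stored salt and compare
candidate <- "s3cret-passw0rd"
identical(as.character(sha256(paste0(salt, candidate))), as.character(stored_hash))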

 

  12. What are some common cyber-attacks?

Some of the most common cyber-attacks are:

– Phishing

– Malware

– Password Attacks

– DDoS

– Man in the Middle

– Drive-By Downloads

– Malvertising

– Rogue Software

 

  13. How does tracert or traceroute work?

These are used to determine the route from the host computer to a remote machine. They also identify how packets are redirected, if they take too long to traverse, and the number of hops used to send traffic to a host. 

 

  14. What is the difference between symmetric and asymmetric encryption?

In symmetric encryption, a single shared key is used for both encryption and decryption, while asymmetric encryption uses a key pair: a public key for encryption and a private key for decryption. Symmetric encryption is much faster, but securely sharing the key is harder, which is why asymmetric encryption is often used to exchange symmetric keys.
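
To make the contrast concrete, here is a small sketch with the openssl R package (assuming it is installed): AES uses one shared key in both directions, while RSA encrypts with a public key and decrypts with the matching private key. The key sizes and variable names are illustrative only.

library(openssl)

msg <- charToRaw("meet at noon")

# Symmetric: the same 256-bit key encrypts and decrypts
sym_key <- rand_bytes(32)
sym_ct  <- aes_cbc_encrypt(msg, key = sym_key)
rawToChar(aes_cbc_decrypt(sym_ct, key = sym_key))

# Asymmetric: encrypt with the recipient's public key, decrypt with their private key
rsa_key <- rsa_keygen(bits = 2048)
asym_ct <- rsa_encrypt(msg, pubkey = rsa_key$pubkey)
rawToChar(rsa_decrypt(asym_ct, key = rsa_key))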

 

  15. Is it possible to log in to Active Directory from a Linux or Mac box?

Yes, it is possible to access the active directory from a Linux or a Mac box system by using the Samba program for implementing the SMB protocol. Depending on the version, this allows for share access, printing, or even Active Directory membership. 

 

Stay tuned to this page for more such information on cybersecurity interview questions and career assistance. If you are not confident enough yet and want to prepare more to grab your dream job in the field of Cyber-Security, upskill with Advanced Computer Security Program: A program by Stanford Center for Professional Development, delivered and supported by Great Learning.

 

 

 

 

 

Linear Regression for Beginners – Machine Learning

Reading Time: 4 minutes

Before learning about linear regression, let us get accustomed to regression itself. Regression is a method of modelling a target value based on independent predictors. It is mostly used for forecasting and for finding cause-and-effect relationships between variables. Regression techniques mostly differ in the number of independent variables they use and the type of relationship between the independent and dependent variables.

(Graph: data points with the best-fit regression line shown in red)

Simple linear regression is a type of regression analysis in which there is just one independent variable and a linear relationship between the independent (x) and dependent (y) variables. The red line in the graph above is referred to as the best-fit straight line. Based on the given data points, we try to plot a line that models the points best.

The line can be modelled based on the linear equation shown below.

Y = β0 + β1x + ε

Isn’t Linear Regression from Statistics?

Before we dive into the details of linear regression, you may be asking yourself why we are looking at this algorithm.

Isn’t it a technique from statistics?

Machine learning, and more specifically the field of predictive modelling, is primarily concerned with minimizing the error of a model, or making the most accurate predictions possible, at the expense of explainability. In applied machine learning we borrow and reuse algorithms from many different fields, including statistics, and use them towards these ends.

As such, linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.

Linear Regression Model Representation

Linear regression is an attractive model because the representation is so simple.

The representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output for that set of input values (y). As such, both the input values (x) and the output value (y) are numeric.

The linear equation assigns one scale factor to each input value or column, called a coefficient and represented by the Greek letter beta (β). One additional coefficient is also added, giving the line an additional degree of freedom (e.g. moving up and down on a two-dimensional plot); it is often called the intercept or the bias coefficient.

For example, in a simple regression problem (a single x and a single y), the form of the model would be:

Y = β0 + β1x

In higher dimensions, when we have more than one input (x), the line is called a plane or a hyperplane. The representation, therefore, is the form of the equation together with the specific values used for the coefficients (e.g. β0 and β1 in the above example).

Linear Regression – Learning the Model

  1. Simple Linear Regression

With simple linear regression when we have a single input, we can use statistics to estimate the coefficients.

This requires that you calculate statistical properties from the data such as mean, standard deviation, correlation, and covariance. All of the data must be available to traverse and calculate statistics.
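
As a hedged sketch of what estimating the coefficients from sample statistics can look like in R (with made-up data): the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means.

set.seed(42)
x <- runif(50, 0, 10)                  # made-up predictor
y <- 3 + 2 * x + rnorm(50, sd = 1.5)   # made-up response with known coefficients

b1 <- cov(x, y) / var(x)               # slope estimate
b0 <- mean(y) - b1 * mean(x)           # intercept estimate
c(intercept = b0, slope = b1)

# Cross-check against R's built-in least squares fit
coef(lm(y ~ x))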

  2. Ordinary Least Squares

When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients.

The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This means that, given a regression line through the data, we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together. This is the quantity that Ordinary Least Squares seeks to minimize.
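
A brief sketch of an Ordinary Least Squares fit with more than one input using R's lm(), together with the sum of squared residuals that the procedure minimizes; the data and variable names are simulated for illustration.

set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n, sd = 1)   # made-up response

fit <- lm(y ~ x1 + x2)   # ordinary least squares fit
coef(fit)

# The quantity OLS minimizes: the sum of squared residuals
sum(residuals(fit)^2)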

  3. Gradient Descent

Gradient Descent is an iterative optimization procedure that starts with random values for each coefficient. The sum of the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor, and the coefficients are updated in the direction that minimizes the error. The process is repeated until a minimum sum of squared errors is achieved or no further improvement is possible.

When using this method, you must select a learning rate (alpha) parameter that determines the size of the improvement step to take on each iteration of the procedure.
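
A compact sketch of batch gradient descent for simple linear regression in R, on made-up data; alpha is the learning rate discussed above, and the iteration count is arbitrary.

set.seed(1)
x <- rnorm(200)
y <- 4 + 3 * x + rnorm(200, sd = 0.5)   # made-up data with known coefficients

alpha <- 0.05        # learning rate
b0 <- 0
b1 <- 0              # start from arbitrary coefficient values

for (i in 1:2000) {
  y_hat <- b0 + b1 * x
  error <- y_hat - y
  # Update each coefficient in the direction that reduces 0.5 * mean squared error
  b0 <- b0 - alpha * mean(error)
  b1 <- b1 - alpha * mean(error * x)
}

c(b0 = b0, b1 = b1)   # should end up close to the true values 4 and 3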

  4. Regularization

There are extensions to the training of the linear model called regularization methods. These seek both to minimize the sum of the squared errors of the model on the training data (as in ordinary least squares) and to reduce the complexity of the model (for example, the number of coefficients or the absolute size of their sum).

Two popular examples of regularization procedures for linear regression are listed below, followed by a brief sketch:

– Lasso Regression: where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients (called L1 regularization).

– Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
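
A short sketch of both penalties using the glmnet package (assuming it is installed): alpha = 1 gives the Lasso (L1) and alpha = 0 gives Ridge (L2). The simulated data and the lambda value passed to coef() are made up for illustration.

library(glmnet)

set.seed(3)
X <- matrix(rnorm(100 * 5), ncol = 5)                 # made-up inputs
y <- 2 + X[, 1] - 0.5 * X[, 3] + rnorm(100, sd = 1)   # only two inputs matter

lasso_fit <- glmnet(X, y, alpha = 1)   # L1 penalty: can shrink coefficients to exactly zero
ridge_fit <- glmnet(X, y, alpha = 0)   # L2 penalty: shrinks coefficients towards zero

# Coefficients at an illustrative penalty strength lambda = 0.1
coef(lasso_fit, s = 0.1)
coef(ridge_fit, s = 0.1)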

Preparing Data for Linear Regression

Linear regression has been studied at great length, and there is a lot of literature on how your data must be structured to make the best use of the model. In practice, you can treat these rules as rules of thumb when using Ordinary Least Squares regression, the most common implementation of linear regression.

Try different preparations of your data using these heuristics and see what works best for your problem.

– Linear Assumption

– Noise Removal

– Remove Collinearity

– Gaussian Distributions

– Rescale Inputs

 

Summary

In this post, you discovered the linear regression algorithm for machine learning.

You covered a lot of ground including:

– The common names used when describing linear regression models.

– The representation used by the model.

– Learning algorithms used to estimate the coefficients in the model.

– Rules of thumb to consider when preparing data for use with linear regression.

 

Try out linear regression and get comfortable with it. If you are planning a career in Machine Learning, here are some Must-Haves On Your Resume and most common interview questions to prepare.

Hypothesis Testing in R- with examples and case study

Reading Time: 10 minutes

Hypothesis Testing is an essential part of statistics for Data Analytics. It is a statistical method used to make decisions based on statistics drawn from experimental data. In simple terms, a hypothesis is an educated claim or statement about a property of a population.

The goal of hypothesis testing in R is to analyze a sample in an attempt to distinguish between population characteristics that are likely to occur and those that are unlikely to occur.

Key terms and concepts-

Null Hypothesis H0:

A null hypothesis is the status quo: a general statement or default position that there is no relationship between two measured phenomena or no association among groups.

Alternate Hypothesis H1:

The alternative hypothesis is contrary to the null hypothesis: it is usually the claim that the observations are the result of a real effect.


Null and Alternative Hypotheses Examples:

Industry: Process Industry
Scenario: The Shop Floor Manager in a dairy company feels that the milk packaging unit for 1-litre packs is working fine and does not need any calibration (SD = 10 ml).
Null Hypothesis: µ = 1
Alternate Hypothesis: µ ≠ 1

Industry: Credit Risk
Scenario: The credit team of a bank has been taking lending decisions based on an in-house developed credit scorecard. Their claim to fame in the organization is that the scorecard has helped reduce NPAs by at least 0.25%.
Null Hypothesis: The scorecard has not helped in reducing the NPA level; π(scorecard NPA) – π(no scorecard NPA) = 0.25%
Alternate Hypothesis: π(scorecard NPA) – π(no scorecard NPA) > 0.25%

Industry: Motor Industry
Scenario: An electric car manufacturer claims their newly launched e-car gives an average mileage of 125 MPGe (Miles per Gasoline Equivalent).
Null Hypothesis: µ = 125
Alternate Hypothesis: µ < 125

 

Type I and Type II Error

             Null Hypothesis True     Null Hypothesis False
Reject       Type I Error (α)         No error
Accept       No error                 Type II Error (β)

 

A null hypothesis is rejected when it is actually true: this is a Type I error.

E.g., a manufacturer's quality control department rejects a lot even though it has met the acceptable quality level. This is the producer's risk.

 


 

The null hypothesis is accepted when it is actually false: this is a Type II error.

E.g., a consumer accepts a lot even though it is faulty. This is the consumer's risk.

 


P-value:

In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was observed, assuming that the null hypothesis is correct.

Significance level:

The probability of rejecting the null hypothesis when it is actually true is called the significance level α.

Note: If the P-value is equal to or smaller than the significance level (α), it suggests that the observed data is inconsistent with the assumption that the null hypothesis is correct, and thus this hypothesis must be rejected (but this does not automatically mean the alternative hypothesis can be accepted as true).

α is the probability of Type I error and β is the probability of Type ll error.

The experimenters have the freedom to set the α level for a particular hypothesis test.

That level is called the level of significance for the test. Changing α can (and often does) affect the results of the test- whether you reject or fail to reject H0.

As α increases, β decreases and vice versa.

The only way to decrease both α and β is to increase the sample size. To make both quantities equal to zero, the sample size would have to be infinite, and you would have to sample the entire population.

Possible Errors in Hypothesis Testing

The confidence coefficient (1-α) is the probability of not rejecting H0 when it is True.

The Confidence level of a hypothesis test is (1-α) * 100%.

The power of a statistical test (1-β) is the probability of rejecting H0 when it is false.

How to correct errors in hypothesis testing

Steps to follow:

Determine a P-value when testing a Null Hypothesis

If the alternative hypothesis is of the "less than" form (a lower-tailed test), you reject H0 only if the test statistic falls in the left tail of the distribution (for example, below -2). Similarly, if H1 is of the "greater than" form (an upper-tailed test), you reject H0 only if the test statistic falls in the right tail (for example, above 2).


Upper tailed, Lower Tailed, Two-Tailed Tests

H1: µ > µ0, where µ0 is the comparator or null value and an increase is hypothesized – this type of test is called an upper-tailed test.

H1: µ <  µ0, where a decrease is hypothesized and this type of test is called a lower-tailed test.

H1: µ ≠ µ0, where a difference is hypothesized and this type of test is called a two-tailed test.

 

Types of T-test in Hypothesis Test:

One sample t-test – One sample t-test is used to compare the mean of a population to a specified theoretical mean µ

The unpaired two-sample t-test (Independent t-test) – Independent (or unpaired two-sample) t-test is used to compare the means of two unrelated groups of samples.

Paired t-test – Paired Student’s t-test is used to compare the means of two related samples. That is when you have two values (a pair of values) for the same samples.
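
The three variants map directly onto R's t.test(); here is a hedged sketch with made-up vectors (group1, group2, before, after).

# One-sample t-test: is the mean of group1 different from a theoretical mean of 100?
group1 <- c(102, 98, 110, 95, 104, 99, 107, 101)
t.test(group1, mu = 100)

# Unpaired (independent) two-sample t-test: do two unrelated groups share a mean?
group2 <- c(90, 94, 88, 97, 92, 91, 95, 89)
t.test(group1, group2)

# Paired t-test: two measurements on the same subjects, e.g. before and after
before <- c(72, 80, 65, 90, 77, 84)
after  <- c(75, 83, 64, 95, 80, 88)
t.test(after, before, paired = TRUE)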

Let’s Look at some Case studies:

t-Test Application One Sample

Experience Marketing Services reported that the typical American spends a mean of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device. (Source: The 2014 Digital Marketer, available at ex.pn/1kXJifX.) To test the validity of this statement, you select a sample of 30 friends and family members. The results for the time spent per day accessing the Internet via a mobile device (in minutes) are stored in the Internet_Mobile_Time.csv file.

Is there evidence that the population mean time spent per day accessing the Internet via a mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05.

What assumption about the population distribution is needed to conduct the test in A?

Solution In R

>setwd("D:/Hypothesis")

> Mydata=read.csv("InternetMobileTime.csv", header = TRUE)


> attach(Mydata)

> xbar=mean(Minutes)

> s=sd(Minutes)

> n=length(Minutes)

> Mu=144 #null hypothesis

> tstat=(xbar-Mu)/(s/(n^0.5))

> tstat

[1] 1.224674

> Pvalue=2*pt(tstat, df=n-1, lower.tail=FALSE)

> Pvalue

[1] 0.2305533

> if(Pvalue < 0.05) "Null Hypothesis Rejected" else "Accepted"

[1] “Accepted”

  2. Independent two-sample t-test

A hotel manager wants to enhance the initial impressions that hotel guests have when they check in. Contributing to initial impressions is the time it takes to deliver a guest's luggage to the room after check-in. A random sample of 20 deliveries on a particular day was selected from each of Wing A and Wing B of the hotel. The collated data is given in the Luggage.csv file. Analyze the data and determine whether there is a difference in the mean delivery times between the two wings of the hotel (use alpha = 0.05).

Solution In R

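A minimal sketch of the setup for this test, assuming Luggage.csv contains columns named WingA and WingB as used in the calls below:

> luggage <- read.csv("Luggage.csv", header = TRUE)

> attach(luggage)   # makes WingA and WingB available by name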

> t.test(WingA,WingB, var.equal = TRUE, alternative = "greater")

    Two Sample t-test

data:  WingA and WingB

t = 5.1615,

df = 38,

p-value = 4.004e-06

alternative hypothesis: true difference in means is greater than 0

95 percent confidence interval:

1.531895   Inf

sample estimates:

mean of x mean of y

 10.3975 8.1225

> t.test(WingA,WingB)

   Welch Two Sample t-test

data:  WingA and WingB

t = 5.1615, df = 37.957, p-value = 8.031e-06

alternative hypothesis: true difference in means is not equal to 0

95 per cent confidence interval:

1.38269 3.16731

sample estimates:

mean of x mean of y

 10.3975 8.1225

boxplot(WingA,WingB, col = c("Red","Pink"), horizontal = TRUE)


Case Study- Titan Insurance Company

The Titan Insurance Company has just installed a new incentive payment scheme for its lift policy salesforce. It wants to have an early view of the success or failure of the new scheme. Indications are that the sales force is selling more policies, but sales always vary in an unpredictable pattern from month to month and it is not clear that the scheme has made a significant difference.

Life Insurance companies typically measure the monthly output of a salesperson as the total sum assured for the policies sold by that person during the month. For example, suppose salesperson X has, in the month, sold seven policies for which the sums assured are £1000, £2500, £3000, £5000, £5000, £10000, £35000. X's output for the month is the total of these sums assured, £61,500. Titan's new scheme is that the sales force receives low regular salaries but is paid large bonuses related to their output (i.e. to the total sum assured of policies sold by them). The scheme is expensive for the company, but they are looking for sales increases which more than compensate. The agreement with the sales force is that if the scheme does not at least break even for the company, it will be abandoned after six months.

The scheme has now been in operation for four months. It has settled down after fluctuations in the first two months due to the changeover.

To test the effectiveness of the scheme, Titan has taken a random sample of 30 salespeople, measured their output in the penultimate month before the changeover, and then measured it in the fourth month after the changeover (they deliberately chose months not too close to the changeover). The outputs of the salespeople are shown in Table 1.

SALESPERSON Old_Scheme New_Scheme
1 57 62
2 103 122
3 59 54
4 75 82
5 84 84
6 73 86
7 35 32
8 110 104
9 44 38
10 82 107
11 67 84
12 64 85
13 78 99
14 53 39
15 41 34
16 39 58
17 80 73
18 87 53
19 73 66
20 65 78
21 28 41
22 62 71
23 49 38
24 84 95
25 63 81
26 77 58
27 67 75
28 101 94
29 91 100
30 50 68

 

Data preparation

Since the given data are in £000, it is better to convert them to actual amounts (multiply by 1,000).

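A minimal sketch of this conversion, assuming the Table 1 data are loaded into a data frame with columns Old_Scheme and New_Scheme (the file name is illustrative):

titan <- read.csv("TitanSchemes.csv", header = TRUE)   # hypothetical file holding Table 1

old_scheme <- titan$Old_Scheme * 1000   # convert from GBP thousands to GBP
new_scheme <- titan$New_Scheme * 1000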

Problem 1

Describe the five per cent significance test you would apply to these data to determine whether the new scheme has significantly raised outputs. What conclusion does the test lead to?

Solution:

Since we are asked whether the new scheme has significantly raised output, this is an example of a one-tailed (right-tailed) t-test.

Note: Two-tailed test could have been used if it was asked “new scheme has significantly changed the output”

Mean of amount assured before the introduction of scheme = 68450

Mean of amount assured after the introduction of scheme = 72000

Difference in mean = 72000 – 68450 = 3550

Let,

μ1 = Average sums assured by salesperson BEFORE changeover.
μ2 = Average sums assured by salesperson AFTER changeover.

H0: μ1 = μ2  ; μ2 – μ1 = 0

HA: μ1 < μ2   ; μ2 – μ1 > 0 ; true difference of means is greater than zero.

Since the population standard deviation is unknown, a paired-sample t-test will be used.

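A minimal sketch of the corresponding test call, assuming the converted vectors old_scheme and new_scheme from the data preparation step:

# Paired, right-tailed test: has the new scheme raised mean output?
t.test(new_scheme, old_scheme, paired = TRUE, alternative = "greater")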

 

Since the p-value (0.06529) is higher than 0.05, we fail to reject the null hypothesis. The new scheme has NOT significantly raised outputs.

Problem 2

Suppose it has been calculated that for Titan to break even, the average output must increase by £5000. If this figure is an alternative hypothesis, what is:

(a)  The probability of a type 1 error?

(b)  What is the p-value of the hypothesis test if we test for a difference of £5000?

(c)   Power of the test:

Solution:

2.a.  The probability of a type 1 error?

Solution: Probability of Type I error = significance level = 0.05 or 5%

2.b.  What is the p-value of the hypothesis test if we test for a difference of £5000?

Solution:

Let  μ2 = Average sums assured by salesperson AFTER changeover.

μ1 = Average sums assured by salesperson BEFORE changeover.

μd = μ2 – μ1

H0: μd ≤ 5000
HA: μd > 5000

This is a right tail test.

R code:


 

P-value = 0.6499

2.c. Power of the test.

Solution:

Let μ2 = Average sums assured by salesperson AFTER changeover, μ1 = Average sums assured by salesperson BEFORE changeover, and μd = μ2 – μ1.

H0: μd = 0
HA: μd > 0

H0 will be rejected if the test statistic > t_critical.

With α = 0.05 and df = 29, the critical value for the t statistic (t_critical) is 1.699127.

Hence, H0 will be rejected for a test statistic ≥ 1.699127.

Equivalently, H0 will be rejected if x̄ ≥ 4368.176 (t_critical times the standard error of the mean difference).

(Graph: sampling distribution of the mean difference, with the rejection region to the right of x̄ = 4368.176)

R Code:

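A minimal sketch that reproduces the critical value quoted above:

t_critical <- qt(0.95, df = 29)   # one-sided test at alpha = 0.05 with n - 1 = 29 df
t_critical                        # 1.699127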

      Probability of a Type II error is P(Do not reject H0 | H0 is false).

      Here the null hypothesis is H0: μd = 0, with HA: μd > 0.

      Probability of a Type II error at μd = 5000:

= P(Do not reject H0 | μd = 5000)

= P(x̄ < 4368.176 | μd = 5000)

= P(t < (4368.176 - 5000) / SE | μd = 5000), where SE is the standard error of the mean difference

= P(t < -0.245766)

= 0.4037973

R Code:

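A minimal sketch of the tail-area calculation shown above:

beta <- pt(-0.245766, df = 29)   # P(do not reject H0 | mu_d = 5000)
beta                             # approximately 0.4038

power <- 1 - beta
power                            # approximately 0.5962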

 

Now, β = 0.4037973,

Power of the test = 1 - β = 1 - 0.4037973

= 0.5962027

 

Conclusion:

A hypothesis should always state what is expected to happen. It contains both an independent and a dependent variable, and it should be testable and measurable, though it may or may not turn out to be correct. You ascertain the truth of the hypothesis by using hypothesis testing.

Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. The usual process of hypothesis testing consists of four steps.

  • Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternate hypothesis (commonly, that the observations show a real effect combined with a component of chance variation).
  • Identify a test statistic that can be used to assess the truth of the null hypothesis
  • Compute the p-value, which is the probability that a test statistic at least as significant as the one observed would be obtained assuming that the null hypothesis was correct. The smaller the p-value, the stronger the evidence against the null hypothesis.
  • Compare the p-value to an acceptable significance level α (sometimes called the alpha value). If p ≤ α, then the observed effect is statistically significant, i.e., the null hypothesis is ruled out, and the alternative hypothesis is valid.

 

Your essential weekly guide to Data Science and Analytics – August 7

Reading Time: 2 minutes

Data Science and Business Analytics are now an integral part of today’s businesses. Top organizations and experts are now deciding on ways to make it a way of life at workplaces, and effectively promote these new roles among budding professionals. Read the articles below to get more information on Data Science and Analytics updates.

Accenture India May Expand Data Science Team to Make it a Hub For APAC

Consulting and technology services provider Accenture India may expand its data science team to make it a hub serving the Asia-Pacific. To be sure, the proposal is still being discussed, said Anindya Basu, the country managing director of Accenture India and a member of its APAC management committee. To make this a data sciences hub, the workforce expansion will have to be significant, he said.

6 Ways To Become More Marketable As a Data Scientist

Data Science is without a doubt “the sexiest job of the 21st century”, and with time, the community is just getting bigger and better. In this article, we are going to see how you can market yourself better as a data scientist and leave a mark in the community.

Data Science is Changing The Way Businesses Compete And Operate

A number of businesses today are leaning towards data-empowered approaches and embracing unique and innovative methods to attain success. China, the US, Switzerland, Canada, and Australia are the top five countries which lead the data science adoption in recent years. Other countries which are levelling up their game to win the race are Japan, Sweden, Singapore, UK, Netherlands, Germany, and India.

Data Ethics Issues Create Minefields For Analytics Teams

IT and analytics teams need to be guided by a framework of ethics rules and motivated by management to put those rules into practice. Otherwise, a company runs the risk of crossing the line in mining and using personal data — and, typically, not as the result of a nefarious plan to do so. 

Chief Data Officers Urged to Master Product Management

According to Gartner, a product-centric data and analytics organisation requires new skill sets, roles, investment models and the right culture. Gartner recommended that CDOs who want to help their organisation transform into one that uses ongoing data-driven product development, start with a data and analytics platform, then define a release plan and roadmap. After that, a team can be built to make the delivery cycle happen.

 

Happy Reading!

Your essential weekly guide to Artificial Intelligence – August 7

Reading Time: 2 minutes

With some of the recent developments and applications, it is safe to say that Artificial Intelligence is becoming more human-like. Some of the world’s largest organizations and start-ups are working towards developing solutions like the ones mentioned below to make this happen. Read along to know more!

With $1 Billion From Microsoft, an A.I. Lab Wants to Mimic The Brain

Sam Altman and his team of researchers hope to build artificial general intelligence, or A.G.I., a machine that can do anything the human brain can do. A.G.I. still has a whiff of science fiction. But in their agreement, Microsoft and OpenAI discuss the possibility with the same matter-of-fact language they might apply to any other technology they hope to build.

Dasha AI is Calling so You Don’t Have To

The team at Dasha AI is building a platform for designing human-like voice interactions to automate business processes. Put simply, it’s using AI to make machine voices a whole lot less robotic.

EmoNet: Emotional Neural Network Automatically Categorises Feelings

A neural network called EmoNet has been designed to automatically categorise the feelings of an individual. EmoNet was created by researchers from the University of Colorado and Duke University and could one day help AIs to understand and react to human emotions.

War Cloud: Why is The US Military Building a Vast AI System And How Will It Change The Battlefield Of The Future?

The JEDI project, or "war cloud", would involve establishing a vast cloud computing system whereby a network of remote servers hosted on the internet stores and processes data. The war cloud would use these servers to store classified military data, while also providing the computing power to enable AI-based war planning.

Hybrid Chip Paves Way For ‘Thinking Machines’

Chinese researchers have developed a hybrid chip architecture that could move the world a step closer to achieving artificial general intelligence (AGI) and a future filled with humanlike “thinking machines”.

 

Happy Reading!

Great Lakes’ PGP-DSBA Provides the Ultimate Flexibility to Pursue Data Science & Analytics

Reading Time: 2 minutes

Great Lakes' Post-Graduate Program in Data Science and Business Analytics (PGP-DSBA) offers maximum flexibility to facilitate learning for professionals, without needing to compromise on their ongoing routine and commitments. Here are the 4 flexibility options offered:

  1. Flexibility to take up missed sessions – Learning new skills to transition your career along with the existing professional and personal commitments could be a daunting experience. But we make it easier with the opportunity to access your missed lectures any time you can and want. Our Learning Management system Olympus helps you access recorded sessions and weekend classes 24/7. This gives you the freedom of going back to the sessions over and over again for better learning. With Olympus, you don’t have to worry about missing a session – ever again!
  2. Flexibility to accommodate transfer cases – For the ones who frequently travel and have last-minute commitments cropping up regularly, being present at the location of the program all the time becomes a challenge. We accommodate all the cases of inter-city location change. You can continue taking up classes with the batch that is active in the city of your travel. If there are no active batches in the city of your travel, or this option doesn’t work for you due to any other reason, our very next flexibility option surely will !!
  3. Flexibility to accommodate sabbaticals from our batch – If your company is sending you abroad for a couple of months, or if you have any other valid commitment that needs you to be away, this option is for you. Great Learning provides you with an option to leave the program mid-way, and resume when you are back with another batch. Also, there is no maximum cap on the hold period. Whether it is two months or 6 months, we accommodate all cases.
  4. Flexibility in payment options – At Great Learning, we believe that the financial constraints should not become a road-block in the learning path. Therefore, we offer flexible payment options for all our applicants. We have easy instalment options that reduce the stress of paying the course fee upfront. It also helps you to plan your finances for the year ahead of joining the course. With these options, you can focus more on learning rather than stressing about the fee payment. We have partnered with HDFC Credilla, Zest Money, and Avance education for easy loan options.

Great Learning’s PG program in Data Science and Business Analytics is hands down the most flexible course there is. Don’t wait for a perfect time, as there never will be. Join us now!

 

Disclaimer: Information is subject to change without prior notice.

Great Lakes’ PG Program – Data Science & Business Analytics Selection Process Demystified

Reading Time: 2 minutes

Every year, thousands of professionals apply to our Data Science programs. When you enrol in PGP-DSBA (PG program in Data Science and Business Analytics), we believe it is our responsibility to give enough time and attention to every candidature even if it means going through hundreds of profiles to select one professional that could benefit from our program. Here are the details of the selection process:

  1. Online Application Form – The application process is quite simple and not at all time-consuming, as it hardly takes 10 minutes to complete. To apply, you need to fill up a simple Online Application Form. You will have to create an account before filling up this form, or log in if you are registered with us already. It is advised that you keep the details of all your professional experience and educational background handy, and think beforehand about why you want to learn Data Science. When you write about your professional and educational background, you should elaborate on your experience and mention even the minutest of details. Once you are done with filling in all your details, you will be prompted with a question, "Why do you want to learn Data Science?", to complete your application.
  2. Shortlisting by Panel Review – Suitable candidates for this course are shortlisted by a panel of 2 to 3 faculty members. They evaluate applicants based on their work and educational profile, which includes, but is not limited to, the field of specialization in graduation, aggregate score in graduation/post-graduation, work experience till the date of application, and relevance to the course. Please do not forget to check the eligibility criteria before you apply.
  3. Interview/Screening – We receive thousands of applications for a single batch and only a few are selected. This is done through a screening process that involves a telephonic interview. As the number of applications is quite large, it becomes difficult for us to objectively process all the information about each candidate. Hence, we evaluate your candidature with the help of a set of analytical tools and shortlist profiles for the interview process.
  4. Offer Stage – After a careful review by the faculty, each candidate is assigned a score based on interview and faculty score. The set of applicants with the highest total scores are offered to pursue the program.

Great Learning's PGP-DSBA is one of the most comprehensive courses in Data Science and Business Analytics, and this is the reason for a steady selection rate of 1 out of 10 applications. Admissions are conducted on a rolling basis and the process is closed once we select the required number of candidates for a given batch. Keep track of our closing deadline, as batches fill up fast and well before the closing date. If you miss applying for a particular batch on time, you can always keep track of the details of the next batch.

Join PGP-DSBA now for a rewarding career in Business Analytics and Data Science!

 

Payment Options – Online PG Program in Data Science and Business Analytics (PGP-DSBA Online)

Reading Time: 1 minute

Great Lakes’ Online PG Program in Data Science and Business Analytics (PGP-DSBA Online) provides the most flexible payment options to reduce your financial stress while deciding on upskilling. With us, you need not worry about the financing options, and just need to focus on learning and output. Our program offers the below payment structure for maximum flexibility:

  1. Easy Installments – Initially you just need to pay the admission fee to confirm your candidature. The rest of the amount is divided into 3 equal instalments to minimize the stress of one-time payment. Below is the payment breakup:

    Instalments

    Admission Fee: 25,000
    1st Instalment: 58,333
    2nd Instalment: 58,333
    3rd Instalment: 58,334
    Total: 2,00,000

  2. Multiple Payment Options – We offer multiple payment options including payment through debit card, credit card, net banking, demand drafts and cheques. Fuss-free, right?
  3. Pre-Approved Education Loans – We do not want the financial aspect becoming a road-block in your learning process. Therefore, we have partnered with various third party lenders like HDFC Credila, Avance Education, and Zest Money (0% EMI) providing a substantially lower interest rate as compared to other financial institutions in the market.
  4. Fee Waiver on One-Time Payment – A fee waiver up to INR 10,000/- is given if someone chooses to pay the full program fee in one go.

With the online PGP-DSBA’s Flexi-payment options, you no longer need to wait for a fulfilling and rewarding career in Business Analytics. Join us now!