Hypothesis Testing in R - with Examples and Case Study

Reading Time: 10 minutes

Hypothesis Testing is an essential part of statistics for Data Analytics. It is a statistical method used to make decisions based on statistics drawn from experimental data. In simple terms, a hypothesis is an educated claim or statement about a property of a population.

The goal of Hypothesis Testing in R is to analyze a sample in an attempt to distinguish between population characteristics that are likely to occur and population characteristics that are unlikely to occur.

Key terms and concepts:

Null Hypothesis (H0):

The null hypothesis is the status quo: a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.

Alternate Hypothesis (H1):

The alternative hypothesis is contrary to the null hypothesis: it states that the observations are the result of a real effect.


Null and Alternative Hypotheses Examples:

Process Industry: The Shop Floor Manager in a dairy company feels that the milk packaging process unit for 1-litre packs is working fine and does not need any calibration (SD = 10 ml).

Null Hypothesis: µ = 1
Alternate Hypothesis: µ ≠ 1

Credit Risk: The credit team of a bank has been taking lending decisions based on an in-house developed Credit Scorecard. Their claim to fame in the organization is that their scorecard has helped reduce NPAs by at least 0.25%.

Null Hypothesis: the scorecard has not helped in reducing the NPA level; π(scorecard NPA) – π(no scorecard NPA) = 0.25%
Alternate Hypothesis: π(scorecard NPA) – π(no scorecard NPA) > 0.25%

Motor Industry: An electric car manufacturer claims their newly launched e-car gives an average mileage of 125 MPGe (Miles per Gallon of Gasoline Equivalent).

Null Hypothesis: µ = 125
Alternate Hypothesis: µ < 125

 

Type I and Type II Error

Null Hypothesis    True                False
Reject             Type I Error (α)    No error
Accept             No error            Type II Error (β)

 

A null hypothesis is rejected when it is true. This is a Type I error.

E.g.: a manufacturer's Quality Control department rejects a lot even though it has met the market-acceptable quality level. This is the Producer's Risk.

 


The null hypothesis is accepted when it is false. This is a Type II error.

E.g.: a consumer accepts a lot when it is faulty. This is the Consumer's Risk.

 


P-value:

In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was observed, assuming that the null hypothesis is correct.

Significance level:

The probability of rejecting the null hypothesis when it is true is called the significance level α.

Note: If the P-value is equal to or smaller than the significance level(α), it suggests that the observed data is inconsistent with the assumption that the null hypothesis is correct and thus this hypothesis must be rejected (but this does not automatically mean the alternative hypothesis can be accepted as true).

α is the probability of a Type I error and β is the probability of a Type II error.

The experimenters have the freedom to set the α level for a particular hypothesis test.

That level is called the level of significance for the test. Changing α can (and often does) affect the results of the test: whether you reject or fail to reject H0.

As α increases, β decreases and vice versa.

The only way to decrease both α and β is to increase the sample size. To make both quantities equal to zero, the sample size would have to be infinite, and you would have to sample the entire population.

Possible Errors in Hypothesis Testing

The confidence coefficient (1-α) is the probability of not rejecting H0 when it is True.

The Confidence level of a hypothesis test is (1-α) * 100%.

The power of a statistical test (1-β) is the probability of rejecting H0 when it is false.
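These quantities can be explored directly in base R with power.t.test(). The effect size, standard deviation, and sample size below are illustrative values, not taken from any of the case studies:

```r
# Power of a one-sample, one-sided t-test detecting a shift of 5 units
# (sd = 10) at significance level alpha = 0.05 with n = 30 observations
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")
```

Increasing n in this call raises the reported power (1-β) while α stays fixed, which is the trade-off described above.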

How to correct errors in Hypothesis Testing

Steps to follow:

Determine a p-value when testing a null hypothesis.

If the alternative hypothesis is "less than" the null value, you reject H0 only if the test statistic falls in the left tail of the distribution (below the lower critical value). Similarly, if H1 is "greater than" the null value, you reject H0 only if the test statistic falls in the right tail (above the upper critical value).


Upper-Tailed, Lower-Tailed, and Two-Tailed Tests

H1: µ > µ0, where µ0 is the comparator or null value and an increase is hypothesized; this type of test is called an upper-tailed test.

H1: µ < µ0, where a decrease is hypothesized; this type of test is called a lower-tailed test.

H1: µ ≠ µ0, where a difference is hypothesized; this type of test is called a two-tailed test.
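Assuming a t-statistic tstat with df degrees of freedom has already been computed (the values below are illustrative), the three p-values are obtained in R as:

```r
# Hypothetical test statistic and degrees of freedom
tstat <- 2.1
df    <- 29

p_upper <- pt(tstat, df, lower.tail = FALSE)          # H1: mu > mu0 (upper-tailed)
p_lower <- pt(tstat, df)                              # H1: mu < mu0 (lower-tailed)
p_two   <- 2 * pt(abs(tstat), df, lower.tail = FALSE) # H1: mu != mu0 (two-tailed)
```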

 

Types of T-test in Hypothesis Test:

One-sample t-test – used to compare the mean of one sample to a specified theoretical mean µ.

The unpaired two-sample t-test (independent t-test) – used to compare the means of two unrelated groups of samples.

Paired t-test – used to compare the means of two related samples, that is, when you have two values (a pair of values) for the same samples.
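The three variants map onto three calls to R's t.test(); the vectors x and y below are simulated placeholders for your own data:

```r
set.seed(1)
x <- rnorm(30, mean = 10)   # first sample (illustrative)
y <- rnorm(30, mean = 8)    # second sample (illustrative)

t.test(x, mu = 9)                # one-sample: test against theoretical mean 9
t.test(x, y, var.equal = TRUE)   # unpaired (independent) two-sample t-test
t.test(x, y, paired = TRUE)      # paired t-test on two related samples
```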

Let’s Look at some Case studies:

t-Test Application: One Sample

Experience Marketing Services reported that the typical American spends a mean of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device. (Source: The 2014 Digital Marketer, available at ex.pn/1kXJifX.) To test the validity of this statement, you select a sample of 30 friends and family. The results for the time spent per day accessing the Internet via a mobile device (in minutes) are stored in the Internet_Mobile_Time.csv file.

Is there evidence that the population mean time spent per day accessing the Internet via a mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05.

What assumption about the population distribution is needed to conduct the test?

Solution In R

>setwd("D:/Hypothesis")

> mydata=read.csv("InternetMobileTime.csv", header = TRUE)


> attach(mydata)

> xbar=mean(Minutes)

> s=sd(Minutes)

> n=length(Minutes)

> Mu=144 #null hypothesis

> tstat=(xbar-Mu)/(s/(n^0.5))

> tstat

[1] 1.224674

> Pvalue=2*pt(tstat, df=n-1, lower=FALSE)

> Pvalue

[1] 0.2305533

> if(Pvalue < 0.05) "Rejected" else "Accepted"

[1] "Accepted"

t-Test Application: Independent Two-Sample

A hotel manager looks to enhance the initial impressions that hotel guests have when they check-in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day was selected each from Wing A and Wing B of the hotel. The data collated is given in Luggage.csv file. Analyze the data and determine whether there is a difference in the mean delivery times in the two wings of the hotel. (use alpha = 0.05).

Solution In R

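Assuming Luggage.csv holds one delivery-time column per wing, with column names WingA and WingB as used in the t.test call below, the data can be loaded with:

```r
luggage <- read.csv("Luggage.csv", header = TRUE)
attach(luggage)   # makes WingA and WingB available by name
```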

> t.test(WingA,WingB, var.equal = TRUE, alternative = "greater")

    Two Sample t-test

data:  WingA and WingB

t = 5.1615,

df = 38,

p-value = 4.004e-06

alternative hypothesis: true difference in means is greater than 0

95 percent confidence interval:

1.531895   Inf

sample estimates:

mean of x mean of y

 10.3975 8.1225

> t.test(WingA,WingB)

   Welch Two Sample t-test

data:  WingA and WingB

t = 5.1615, df = 37.957, p-value = 8.031e-06

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

1.38269 3.16731

sample estimates:

mean of x mean of y

 10.3975 8.1225

> boxplot(WingA, WingB, col = c("Red","Pink"), horizontal = TRUE)

[Boxplot of delivery times for Wing A and Wing B]

Case Study- Titan Insurance Company

The Titan Insurance Company has just installed a new incentive payment scheme for its lift policy salesforce. It wants to have an early view of the success or failure of the new scheme. Indications are that the sales force is selling more policies, but sales always vary in an unpredictable pattern from month to month and it is not clear that the scheme has made a significant difference.

Life Insurance companies typically measure the monthly output of a salesperson as the total sum assured for the policies sold by that person during the month. For example, suppose salesperson X has, in the month, sold seven policies for which the sums assured are £1000, £2500, £3000, £5000, £10000, £35000. X’s output for the month is the total of these sums assured, £61,500. Titan’s new scheme is that the sales force receives low regular salaries but are paid large bonuses related to their output (i.e. to the total sum assured of policies sold by them). The scheme is expensive for the company, but they are looking for sales increases which more than compensate. The agreement with the sales force is that if the scheme does not at least break even for the company, it will be abandoned after six months.

The scheme has now been in operation for four months. It has settled down after fluctuations in the first two months due to the changeover.

To test the effectiveness of the scheme, Titan has taken a random sample of 30 salespeople measured their output in the penultimate month before changeover and then measured it in the fourth month after the changeover (they have deliberately chosen months not too close to the changeover). The outputs of the salespeople are shown in Table 1

SALESPERSON   Old_Scheme   New_Scheme
1             57           62
2             103          122
3             59           54
4             75           82
5             84           84
6             73           86
7             35           32
8             110          104
9             44           38
10            82           107
11            67           84
12            64           85
13            78           99
14            53           39
15            41           34
16            39           58
17            80           73
18            87           53
19            73           66
20            65           78
21            28           41
22            62           71
23            49           38
24            84           95
25            63           81
26            77           58
27            67           75
28            101          94
29            91           100
30            50           68

 

Data preparation

Since the given data are in £000, convert them to actual amounts by multiplying by 1000.

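A minimal sketch of this preparation step, assuming the data from Table 1 are saved in a file named Titan.csv (the file name is an assumption) with columns Old_Scheme and New_Scheme:

```r
titan <- read.csv("Titan.csv", header = TRUE)   # file name assumed
old_scheme <- titan$Old_Scheme * 1000   # convert from GBP thousands to GBP
new_scheme <- titan$New_Scheme * 1000
```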

Problem 1

Describe the five per cent significance test you would apply to these data to determine whether the new scheme has significantly raised outputs? What conclusion does the test lead to?

Solution:

Since the question asks whether the new scheme has significantly raised the output, this is an example of a one-tailed t-test.

Note: a two-tailed test could have been used if the question had been whether the new scheme has significantly changed the output.

Mean of amount assured before the introduction of the scheme = 68,033.33

Mean of amount assured after the introduction of the scheme = 72,033.33

Difference in means = 72,033.33 – 68,033.33 = 4,000

Let,

μ1 = Average sums assured by salesperson BEFORE changeover. μ2 = Average sums assured by salesperson AFTER changeover.

H0: μ1 = μ2  ; μ2 – μ1 = 0

HA: μ1 < μ2   ; μ2 – μ1 > 0 ; true difference of means is greater than zero.

Since the population standard deviation is unknown and the two measurements are taken on the same salespeople, a paired sample t-test will be used.

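The test itself can be sketched as follows, assuming vectors old_scheme and new_scheme hold each salesperson's outputs before and after the changeover:

```r
# Paired one-tailed t-test: has the new scheme raised mean output?
t.test(new_scheme, old_scheme, paired = TRUE, alternative = "greater")
```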

 

Since the p-value (0.06529) is higher than 0.05, we fail to reject the null hypothesis. The new scheme has NOT significantly raised outputs.

Problem 2

Suppose it has been calculated that for Titan to break even, the average output must increase by £5000. If this figure is an alternative hypothesis, what is:

(a)  The probability of a type 1 error?

(b)  What is the p-value of the hypothesis test if we test for a difference of £5000?

(c)   Power of the test:

Solution:

2.a.  The probability of a type 1 error?

Solution: Probability of Type I error = significant level = 0.05 or 5%

2.b.  What is the p-value of the hypothesis test if we test for a difference of £5000?

Solution:

Let  μ2 = Average sums assured by salesperson AFTER changeover.

μ1 = Average sums assured by salesperson BEFORE changeover.

μd = μ2 – μ1

H0: μd ≤ 5000

HA: μd > 5000

This is a right tail test.

R code:

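This right-tailed test can be sketched as follows, again assuming vectors old_scheme and new_scheme hold each salesperson's converted outputs:

```r
# Test H0: mu_d <= 5000 against HA: mu_d > 5000
t.test(new_scheme, old_scheme, paired = TRUE, mu = 5000,
       alternative = "greater")
```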

 

P-value = 0.6499

2.c. Power of the test.

Solution:

Let μ2 = average sums assured by salesperson AFTER changeover, μ1 = average sums assured by salesperson BEFORE changeover, and μd = μ2 – μ1. H0: μd = 0

HA: μd > 0

H0 will be rejected if test statistics > t_critical.

With α = 0.05 and df = 29, critical value for t statistic (or t_critical ) will be   1.699127.

Hence, H0 will be rejected for test statistic ≥ 1.699127, that is, for 𝑥̅ ≥ 4368.176.


R Code:

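The critical values quoted above can be reproduced as follows; the standard error of the mean difference (≈ 2570.8) is backed out from 4368.176 / 1.699127:

```r
n  <- 30
se <- 2570.8                         # standard error of the mean difference
t_critical <- qt(0.95, df = n - 1)   # 1.699127
xbar_critical <- t_critical * se     # ≈ 4368
```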

The probability of a Type II error is P(do not reject H0 | H0 is false).

Our null hypothesis is H0: μd = 0; HA: μd > 0.

Probability of a Type II error at μd = 5000

= P(do not reject H0 | μd = 5000)

= P(𝑥̅ < 4368.176 | μd = 5000)

= P(t < (4368.176 – 5000) / (s_d/√30))

= P(t < –0.245766)

= 0.4037973

R Code:
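A sketch of the β computation, using the same standard error of the mean difference (≈ 2570.8) as above:

```r
# beta = P(xbar < 4368.176) when the true mean difference is 5000
beta  <- pt((4368.176 - 5000) / 2570.8, df = 29)  # P(t < -0.245766) ≈ 0.4038
power <- 1 - beta                                 # ≈ 0.5962
```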


 

Now, β = 0.4037973,

Power of test = 1 – β = 1 – 0.4037973

= 0.5962027

 

Conclusion:

A hypothesis should always state what is expected to happen. It contains both an independent and a dependent variable. It should be testable and measurable, but it may or may not be correct. You have to ascertain the truth of the hypothesis by using hypothesis testing.

Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. The usual process of hypothesis testing consists of four steps.

  • Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternate hypothesis (commonly, that the observations show a real effect combined with a component of chance variation).
  • Identify a test statistic that can be used to assess the truth of the null hypothesis.
  • Compute the p-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis is correct. The smaller the p-value, the stronger the evidence against the null hypothesis.
  • Compare the p-value to an acceptable significance level α (sometimes called an alpha value). If p ≤ α, the observed effect is statistically significant: the null hypothesis is rejected, and the alternative hypothesis is favoured.

 

21 Open Source Python Libraries You Should Know About

Reading Time: 7 minutes

Chances are that you have already heard of 'Python'. Guido van Rossum's brainchild, Python, which dates back to the '80s, has become an avid game changer. It is one of the most popular coding languages today and is widely used for a gamut of applications. In this article, we have listed 21 Open Source Python Libraries you should know about.

What is a Library?

A library is a collection of pre-written code that can be used iteratively to reduce the time required to code. Libraries are particularly useful for accessing frequently used routines instead of writing them from scratch every single time. Similar to physical libraries, they are collections of reusable resources. This is the foundation behind the numerous open-source libraries available in Python.

Let’s Get Started!

1. Scikit-learn: It is a free machine learning library for the Python programming language and can be effectively used for a variety of applications, including classification, regression, clustering, model selection, naive Bayes, gradient boosting, k-means, and preprocessing.

Scikit-learn requires:

  • Python (>= 2.7 or >= 3.3),
  • NumPy (>= 1.8.2),
  • SciPy (>= 0.13.3).


Spotify uses Scikit-learn for its music recommendations and Evernote for building their classifiers. If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip.

2. NuPIC: The Numenta Platform for Intelligent Computing (NuPIC) is a platform which aims to implement an HTM (Hierarchical Temporal Memory) learning algorithm and make it open source as well. It is the foundation for future machine learning algorithms based on the biology of the neocortex. Click here to check their code on GitHub.

3. Ramp: It is a Python library which is used for rapid prototyping of machine learning models. Ramp provides a simple, declarative syntax for exploring features, algorithms, and transformations. It is a lightweight pandas-based machine learning framework and can be used seamlessly with existing python machine learning and statistics tools.

4. NumPy: When it comes to scientific computing, NumPy is one of the fundamental packages for Python providing support for large multidimensional arrays and matrices along with a collection of high-level mathematical functions to execute these functions swiftly. NumPy relies on BLAS and LAPACK for efficient linear algebra computations. NumPy can also be used as an efficient multi-dimensional container of generic data.


The various NumPy installation packages can be found here.
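As a small illustration of what NumPy's multidimensional arrays enable, the operations below act element-wise on a whole array without any explicit loops:

```python
import numpy as np

# A 2x3 array; arithmetic and reductions apply to the whole array at once
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a * 2)          # element-wise multiplication
print(a.sum(axis=0))  # column sums: [5 7 9]
print(a.mean())       # mean of all six elements: 3.5
```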

5. Pipenv: The officially recommended tool for Python in 2017 – Pipenv is a production-ready tool that aims to bring the best of all packaging worlds to the Python world. The cardinal purpose is to provide users with a working environment which is easy to set up. Pipenv, the “Python Development Workflow for Humans” was created by Kenneth Reitz for managing package discrepancies. The instructions to install Pipenv can be found here.

6. TensorFlow: The most popular deep learning framework, TensorFlow is an open-source software library for high-performance numerical computation. It is an iconic math library and is also used for machine learning and deep learning algorithms. Tensorflow was developed by the researchers at the Google Brain team within Google AI organisation, and today it is being used by researchers for machine learning algorithms, and by physicists for complex mathematical computations. The following operating systems support TensorFlow: macOS 10.12.6 (Sierra) or later; Ubuntu 16.04 or later; Windows 7 or above; Raspbian 9.0 or later.

7. Bob: Developed at Idiap Research Institute in Switzerland, Bob is a free signal processing and machine learning toolbox. The toolbox is written in a mix of Python and C++. From image recognition to image and video processing using machine learning algorithms, a large number of packages are available in Bob to make all of this happen with great efficiency in a short time.

8. PyTorch: Introduced by Facebook in 2017, PyTorch is a Python package which gives the user a blend of 2 high-level features – Tensor computation (like numpy) with strong GPU acceleration and developing Deep Neural Networks on a tape-based auto diff system. PyTorch provides a great platform to execute Deep Learning models with increased flexibility and speed built to be integrated deeply with Python.

9. PyBrain: PyBrain contains algorithms for neural networks that can be used by entry-level students yet can be used for state-of-the-art research. The goal is to offer simple, flexible yet sophisticated and powerful algorithms for machine learning with many pre-determined environments to test and compare your algorithms. Researchers, students, developers, lecturers, you and me – we can all use PyBrain.


10. MILK: This machine learning toolkit in Python focuses on supervised classification with a gamut of classifiers available: SVM, k-NN, random forests, decision trees. A range of combination of these classifiers gives different classification systems. For unsupervised learning, one can use k-means clustering and affinity propagation. There is a strong emphasis on speed and low memory usage. Therefore, most of the performance-sensitive code is in C++. Read more about it here.

11. Keras: It is an open-source neural network library written in Python, designed to enable fast experimentation with deep neural networks. With deep learning becoming ubiquitous, Keras becomes the ideal choice as its API is designed for humans and not machines, according to the creators. With over 200,000 users as of November 2017, Keras has stronger adoption in both the industry and the research community than even TensorFlow or Theano. Before installing Keras, it is advised to install the TensorFlow backend engine.

12. Dash: From exploring data to monitoring your experiments, Dash is like the frontend to the analytical Python backend. This productive Python framework is ideal for data-visualization apps and is particularly suited for every Python user.

13. Pandas: It is an open-source, BSD-licensed library. Pandas provides easy-to-use data structures and quick data analysis tools for Python. For operations like data analysis and modelling, Pandas makes it possible to carry these out without needing to switch to a more domain-specific language like R. The best way to install Pandas is via Conda.
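A small taste of the Pandas data structures: building a DataFrame and aggregating it in a couple of lines (the records here are made up for illustration):

```python
import pandas as pd

# A tiny DataFrame of sales records
df = pd.DataFrame({
    "city":  ["Pune", "Pune", "Delhi"],
    "sales": [100, 150, 200],
})

# Group by city and total the sales
totals = df.groupby("city")["sales"].sum()
print(totals["Pune"])   # 250
```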


14. SciPy: This is yet another open-source library used for scientific computing in Python. Apart from that, SciPy is also used for data computation, productivity, high-performance computing and quality assurance. The various installation packages can be found here. The core packages of the SciPy ecosystem are NumPy, the SciPy library, Matplotlib, IPython, SymPy, and Pandas.

15. Matplotlib: All the libraries that we have discussed are capable of a gamut of numeric operations, but when it comes to dimensional plotting, Matplotlib steals the show. This open-source Python library is widely used for producing publication-quality figures in a variety of hard-copy formats and interactive environments across platforms. You can design charts, graphs, pie charts, scatterplots, histograms, error charts, etc. with just a few lines of code.


The various installation packages can be found here.

16. Theano: This open-source library enables you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. For a humongous volume of data, handcrafted C codes become slower. Theano enables swift implementations of code. Theano can recognise unstable expressions and yet compute them with stable algorithms which gives it an upper hand over NumPy. Follow the link to read more about Theano. The closest Python package to Theano is Sympy. So let us talk about it.

17. SymPy: For all the symbolic mathematics, SymPy is the answer. This Python library for symbolic mathematics is an effective aid for a computer algebra system (CAS) while keeping the code as simple as possible to be comprehensible and easily extensible. SymPy is written entirely in Python and can be embedded in other applications and extended with custom functions. You can find the source code on GitHub.

18. Caffe2: The new boy in town – Caffe2 is a Lightweight, Modular, and Scalable Deep Learning Framework. It aims to provide an easy and straightforward way for you to experiment with deep learning. Thanks to Python and C++ API’s in Caffe2, we can create our prototype now and optimize later. You can get started with Caffe2 now with this step-by-step installation guide.

19. Seaborn: When it comes to visualisation of statistical models like heat maps, Seaborn is among the reliable sources. This Python library is derived from Matplotlib and closely integrated with Pandas data structures. Visit the installation page to see how this package can be installed

20. Hebel: This Python library is a tool for deep learning with neural networks using GPU acceleration with CUDA through PyCUDA. Right now, Hebel implements feed-forward neural networks for classification and regression on one or multiple tasks. Other models such as autoencoders, convolutional neural nets, and restricted Boltzmann machines are planned for the future. Follow the link to explore Hebel.

21. Chainer: A competitor to Hebel, this Python package aims at increasing the flexibility of deep learning models. The three key focus areas of Chainer include:

a. Transportation system: The makers of Chainer have consistently shown an inclination towards automatic driving cars and they have been in talks with Toyota Motors about the same.

b. Manufacturing industry: From object recognition to optimization, Chainer has been used effectively for robotics and several machine learning tools.

c. Bio-health care: To deal with the severity of cancer, the makers of Chainer have invested in research of various medical images for early diagnosis of cancer cells.

The installation, projects and other details can be found here.

So here is a list of the common Python Libraries which are worth taking a peek at and if possible familiarizing yourself with. If you feel there is some library which deserves to be in the list do not forget to mention it in the comments.

 

How Boredom Led To The Creation Of Python

Reading Time: 4 minutes

If necessity is the mother of invention, boredom is the father. From toothpaste with a squeezer to cupcakes, from mirror wipers to pizza scissors, it was sheer boredom that led to these wacky inventions.

Boredom always precedes a period of great creativity. Believe it or not, it was the boredom of a single man in the late 80s, that led to one of the ‘most searched words on Google’ surpassing ‘selfie’ in 2018 – Python.

In every great dynasty, there comes a successor, who in time, becomes the predecessor, and then the cycle repeats. With emerging technologies inclining towards Artificial Intelligence, Machine Learning and Deep Learning, Python has become a clear dominator.

Programming is a hard shell to crack when it is not your forte. At times, daunting and repulsive as well. It has been observed that Python has gained popularity not only among technical gurus but among amateurs as well. About 95% of all the data online was generated in the last 2 years. And with such massive amounts of data, the chances of securing a job have increased substantially. Python is a comparatively easy language to learn. Several libraries in Python are specifically written for Data Science, and this allows you to keep your coding to the minimum and yet get meaningful results.

The 2 main advantages of the language are its simplicity and flexibility. Its simple and straightforward syntax and use of indented space make it easy to learn, read and share. The people who practice the language, aka Pythonistas, have uploaded 145,000 custom-built software packages, also called libraries, to an online repository. These cover everything from game development to astronomy and can be installed and inserted into a Python program in a matter of seconds.

To put things in perspective: in the past 12 months, Google searches for Python have surpassed Google searches for Kim Kardashian! Python was the top-ranked language on the Tiobe Index in 2007 and 2010, and is among the top contenders this year as well. Job seekers' interest in Python has witnessed a steep rise, surpassing R and SAS as of January 2018.

So What’s Next?

Python dates back to the late '80s and was first implemented in 1989 by Guido van Rossum, the creator of Python, as a successor to the ABC language, an imperative general-purpose programming language. What makes Python desirable is its simplicity, flexibility, compatibility and versatility, plus the fact that it is free; the numerous open-source libraries are the icing on the cake.

The language can be used for almost anything: from the basic ‘hello world!’ algorithm to complex Machine Learning algorithms for Face Recognition, Drone Imagery, Internet of Things, Gaming, Robotics, Natural Language Processing, healthcare and many more. Moreover, the code is concise and easily understandable even for those who have never written it before, at least in the nascent stages.

With such a rapidly growing user base, Python might seem destined to become the lingua franca of coding, rendering all other competitors obsolete. Having said that, researchers still believe that it is unlikely that Python will replace C or C++, which provide the user complete control over what is going on inside the processor, nor will it replace Java or Javascript which power most web pages today.

Although we can see Java at the top of the list, a more lucrative observation is the steady decline and the steep incline in the use of Java and Python respectively. Even though Python has become ubiquitous, its competitors haven’t left the battleground. From lower level languages like C and C++ to Java and Javascript, Python still has a tough competition to be wary of.

But in a surprising turn of events:

Guido van Rossum sys.exit()s due to overload.

Yes, Van Rossum is no longer associated with Python.

“Now that PEP 572 is done, I don’t ever want to have to fight so hard for a PEP and find that so many people despise my decisions.” – Guido Van Rossum.

A PEP is a Python Enhancement Proposal, which upgrades Python with new features, much like the app updates we receive. He was also quoted saying, "I am not going to appoint a successor."

So does this render the Great Python Dynasty, one without a successor? Only time will tell.

If you are in the early stages of your graduation or are planning to take up a course, you should definitely consider learning Python for Machine Learning as your first option.