Inferential Statistics: Hypothesis Testing Using Normal Deviate Z-Test

What Is Z Test In Inferential Statistics & How It Works?

I Feel:

The more you analyze the data, the more enlightened a data engineer you will become.

What Is Normal Deviate Z Test & How It Works?

In data engineering you will often need to establish whether a data sample drawn from a population is reliable enough to build a model around. For instance, the data may have come from an old archive that no longer represents the true behavior of the process in a production environment: behavior changes over time, and so does the process on which the model was built.

So if we go ahead and build our new model around such old sample data, we may end up with a faulty process, and the model will not be effective or useful. So what we do is perform certain inferential statistical tests to check that the data is reliable.

One such test is the Normal Deviate Z Test, where we test our sample data to infer whether it has come from the population data, which is a true representation of process behavior in a production environment, before we go ahead and build a model around it.

Earlier, in part 1 of Inferential Statistics, we learned about the Chi-Square test.

I would invite you all to read it. As promised, today we will cover another statistical testing technique used in inferential statistics hypothesis testing to establish sample data reliability. So let’s get started with one such test, the normal deviate Z Test, which we will cover in detail below.


When we try to establish the reliability of a large sample data set (sample size > 30 is the norm) using the Normal deviate Z test, we compare the means of two distributions, such as the sample data in our data science project and the production data.

The Z-test compares sample and population means to determine if there is a significant difference. The Z test statistic is assumed to have a normal distribution, and nuisance parameters such as the standard deviation should be known in order to perform an accurate z-test.

How Does The Normal Deviate Z Test Work?

We will understand how the Z test functions in the following steps.

Step 1: Establishing Hypothesis:

It is the first thing data engineers need to state before performing any statistical test in inferential statistics.

H0: The difference in means between the sample variable and the population mean is a statistical fluctuation.

H1: The difference in means between the sample variable and the population mean is significant. The difference is too large to be the result of statistical fluctuation.

Step 2: Calculating Z test statistic

Before we calculate the statistic, note the prerequisites for performing a Z test on normally distributed data:

• The number of samples is >= 30
• The mean and standard deviation of the population are known

Z-test statistic Formula For Calculation:

The Z measure is calculated as:

Z = (M - μ) / SE

Where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

SE is calculated using the formula below:

SE = s / SQRT(n)

Where s is the population standard deviation and n is the sample size.

The standard error is the standard deviation of the sampling distribution of the mean (the distribution described by the Central Limit Theorem).
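The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name z_statistic and the numbers passed to it are hypothetical, chosen to keep the arithmetic easy to check by hand, and are not taken from any real data set.

```python
import math

def z_statistic(sample_mean, pop_mean, pop_std, n):
    """Return (standard_error, z) for a one-sample Z-test."""
    if n < 30:
        # The Z-test prerequisite: a reasonably large sample
        raise ValueError("the Z-test is normally reserved for n >= 30")
    standard_error = pop_std / math.sqrt(n)          # SE = s / SQRT(n)
    z = (sample_mean - pop_mean) / standard_error    # Z = (M - mu) / SE
    return standard_error, z

# Hypothetical values: M = 102, mu = 100, s = 10, n = 100
se, z = z_statistic(sample_mean=102.0, pop_mean=100.0, pop_std=10.0, n=100)
print(se, z)  # 1.0 2.0
```

Here the sample mean sits two standard errors above the population mean, which is what a Z value of 2.0 expresses.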

The formula above looks very similar to the Z score calculation. The difference is that a Z score standardizes a single observation using the standard deviation, while the normal deviate Z standardizes a sample mean using the standard error.

Step 3: Analyze the Z value to interpret P-Value

Once we have the Z value, we calculate the p-value, based on which we either reject or fail to reject the null hypothesis.
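As a rough sketch of this step, the two-sided p-value for a given Z can be computed with the standard library alone, using the identity P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal variable. The helper names two_sided_p_value and decide, the Z value of 2.0, and the 0.05 threshold default are assumptions made for illustration.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    # P(|Z| >= |z|) = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

def decide(p_value, alpha=0.05):
    """Translate a p-value into a decision about H0 at significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

p = two_sided_p_value(2.0)      # hypothetical Z value
print(round(p, 4), decide(p))   # 0.0455 reject H0
```

SciPy users get the same number from the normal survival function, doubled for a two-tailed test.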

Example Using Python & Jupyter Notebook:

So let’s try to understand the steps above using a practical example.

Install Anaconda distribution:

Once you are done with the installation, launch your Jupyter notebook and write (or copy) the following code to get started.

Import The Required Package:

Let’s import some relevant Python packages as shown below and create a dataframe by reading “pima-indians-diabetes.csv”, sourced from Kaggle.

`import pandas as pd`
`import numpy as np`
`import scipy`
`import matplotlib.pyplot as plt`
`import scipy.stats as st`
`import seaborn as sns`
`# Reading CSV file into df as pandas dataframe`
`df = pd.read_csv("pima-indians-diabetes.csv")`

Let’s call df.head(20) to view the first 20 rows of the given sample data set.

`df.head(20)`

Step 1: Let’s Formulate Our Null and Alternate Hypothesis:

Null Hypothesis:

H0: The difference between the mean of the sample BP column (the Pres column visible in the dataframe table above) and the population mean for BP is a statistical fluctuation.

Alternate Hypothesis:

H1: The difference between the mean of the sample BP column and the population mean is significant, and is not a case of mere statistical fluctuation.

Step 2: Let’s Calculate Z Stat (Z Test):

Z = (M - μ) / SE

Where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

So let’s do this calculation in Jupyter Notebook:

Here is the code snippet to calculate Z test:

`# Prerequisites: number of samples >= 30, and the mean and standard deviation of the population should be known`
`# Here the population average for diastolic blood pressure is 71.3, with a standard deviation of 7.2`
`# Let's apply the Normal Deviate Z test to the blood pressure (Pres) column of the given dataframe`
`mu = 71.3  # population mean (μ); source - http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/BS704_BiostatisticsBasics3.html`
`std = 7.2  # population standard deviation`
`# Let's find M, the mean of the BP column (Pres) in the given dataframe`
`MeanOfBpSample = np.average(df['Pres'])`
`print("Mean Of BP Column", MeanOfBpSample)`
`# n is the number of rows (observations), not df.size, which counts every cell (rows x columns)`
`SE = std / np.sqrt(len(df))`
`print("Standard Error: ", SE)`
`# Z_norm_deviate = (sample_mean - population_mean) / std_error_bp`
`Z_norm_deviate = (MeanOfBpSample - mu) / SE`
`print("Normal Deviate Z Value: ", Z_norm_deviate)`

If you type the above code in your notebook, the 768 rows of this data set give a mean of 69.10546875 for the BP column, a standard error of about 0.2598, and a normal deviate Z value of about -8.45.

Now that we know the Z Test Value, let’s find our p-value.

Calculating P-Value, Code Snippet:

`# We will use the scipy stats normal survival function sf`
`# We multiply sf by 2 for a two-sided p-value calculation (a two-tail test)`
`p_value = scipy.stats.norm.sf(abs(Z_norm_deviate)) * 2`
`print('p value', p_value)`
`if p_value > 0.05:`
`    print('Samples are likely drawn from the same distributions (fail to reject H0)')`
`else:`
`    print('Samples are likely drawn from different distributions (reject H0)')`

If you run the above code snippet in Jupyter, it prints the p-value followed by the message that the samples are likely drawn from different distributions (reject H0).

Step 3: Analyze the Z value to interpret P-Value

The p-value comes out to roughly 3e-17. As the p-value is far below the commonly accepted threshold of 0.05, we conclude that the given sample is unlikely to have come from the population distribution on which the process was built. There is a significant difference between the mean of the sample BP column and the population mean, so we reject the null hypothesis H0 in favor of the alternate hypothesis H1.

As we reject the null hypothesis here using the normal deviate Z test, it is advisable to avoid building any ML model on this sample data.

Aspiring and working data engineers need a clear understanding of the p-value, as it is the basis of most statistical data reliability tests. So let me quickly cover a few basics here; we will look into it more deeply in a separate article devoted entirely to the p-value.

What Is P Value?

The p-value, or probability value, tells us the probability of getting a value at least as extreme as the one observed in the sample, given that our null hypothesis is true.
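One way to make this definition concrete is a small simulation, sketched below under stated assumptions: we pretend the null hypothesis is true, draw many sample means from the sampling distribution it implies, and count how often a mean lands at least as far from the population mean as the observed one. The function name simulated_p_value, the seed, the trial count, and the input numbers are all hypothetical.

```python
import math
import random

def simulated_p_value(observed_mean, pop_mean, pop_std, n, trials=100_000, seed=0):
    """Estimate a two-sided p-value by simulating sample means under H0."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    se = pop_std / math.sqrt(n)            # standard error of the mean
    observed_gap = abs(observed_mean - pop_mean)
    extreme = 0
    for _ in range(trials):
        # Under H0 the sample mean is approximately Normal(pop_mean, se)
        simulated_mean = rng.gauss(pop_mean, se)
        if abs(simulated_mean - pop_mean) >= observed_gap:
            extreme += 1
    return extreme / trials

# Hypothetical inputs: observed mean 102 against mu = 100, s = 10, n = 100 (Z = 2)
p_sim = simulated_p_value(observed_mean=102.0, pop_mean=100.0, pop_std=10.0, n=100)
p_exact = math.erfc(2.0 / math.sqrt(2))    # analytic two-sided p-value for Z = 2
print(p_sim, p_exact)
```

The simulated fraction should land close to the analytic value, which is the sense in which the p-value is a probability computed under the null hypothesis.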

How to calculate p-value in general?

1. Assume the null hypothesis to be true
2. Calculate the z or t statistic for the value observed in the sample
3. From the z/t-table, find the probability associated with the z or t value obtained above. You can also find the p-value with SciPy’s built-in methods; you just need to pass in the z or t statistic calculated in the previous step.
4. This probability is the p-value you need

We will cover p-value calculations, how to interpret them, and their use cases separately later on. You will also see the p-value in action as we cover all the hypothesis test types in our journey of understanding inferential statistics.

What’s Next?

In our next article, “Inferential Statistics: Hypothesis Testing using T Test”, we will cover the T test in detail.

I would like to leave you all with some basics of the T test.

What Is T Test ?

A t-test is a kind of inferential statistic used to find if there is a significant difference between the means of two given groups, which may be related in certain features.

A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine how likely the observed difference between two sets of data would be if there were no real difference.

Types Of T Test:

There are three types of t-test:

One-sample t-test:

Used to compare a sample mean with a known population mean or some other meaningful, fixed value

Independent samples t-test:

Used to compare two means from independent groups

Paired samples t-test:

1. Used to compare two means that are repeated measures for the same participants — scores might be repeated across different measures or across time.
2. Used also to compare paired samples, as in a two treatment randomized block design.
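As a preview, the t statistics behind the three types can be sketched with the standard library. This is a minimal illustration, not a full test: the function names and the toy data are hypothetical, the independent-samples version shown is Welch's variant (which does not assume equal variances), and turning a t statistic into a p-value needs the t-distribution with the appropriate degrees of freedom, which is not computed here.

```python
import math
import statistics

def one_sample_t(sample, pop_mean):
    """t statistic comparing a sample mean with a fixed value."""
    # Unlike the Z-test, the standard error uses the sample standard deviation
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - pop_mean) / se

def paired_t(first, second):
    """t statistic for paired measurements: a one-sample test on the differences."""
    differences = [b - a for a, b in zip(first, second)]
    return one_sample_t(differences, 0.0)

def welch_t(group_a, group_b):
    """t statistic for two independent groups (Welch's variant)."""
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Toy data, made up for illustration only
print(one_sample_t([5.1, 4.9, 5.3, 5.2, 5.0], 5.0))
print(paired_t([10, 12, 9, 11], [12, 14, 10, 14]))
print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

In practice, SciPy's ttest_1samp, ttest_ind, and ttest_rel functions return both the statistic and the p-value for these three cases.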

We will cover how to perform the above T tests using examples and hands-on lab exercises.

Do refer to the graphic below, which shows a decision tree to help you choose the right kind of hypothesis test for a given problem statement.

Summary:

Never rely on plain observation or assumption when you build a model on a given sample. Make sure you measure its distribution type and test the data sample using statistical hypothesis testing to ensure your sample data is reliable. Descriptive and inferential statistical techniques are designed to help you make better decisions in data sampling before modeling it in machine learning.

As data cleansing and EDA will fill a large part of your work life as a data scientist, it’s imperative that you take responsibility for handling data with the utmost clarity and care and test it for reliability. You are going to influence market dynamics in a big way, as your model is going to make some really critical business decisions.

I Feel:

Going wrong with data interpretation while building ML models may cost you heavily. So don’t just build models for the sake of building them; make sure they have been fed the right kind of food in terms of data. The right data feeding habit will do wonders when your machine makes intelligent and precise ML based predictions and recommendations for your business. Everybody in the ecosystem benefits from the model building process if it’s done right.