A few lines I wrote, dedicated to data engineers:
Data Data everywhere, consumers are now more aware
So mine the data with utmost care, and serve them everywhere.
Yes, that is how valuable it is to treat and process data with the required precision, so that you can serve your customers and consumers effectively and responsibly.
In applied statistics, we try to ensure the data is reliable and clean, so that we can build a model that works well at finding the hidden patterns. To analyze a given set of input data, the field of applied statistics broadly makes use of:
1. Descriptive Statistics
2. Inferential Statistics
Today we will cover descriptive statistics in detail, along with a few basics of inferential statistics. We will cover inferential statistics in more detail in the next part of Applied Statistics in Data Science.
Descriptive statistics enables a simpler, more meaningful interpretation of data and helps you visualize it better (in the form of simple graphs).
As per Investopedia,
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, the minimum and maximum variables, and the kurtosis and skewness.
Key Caveats Of Descriptive Statistics:
As the definition above shows, descriptive statistics are simply a way to describe our data. However, they do not allow us to draw conclusions beyond the data we have analyzed (that part is handled by inferential statistics).
The key mechanisms descriptive statistics employ to summarize and describe a data set are finding:
- Central Tendency in a given data set
- Spread of the data(variability of data)
A: Measure Of A Central Tendency
It’s a way of finding/describing the central position of a frequency distribution from within the given data sets. A measure of central tendency is a single value, that attempts to describe a set of data by identifying the central position within that set of data.
In applied statistics, a central tendency (or measure of central tendency) is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages.
How Do We Measure Central Tendency?
To measure central tendency in a given data set, we use the 3 M's: mean, median, and mode.
Let's get into the details of each of these quickly:
Measuring Through Mean(Arithmetic Mean):
We have all used this one in our school and college days, and it is the measure we are most familiar with.
As per wiki:
In statistics, the arithmetic mean, or simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the count of numbers in the collection. The collection is often a set of results of an experiment or an observational study, or frequently a set of results from a survey.
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although its use is most often with continuous data. The mean, as you all know, is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set and they have values x1, x2, …, xn, the sample mean, usually denoted by x̄ (pronounced x bar), is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced “sigma”, which means “sum of…”:
$$\bar{x} = \frac{\sum x}{n}$$
Here x̄ represents the sample mean. It is imperative to understand that we are talking about the sample mean and not the population mean. A sample is a small set of data carved out of a population (which is a huge collection of data).
To indicate that we are calculating the population mean and not the sample mean, we use the Greek lowercase letter “mu”, denoted as µ, with N as the number of values in the population:
$$\mu = \frac{\sum x}{N}$$
One key property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
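To make the mean and its zero-sum property concrete, here is a minimal Python sketch. The values in `sample` are hypothetical, chosen only for illustration; they are not from any data set discussed here.

```python
from statistics import fmean

# Hypothetical sample values, chosen only for illustration.
sample = [4, 8, 15, 16, 23, 42]

# Sample mean: the sum of the values divided by the number of values.
x_bar = sum(sample) / len(sample)

# Deviation of each value from the mean.
deviations = [x - x_bar for x in sample]

print(x_bar)                      # 18.0
print(fmean(sample))              # same result from the standard library
print(round(sum(deviations), 9))  # 0.0, the deviations cancel out
```

With real-valued data the sum of deviations may differ from zero by a tiny floating-point error, which is why the sketch rounds before printing.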
Measuring Through Median:
Median is the middle number in a sorted list of numbers. To determine the median value in a sequence of numbers, the numbers must first be arranged in value order from lowest to highest. If there is an odd amount of numbers, the median value is the number that is in the middle, with the same amount of numbers below and above.
If there is an even amount of numbers in the list, the middle pair must be determined, added together and divided by two to find the median value. The median can be used to determine an approximate average, or mean.
The median is sometimes used as opposed to the mean when there are outliers in the sequence that might skew the average of the values. The median of a sequence can be less affected by outliers than the mean.
The median and the mode are the only measures of central tendency that can be used for ordinal data, in which values are ranked relative to each other but are not measured absolutely.
Case 1: When there is a middle value that separates the entire data sets into 2 equal subsets of data.
List: 30, 13, 20, 34, 11, 22, 45
Arrange the values in ascending order:
The list becomes: 11, 13, 20, 22, 30, 34, 45
Here the median is 22, which divides the values equally into two halves (3 values on each side).
Case 2: When there is an even number of values, there is no single middle value. To find the median, first arrange the numbers in order from lowest to highest:
List: 3, 13, 2, 34, 11, 26, 47, 17
Arranged in order, the list becomes: 2, 3, 11, 13, 17, 26, 34, 47
The median is the average of the two numbers in the middle, 13 and 17:
(13 + 17) / 2 = 30 / 2 = 15. Fifteen is the median value in this list of numbers.
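Both cases can be sketched in a few lines of Python. The helper name `median_by_hand` is my own, used only for illustration; the standard library's `statistics.median` does the same job.

```python
from statistics import median

def median_by_hand(values):
    """Return the median: middle value, or mean of the two middle values."""
    ordered = sorted(values)   # the values must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd count: a single middle value
        return ordered[mid]
    # even count: average the middle pair
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_list = [30, 13, 20, 34, 11, 22, 45]
even_list = [3, 13, 2, 34, 11, 26, 47, 17]

print(median_by_hand(odd_list))   # 22
print(median_by_hand(even_list))  # 15.0
print(median(even_list))          # 15.0, same answer from the standard library
```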
Measuring through Mode:
The mode is the value that occurs most frequently in a given data set. To determine the mode, you might again order the values as shown above and then count each one. The most frequently occurring value is the mode.
If X is a discrete random variable, the mode is the value x (i.e, X = x) at which the probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled.
For example, the mode of the sample
List1: 1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17
Here mode is 6.
Given the list of data :
List2: 1, 1, 2, 4, 4
Here the mode is not unique: both 1 and 4 occur twice. The dataset may be said to be bimodal, while a set with more than two modes may be described as multimodal.
Normally, the mode is used for categorical data, where we wish to know which category is the most common.
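As a quick check of the two lists above, here is a small sketch using `multimode` from Python's standard `statistics` module (available from Python 3.8), which returns every value tied for the highest count.

```python
from statistics import multimode

list1 = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
list2 = [1, 1, 2, 4, 4]

# multimode returns all most-frequent values, in order of first appearance.
print(multimode(list1))  # [6]
print(multimode(list2))  # [1, 4]
```

I use `multimode` rather than `mode` here because it makes ties visible instead of silently returning just one of the tied values.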
When To Use What In Descriptive Statistics To Measure Central Tendency?
Here is a summary of the best measure of central tendency for each type of variable.
Type of variable: Best measure of central tendency
For Nominal: Mode
For Ordinal: Median
For Interval/Ratio (not skewed): Mean
For Interval/Ratio (skewed): Median
The Case Of Skewed Distribution:
Sometimes data is not normally distributed. It is important to test our data sets for normality, because a normal distribution is a common assumption underlying many statistical analyses.
When you have a normally distributed sample, you can use either the mean or the median as your measure of central tendency. In fact, in any symmetric distribution with a single peak, the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set in its calculation.
The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.
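A small sketch shows how a single extreme value pulls the mean away from the median. Both lists are hypothetical, made up purely to illustrate the effect.

```python
from statistics import fmean, median

# Hypothetical data: the second list replaces one value with an extreme one.
symmetric = [2, 4, 6, 8, 10]
skewed = [2, 4, 6, 8, 100]

print(fmean(symmetric), median(symmetric))  # 6.0 6
print(fmean(skewed), median(skewed))        # 24.0 6
```

The median stays at 6 in both cases, while the mean jumps from 6 to 24, which is why the median is the safer summary for skewed data.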
B: Spread Of The Data(Variability Of Data)
Measures of spread describe how similar or varied the set of observed values are for a particular variable (data item).
A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population. It is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data.
Measures of spread fall into 3 important categories:
- Range,
- Quartiles and the interquartile range,
- Variance and standard deviation.
Let's quickly cover each of these.
Range:
The range is the difference between the highest and lowest scores in a data set and is the simplest measure of spread.
Range = maximum value - minimum value
Example: 22, 45, 56, 32, 10, 9, 54
In the above data set, Max = 56 and Min = 9
So range = Max - Min = 56 - 9 = 47
The range is not a very popular measure of spread, but it does set the boundaries of the scores. This can be useful if you are measuring a variable that has a critical low threshold, a critical high threshold, or both, that should not be crossed.
In statistical analysis, the range is represented by a single number. In financial data, this range most commonly refers to the highest and lowest price value for a given day or another time period.
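The range calculation from the example above takes one line of Python; this is only a minimal sketch using the built-in `max` and `min`.

```python
scores = [22, 45, 56, 32, 10, 9, 54]

# Range = maximum value - minimum value
data_range = max(scores) - min(scores)

print(max(scores), min(scores))  # 56 9
print(data_range)                # 47
```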
Quartiles & Interquartile Range:
The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
Let's first understand what quartiles are, and then go deeper into the IQR concept through some examples.
Quartiles divide an ordered dataset into four equal parts and refer to the values at the points between the quarters. A dataset may also be divided into quintiles (five equal parts) or deciles (ten equal parts).
A quartile is a type of quantile. The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.
Example 1 :
List = [25, 33, 14,31,54,76,57,87, 81]
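Quartiles are defined on ordered data, so first sort the list in ascending order:
Sorted list = [14, 25, 31, 33, 54, 57, 76, 81, 87]
Now find the median:
Median = 54, which separates the sorted data into two equal halves
So Q2 = 54 (median of the whole list)
Q1 = 31 (median of the lower half, values 1 to 5 of the sorted list)
Q3 = 76 (median of the upper half, values 5 to 9 of the sorted list)
For the above example:
IQR (interquartile range) = Q3 - Q1 = 76 - 31 = 45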
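Here is a minimal sketch of the quartile calculation for Example 1, using `quantiles` from Python's standard `statistics` module. Quartiles are read from the sorted values, and `quantiles` sorts the data internally. I am assuming the "inclusive" method, which corresponds to counting the median in both halves; other quartile conventions can give slightly different numbers.

```python
from statistics import quantiles

values = [25, 33, 14, 31, 54, 76, 57, 87, 81]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(sorted(values))  # [14, 25, 31, 33, 54, 57, 76, 81, 87]
print(q1, q2, q3)      # 31.0 54.0 76.0
print(iqr)             # 45.0
```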
Example 2: Source
Data set in a plain-text box plot (number line from 0 to 12, with one outlier marked by *)
For the data set in this box plot:
- lower (first) quartile Q1 = 7
- median (second quartile) Q2 = 8.5
- upper (third) quartile Q3 = 9
- interquartile range, IQR = Q3 - Q1 = 2
- lower 1.5*IQR whisker = Q1 - 1.5 * IQR = 7 - 3 = 4
- upper 1.5*IQR whisker = Q3 + 1.5 * IQR = 9 + 3 = 12
The interquartile range is often used to find outliers in data.
Outliers here are defined as observations that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. In the box plot example discussed above, the highest and lowest values within these limits are indicated by the whiskers of the box, and any outliers are shown as individual points.
Quartiles are a useful measure of spread because they are much less affected by outliers or a skewed data set than the equivalent measures of mean and standard deviation. For this reason, quartiles are often reported along with the median as the best choice of measure of spread and central tendency, respectively, when dealing with skewed data and/or data with outliers.
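The 1.5 * IQR rule can be sketched as two small helpers. The function names `iqr_fences` and `find_outliers` and the `observations` list are hypothetical, introduced only for illustration; the quartiles Q1 = 7 and Q3 = 9 are the ones from the box plot example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value is an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall strictly outside the IQR fences."""
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]

low, high = iqr_fences(7, 9)
print(low, high)  # 4.0 12.0

# Hypothetical observations, not the actual data behind the box plot.
observations = [1.5, 5, 7, 8, 9, 10, 12]
print(find_outliers(observations, 7, 9))  # [1.5]
```

A value sitting exactly on a fence (12 here) is not flagged, since the rule only treats values below the lower limit or above the upper limit as outliers.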
Variance & Standard Deviation:
Variance is one of the most popular ways to measure the data spread of the given data set around the mean.
So let's first try to understand what variance actually means.
Variance (represented mathematically as σ²) is a measurement of the spread between numbers in a data set. It measures how far each number in the set is from the mean (central tendency) and is calculated by taking the differences between each number in the set and the mean, squaring the differences (to make them positive) and dividing the sum of the squares by the number of values in the set.
In datasets with a small data spread, all values are very close to the mean, resulting in a small variance and standard deviation. Where a dataset is more dispersed, values are spread further away from the mean, leading to a larger variance and standard deviation.
The smaller the variance and standard deviation, the more the mean value is indicative of the whole dataset. In the extreme case, if all values of a dataset are the same, the standard deviation and variance are zero.
Variance Formula :
The population variance σ² (pronounced sigma squared) of a discrete set of numbers is expressed by the following formula:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where:
Xi represents the ith unit, starting from the first observation to the last
μ represents the population mean
N represents the number of units in the population
Remember that the above formula applies to the entire population of a data set.
For a sample, we calculate the variance as given below:
The variance of a sample s² (pronounced s squared) is expressed by a slightly different formula, which divides by n - 1 instead of n:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where:
xi represents the ith unit, starting from the first observation to the last
x̅ represents the sample mean
n represents the number of units in the sample
The standard deviation is the square root of the variance. The standard deviation for a population is represented by σ, and the standard deviation for a sample is represented by s.
A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.
In addition to measuring the variability of a population, the standard deviation is also used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times.
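To see the difference between the population and sample calculations, here is a minimal sketch. The helper `variance_by_hand` and the `sample_values` list are hypothetical, used only for illustration. The sketch assumes the usual convention of dividing by N for a population and by n - 1 for a sample.

```python
import math

def variance_by_hand(values, sample=False):
    """Mean of squared deviations; divides by n - 1 when sample=True."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

# Hypothetical values, chosen so that the arithmetic is easy to follow.
sample_values = [2, 4, 4, 4, 5, 5, 7, 9]

population_var = variance_by_hand(sample_values)            # divide by 8
sample_var = variance_by_hand(sample_values, sample=True)   # divide by 7

print(population_var, math.sqrt(population_var))                # 4.0 2.0
print(round(sample_var, 2), round(math.sqrt(sample_var), 2))    # 4.57 2.14
```

Python's standard `statistics` module offers the same calculations as `pvariance` and `pstdev` for a population, and `variance` and `stdev` for a sample.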
Understanding Variance & Standard Deviation By Example(Src):
Let's understand the population variance σ² and standard deviation σ with the example given below.
Dataset A = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
So the population mean (μ) of Dataset A: (4 + 5 + 5 + 5 + 6 + 6 + 6 + 6 + 7 + 7 + 7 + 8) / 12
Mean (μ) = 6
Calculate the deviation of the individual values from the mean (6, calculated above) by subtracting the mean from each value in the dataset (Xi - μ):
= -2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2
Square each individual deviation value
= 4, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 4
Calculate the mean of the squared deviation values
= (4 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 4) / 12
Variance σ² = 14 / 12 ≈ 1.17
Calculate the square root of the variance
Standard deviation σ ≈ 1.08
Dataset B = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
So the population mean (μ) of Dataset B:
(1 + 2 + 3 + 4 + 5 + 6 + 6 + 7 + 8 + 9 + 10 + 11) / 12
Mean (μ) = 6
Calculate the deviation of the individual values from the mean (6, calculated above) by subtracting the mean from each value in the dataset
= -5, -4, -3, -2, -1, 0, 0, 1, 2, 3, 4, 5
Square each individual deviation value
= 25, 16, 9, 4, 1, 0, 0, 1, 4, 9, 16, 25
Calculate the mean of the squared deviation values
= (25 + 16 + 9 + 4 + 1 + 0 + 0 + 1 + 4 + 9 + 16 + 25) / 12
Variance σ² = 110 / 12 ≈ 9.17
Calculate the square root of the variance
Standard deviation σ ≈ 3.03
The larger Variance and Standard Deviation in Dataset B further demonstrates that Dataset B is more dispersed than Dataset A.
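The numbers for Dataset A and Dataset B can be double-checked with a short sketch using `pvariance` and `pstdev` from Python's standard `statistics` module, which treat the data as an entire population (dividing by N).

```python
from statistics import fmean, pstdev, pvariance

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

for name, data in (("A", dataset_a), ("B", dataset_b)):
    # Both datasets share the same mean but differ in spread.
    print(name, fmean(data), round(pvariance(data), 2), round(pstdev(data), 2))
# A 6.0 1.17 1.08
# B 6.0 9.17 3.03
```

Both datasets have the same mean of 6, so the mean alone cannot tell them apart; the variance and standard deviation are what reveal the wider spread of Dataset B.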
Variance Vs Standard Deviation :
I found an interesting infographic that explains the concept beautifully: [infographic: variance vs. standard deviation]
Here we learned about descriptive statistics: how to effectively describe and summarize a given set of data (population or sample) at the initial stage of EDA, using statistical concepts, before we start building our data models. We saw that data reliability is of utmost importance if we really want to build effective machine learning models. Descriptive statistics only help us build our observations around the data provided; if we really have to make intelligent predictions, we can't rely on them alone. For this, applied statistics has the concept of inferential statistics.
Inferential statistics are concerned with making inferences based on relations found in the sample, to relations in the population. Inferential statistics help us decide, for example, whether the differences between groups that we see in our data are strong enough to provide support for our hypothesis that group differences exist in general, in the entire population.
We will cover this in detail in the next part of this applied statistics series for data science aspirants: what it is, and how it helps us measure and establish data reliability so that we can make intelligent predictions about a data population or sample.
If you are looking to be an effective data science engineer, please make sure you clearly understand the fundamentals of applied statistics. Applied statistics is the foundational stepping stone that will pave a successful career path for you. When you start understanding data sets confidently, you will be able to measure data skew, find missing values, and measure data variability, which in turn will help you clean up your data to make it reliable and useful for data modeling.
Leaving you all with this food for thought:
“Being a data science engineer is more about designing great processes, built around data that is reliable and trustworthy.”
Keep reading, keep supporting
Thanks For Being There…