“Try to understand the problem statement first, keeping your trained intelligence aside, and try to analyze the given data as if you don’t know anything about it. Your honest acceptance that you know nothing will lead you to the process of building a model worth deploying.”
Process is more important than the outcome in the field of Data Science
In our last article on supervised ML, we covered the linear regression model, which deals with continuous attributes to find the impact of the independent variable on the dependent variable.
The whole exercise in the linear regression model was to find the best-fit line that can predict the impact of the independent variable on the dependent (target) variable. Linear regression deals with problems where we need to predict:
- How do sales figures change with the number of working hours?
- What is the impact of age on the performance of a sports professional?
Here we are trying to predict the impact/changes observed on a target variable (sales/performance) based on an input variable (working hours/age). But what about problems where, based on input data, we want to predict the probability of a patient being diabetic or non-diabetic, or the probability that a dog will bark in the middle of the night?
This type of problem, where we need to find the probability of an event occurring or not, or being true/false, is called a classification problem. To solve this kind of problem we often employ a very popular supervised ML model called the Logistic Regression model.
With this information let’s start today’s session on logistic regression where we will cover
- What Is Logistic Regression?
- How Does It Work?
- Types Of Logistic Regression
- Implementing A Logistic Regression Model Using Python
- Linear Regression Vs Logistic Regression
What Is Logistic Regression?
As per wiki,
In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.
Remember, there are some cases where the dependent variable can have more than two outcomes, like married/unmarried/divorced; such scenarios are handled by multinomial logistic regression, though it works in the same manner to predict the outcome.
Some Familiar Examples Of Logistic Regression:
Some prominent examples:
- Email Spam Filter: Spam /No Spam
- Fraud Detection: Transaction is fraudulent, Yes/No
- Tumor: Benign/Malignant
How Does Logistic Regression Work?
Logistic regression mainly handles problems that require a probability estimate as output, mapped to a class label of 0 or 1. It is an extremely efficient mechanism for calculating such probabilities. So you must be curious to understand how it always comes out with a value between 0 and 1. To understand more, let’s try to decode some of the math behind logistic regression.
Logistic Model: Sigmoid Function
Let us try to understand logistic regression by understanding the logistic model. As in linear regression, let’s represent our hypothesis (the prediction of the dependent variable) in classification. In classification, our hypothesis, which tries to predict a binary outcome of either 0 or 1, will look like:
hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
Here g(z) = 1 / (1 + e^(−z)) is called the logistic function or the sigmoid function:
g(z) is a representation of the logistic function, which we also call the sigmoid function. From the above visual representation of the sigmoid function, we can easily see how this curve describes many real-world situations, like population growth: in the initial stages it shows exponential growth, but after some time, due to the competition for certain resources (a bottleneck), the growth rate decreases until it reaches a plateau and there is no further growth.
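The sigmoid curve is easy to reproduce numerically. A minimal sketch in Python (using NumPy, which we will also use later in this article):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: g(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# The output is always squashed into (0, 1), which is why it can be
# read as a probability
print(sigmoid(0.0))    # 0.5, the midpoint of the curve
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

Feeding any real-valued z (however large or small) into g(z) always yields a value strictly between 0 and 1.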
The question here is: how does this logit (sigmoid) function help us determine the probability of classifying data into different classes? Let’s try to understand how our logit function is calculated, which will give us some clarity.
Maths Behind Logistic Function:
Step 1: Classifying inputs to be in class zero or one.
First, we need to compute the probability that an observation belongs to class 1 (we can also call it the positive class) using the Logistic Response Function. In this case, our z parameter is as seen in the logit function given below.
- The coefficients beta 0, beta 1, …, beta k, as shown in the fig above, are selected to maximize the likelihood of predicting a high probability for observations belonging to class 1,
- and predicting a low probability for observations actually belonging to class 0. The fig above depicts what we are trying to do with the coefficients.
Log Odds (Logit Function):
The above explanation can also be understood in terms of logarithmic odds. One way of understanding the probabilities used to classify elements into classes (1 or 0) is via the odds, defined as odds = P / (1 − P):
The log of these odds, whose equation resembles the linear regression equation, is called the logit.
So, a logit is the log of odds, and the odds are a function of P, the probability of belonging to class 1. In logistic regression, we find
logit(P) = ln(P / (1 − P)) = a + bX
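These definitions are easy to check numerically. A quick sketch (np.log is the natural log):

```python
import numpy as np

def odds(p):
    # Odds of an event with probability p
    return p / (1.0 - p)

def logit(p):
    # Logit: the natural log of the odds
    return np.log(odds(p))

print(odds(0.8))   # ~4: a probability of 0.8 means odds of 4 to 1
print(logit(0.5))  # 0: odds of 1 to 1 sit at the sigmoid's midpoint
```

Note that while P lives in (0, 1), the logit ranges over the whole real line, which is exactly the range of the linear expression a + bX.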
Step 2: Defining boundary values for the odds
We now define a threshold boundary in order to clearly classify each given input into one of the classes.
We can choose a threshold value as per the business problem we are trying to solve; generally it is around 0.5. So if a probability value comes out to be > 0.5, we classify that observation into class 1, and the rest into class 0. The choice of the threshold value is generally based on the two error types: false positives and false negatives.
A false positive error is made when the model predicts class 1 but the observation actually belongs to class 0. A false negative error is made when the model predicts class 0 but the observation actually belongs to class 1. A perfect model would classify all classes correctly: all 1’s (or trues) as 1’s, and all 0’s (or falses) as 0’s, so we would have FN = FP = 0.
Threshold Values Impacts:
1. Higher threshold value
Suppose we require P(y=1) > 0.7 to classify as 1. The model is being more restrictive when classifying 1’s, and therefore more false negative errors will be made.
2. Lower threshold value:
Suppose we require P(y=1) > 0.3 to classify as 1. The model is now being less strict and we are classifying more examples as class 1; therefore, we are making more false positive errors.
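This trade-off can be seen on a toy example (the probabilities and labels below are made up purely for illustration):

```python
import numpy as np

# Hypothetical predicted probabilities and true labels for 8 patients
probs  = np.array([0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.05])
actual = np.array([1, 1, 1, 0, 1, 0, 0, 0])

# Raising the threshold trades false positives for false negatives
for threshold in (0.3, 0.5, 0.7):
    predicted = (probs > threshold).astype(int)
    fp = int(np.sum((predicted == 1) & (actual == 0)))  # false positives
    fn = int(np.sum((predicted == 0) & (actual == 1)))  # false negatives
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

On this toy data, the threshold of 0.3 gives FP=2, FN=0; 0.5 gives FP=1, FN=1; 0.7 gives FP=0, FN=2, illustrating how the two error types move in opposite directions.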
Confusion Matrix: A Way To Choose An Effective Threshold Value:
A confusion matrix, also known as an error matrix, summarizes model performance on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. These counts lie at the core of the confusion matrix.
The confusion matrix shows the ways in which your classification model is confused when it makes predictions on observations; it helps us measure the type of error our model makes while classifying observations into different classes.
Key Parts Of Confusion Matrix:
- True Positive (TP): This refers to the cases in which we predicted “YES” and our prediction was actually TRUE
(Eg. Patient is actually having diabetes and you predicted it to be True)
- True Negative (TN): This refers to the cases in which we predicted “NO” and our prediction was actually TRUE
(Eg. Patient is not having diabetes and you predicted the same. )
- False Positive (FP): This refers to the cases in which we predicted “YES”, but our prediction turned out FALSE
(Patient was not having diabetes, but our model predicted that s/he is diabetic)
- False Negative (FN): This refers to the cases in which we predicted “NO” but our prediction turned out FALSE
(Patient was actually having diabetes, but our model predicted otherwise)
Key Learning Metrics From Confusion Matrix:
The confusion matrix helps us compute the following metrics, which measure logistic model performance.
Overall, how often is the classifier correct?
Accuracy = (TP+TN) / total no. of classified items = (TP+TN) / (TP+TN+FP+FN)
When it predicts yes, how often is it correct?
Precision is usually used when the goal is to limit the number of false positives (FP). For example, with a spam filtering algorithm, our aim is to minimize the number of real emails that are classified as spam.
Precision = TP/ (TP+FP)
When it is actually a positive result, how often does it predict correctly?
It is calculated as,
Recall = TP/ (TP+FN), also known as sensitivity.
The F1 score is the harmonic mean of precision and recall:
It is calculated as,
f1-score = 2*((precision*recall)/(precision+recall))
So when you need to take both precision and recall into account, the F1 score is a useful metric. If you try to optimize only recall, your algorithm will predict most examples to belong to the positive class, but that will result in many false positives and, hence, low precision. If you try to optimize only precision, your model will predict very few examples as positive (the ones with the highest probability), but recall will be very low. So it is insightful to balance the two and consider both in the result.
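The formulas above are easy to check with made-up counts (the numbers below are illustrative, not from our dataset):

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 90, 70, 10, 30

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)

print(accuracy)   # 0.8
print(precision)  # 0.9
print(recall)     # 0.75
print(f1)         # ~0.818, between recall and precision
```

Notice that the harmonic mean sits closer to the smaller of the two values, so a model cannot get a high F1 score by being good at only one of precision or recall.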
AUC Area Under The Curve:
One more useful metric to evaluate and compare predictive models is the ROC curve.
In statistics, a Receiver Operating Characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings.
ROC curves are a nice way to see how any predictive model can distinguish between true positives and negatives. The ROC curve plots out the sensitivity and specificity for every possible decision rule cutoff between 0 and 1 for a model.
Specificity or True Negative Rate = TN/(TN+FP)
Sensitivity or True Positive Rate= TP/(TP+FN)
So the FPR, or False Positive Rate, equals 1 − Specificity.
The intuition behind the ROC curve is that a model which predicts at chance will have an ROC curve that looks like the diagonal green line (as shown above in the fig). That is a non-discriminating model. The further the curve is from the diagonal line, the better the model is at discriminating between positives and negatives in general.
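A convenient way to see what the area under this curve means: AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative. A sketch with made-up scores:

```python
import numpy as np

# Hypothetical model scores: positives tend to score higher than negatives
pos_scores = np.array([0.9, 0.8, 0.7, 0.55])
neg_scores = np.array([0.6, 0.4, 0.3, 0.2])

# Count pairs where the positive outranks the negative (ties count half)
wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
ties = (pos_scores[:, None] == neg_scores[None, :]).sum()
auc = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))
print(auc)  # 0.9375, well above the 0.5 of a chance-level model
```

Here 15 of the 16 positive/negative pairs are ranked correctly, so AUC = 15/16 = 0.9375; a chance-level model would score 0.5.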
Types Of Logistic Regression:
- Binary logistic regression: It has only two possible outcomes. Example: yes or no.
- Multinomial logistic regression: It has three or more nominal categories. Example: cat, dog, elephant.
- Ordinal logistic regression: It has three or more ordinal categories, ordinal meaning that the categories are ordered. Example: user ratings (1-5).
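In the multinomial case, the sigmoid generalizes to the softmax function, which turns several linear scores into class probabilities that sum to one. A minimal sketch (the class scores below are made up):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical linear scores for three classes: cat, dog, elephant
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # each in (0, 1); largest for the highest score
print(probs.sum())  # ~1.0
```

With two classes, the softmax reduces to the familiar sigmoid, which is why binary logistic regression is just the special case.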
As we have understood some important concepts attached to logistic regression, it’s time to get some practical understanding with a simple example:
Implementing Logistic Regression :
We are going to cover this model-building exercise in the following steps:
- Reading Data
- Analyzing Data(Basic EDA/Descriptive analysis)
- Train and Test(Break the sample data into two sets)
- Accuracy Report(Measuring the model performance using confusion matrix we discussed above )
Main Objective: To predict diabetes using a Logistic Regression Classifier.
1. Loading Data:
We will make use of the Pima Indians Diabetes dataset sourced from Kaggle, shared by UCI ML. Please download the data from the following link:
Write/copy the code given below and run it in your jupyter notebook (make sure you have the Anaconda distribution installed on your system). When you run this code snippet, you will see the output as shown in fig 1.0.
# Importing the required packages to load and analyse the
# pima-indians-diabetes.csv data set
import numpy as np
import pandas as pd

# load dataset
pima_df = pd.read_csv("pima-indians-diabetes.csv")
Exploratory Data Analysis:
Let’s study the given data set to find
- Non-Numeric values
- Missing/Null value
Non-Numeric & Null Value Analysis:
Write the following code piece and run it:
# Let us check whether any of the columns has any value other than
# numeric, i.e. the data is not corrupted, such as a "?" instead of
# a number. Also find whether there are any columns with null/missing values
print(pima_df[~pima_df.applymap(np.isreal).all(1)])
You will find that there are no non-numeric attributes, as the dataframe returned has an empty index, i.e. no offending rows for any column.
Let’s perform some descriptive analysis to find
- Mean & Median
- Correlation Using Pairplot.
We can analyze each column using the pandas describe() method to get a statistical summary of all the attributes. This analysis helps us find which columns are highly skewed, how the tails look, and what the mean, median, and quartile values of each column are.
Write/copy down the following code in your notebook and compile the same:
# Let's do some descriptive analysis using seaborn
import seaborn as sns
import numpy as np

ax = sns.distplot(pima_df['test'])
Quick Observation :
- Data for almost all the attributes are skewed, especially for the variable “test”. The mean for test is 80 (rounded) while the median is 30.5, which clearly indicates an extremely long tail on the right; this can be seen in the seaborn distplot plotted above.
- Attributes like plas, pres, skin, and mass look somewhat normally distributed.
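The mean-versus-median check behind the first observation is easy to reproduce with describe(). A toy stand-in for the 'test' column (these values are made up; the real figures require the dataset):

```python
import pandas as pd

# A small made-up series with a long right tail, mimicking 'test'
toy = pd.Series([0, 0, 15, 30, 31, 100, 250, 300], name="test")

summary = toy.describe()
print(summary["mean"])  # 90.75
print(summary["50%"])   # 30.5: mean far above median => right-skewed
```

Whenever the mean sits far above the median like this, a few large values are pulling it up, which is exactly the long right tail seen in the distplot.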
Let’s understand more about all the dataframe attributes in detail using pairplot visualization.
# Pairplot data visualization; type this code and see the output
sns.pairplot(pima_df, diag_kind='kde')
- Some of the attributes, preg, test, pedi, and age, look like they may have an exponential distribution.
- Age probably should have a normal distribution, but constraints on the data collection may have led to the skewed distribution.
- There is no obvious relationship between age and onset of diabetes.
- There is no obvious relationship between pedi function and onset of diabetes.
- The scatter plots for all attributes clearly show that there is hardly any relation between them, as a mostly cloud-like distribution is observed.
Let us look at the target column ‘class’ to understand how the data are distributed amongst the various values.
# write the given code
print("group by class:", pima_df.groupby(["class"]).count())
- Most of the records are non-diabetic. The ratio is almost 2:1 in favor of class 0 (non-diabetic): there are 500 records for the non-diabetic class and 268 for the diabetic class, which will make our model more biased towards predicting class 0 better than class 1 (diabetic).
- Hence it is recommended to collect some more samples, weighting both classes sufficiently, to make our model better performing and more effective. For the time being, let’s proceed to build our logistic model and see how it scores against the given dataframe.
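Until more data is available, one common mitigation is to weight classes inversely to their frequency. The formula below is the one sklearn applies when `class_weight='balanced'` is passed to LogisticRegression; here it is computed by hand for our 500/268 split:

```python
# Records per class in the Pima dataset: 500 non-diabetic, 268 diabetic
counts = {0: 500, 1: 268}
n_samples = sum(counts.values())  # 768
n_classes = len(counts)           # 2

# weight = n_samples / (n_classes * count): the rarer class gets more weight
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)  # class 1 (diabetic) weighted ~1.43 vs ~0.77 for class 0
```

This makes misclassifying a diabetic record roughly twice as costly during training, partially compensating for the imbalance, though it is no substitute for collecting more samples.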
Logistic Model Using SkLearn & Python:
Import Sklearn Packages:
Import the LogisticRegression model & other packages required, from the sklearn python package as shown below:
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# To split our data set into training and test data
from sklearn.model_selection import train_test_split

# To calculate accuracy measures and the confusion matrix
from sklearn import metrics
Split Data Into Training & Test Data:
# convert the dataframe to a numpy array
array = pima_df.values

# select all rows and the first 8 columns, which are the independent attributes
X = array[:, 0:8]

# select all rows and the 8th column, which is the target column:
# "Yes"/"No" for diabetes (dependent variable)
y = array[:, 8]

test_size = 0.30  # taking a 70:30 training and test split
seed = 1  # random number seeding for repeatability of the code

# Splitting data into train and test sets, where 70% will be used for
# training and the remaining 30% for testing our model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
Let’s Build Our Model :
# Use the LogisticRegression() class imported from sklearn
model = LogisticRegression()

# Let's pass in our training data sets, X_train and y_train.
# The fit method below does all the heavy lifting to come up with the
# sigmoid function we discussed earlier.
# So here we get our optimal surface, which will be our model
model.fit(X_train, y_train)
Let’s See How Our Model Labels X_test Data To Make A Classification:
# Let's pass the X_test data to our model and see how it predicts
# labels for the unseen test observations as shown below:
y_predict = model.predict(X_test)
print("Y predict/hat ", y_predict)

# run the code and you will see the output as given below
You can see that, using the model.predict(X_test) function, our model classified each test observation as 0/1 as its prediction.
Time To Measure How Model Has Performed(Scored)
Before that, let’s find out the coefficient values of the plane (surface) our model has found as the best-fit surface, using the code given below:
# coefficients can be calculated as shown below, making use of the
# model.coef_ attribute
column_label = list(pima_df.columns[0:8])  # to label all the coefficients
model_Coeff = pd.DataFrame(model.coef_, columns=column_label)
model_Coeff['intercept'] = model.intercept_
print("Coefficient Values Of The Surface Are: ", model_Coeff)
When you compile this code you will get the output shown below. This is a kind of linear model with the given coefficients and an intercept of -5.058877. These values are nothing but the z term that gets fed into our sigmoid function, g(z) = 1/(1 + e^(−z)).
Let’s see how our best-fit model scores against the unseen test data using the underlying logistic (sigmoid) function we discussed above.
# Pass the test data and see how our best-fit model scores against them
logmodel_score = model.score(X_test, y_test)
print("This is how our Model Scored:\n\n", logmodel_score)
The model score comes out to be 0.774, which in percentage terms is 77.4%. This is not quite up to the mark. Also, it is imperative to point out here that, as we discussed earlier, the diabetic class was under-represented compared to the non-diabetic class in the sample data, so we should not rely on this score alone and should measure further using the confusion matrix’s class-level metrics (recall, precision, etc.).
Let’s Measure Model Performance Using The Confusion Matrix:
# Note that in the confusion matrix,
# the first argument is the true values and
# the second argument is the predicted values;
# this produces a 2x2 numpy array (matrix)
print(metrics.confusion_matrix(y_test, y_predict))

# Let's run this and see the outcome below:
The metrics.confusion_matrix method comes up with the square matrix shown above, where rows correspond to the actual values (y_test) and columns to the predicted values (y_predict). Our confusion matrix comes up with the following result, which goes on to show that our model:
- Correctly predicted 47 patients to be diabetic (true positives) and 132 to be non-diabetic (true negatives)
- Incorrectly predicted 14 patients to be diabetic (false positives) and 38 to be non-diabetic (false negatives)
Let’s Calculate Recall Value: A Class Level Metric To Measure Model Performance:
Recall (for non-diabetic) = TP/(TP+FN)
Here, treating non-diabetic as the positive class, TP = 132 and FN = 14,
Recall = 132/(132+14) = 132/146 = 0.90 = 90%
Recall (for diabetic) = TP/(TP+FN)
TP = 47, FN = 38
Recall (for diabetic) = 47/(47+38) = 47/85 = 0.55 = 55%
This model is performing poorly for the diabetic class, which is quite expected because of the lack of sample data available for the diabetic class, as we discussed earlier.
- Precision (for non-diabetic) = TP/(TP+FP) = 132/(132+38) = 132/170 = 0.78 = 78%
- Precision (for diabetic) = TP/(TP+FP) = 47/(47+14) = 47/61 = 0.77 = 77%
These figures are low, especially given the nature of the problem (here, the healthcare industry) we are trying to solve, where an accuracy of more than 95% is expected.
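All of the per-class figures above can be recomputed mechanically from the confusion matrix. A quick sketch using the matrix our model produced:

```python
# Confusion matrix from above: rows are actual 0/1, columns predicted 0/1
matrix = [[132, 14],
          [38, 47]]

for cls in (0, 1):
    tp = matrix[cls][cls]                      # diagonal entry
    fn = sum(matrix[cls]) - tp                 # rest of the row
    fp = sum(row[cls] for row in matrix) - tp  # rest of the column
    print(f"class {cls}: recall={tp / (tp + fn):.2f}, "
          f"precision={tp / (tp + fp):.2f}")
```

This reproduces the 90%/55% recall and 78%/77% precision figures without redoing the arithmetic by hand; sklearn's metrics.classification_report prints the same per-class table directly from y_test and y_predict.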
In the next article:
- We will look into the Naive Bayes classifier model, which is a kind of probabilistic model based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.
- We will implement the Naive Bayes classifier model using Python.
I would like to end this piece on logistic regression with some food for thought.
Never trust what you know instead question and find the answer for yourself.