Supervised Machine Learning Using Linear Regression

Data science gives you the power to analyze every bit of data at your disposal and to make smart business decisions, which is why it is becoming a must-have capability to understand and implement in your organization. It is important that your business decisions are based on data analysis rather than on intuition.

As a data science learner & practitioner, I very often feel:

“The data in your repository is a gold mine that needs to be harnessed with the intent of serving humanity at large, since people are the key source of that data.”

Data has a story to tell. As a data engineer and a business leader, it’s your primary responsibility to treat it well, process it with an appropriate ML model, and build a solution that is relevant to both current and future user needs. With this intent, let’s begin our journey of understanding supervised ML using the Linear Regression model.

Agenda Of Today’s Article:

1. What Is Supervised Machine Learning?
2. Types Of Supervised Machine Learning
3. What Is Regression & Its Types?
4. Understanding Linear Regression With An Example
5. Hands-On Labs Exercise On Linear Regression Using Python & Jupyter

1. What Is Supervised Learning?

Supervised Learning:

In supervised learning, we are given a labeled data set (labeled training data) in which the desired outcome is already known for every example, and each input/output pair in the training data is related in some way.

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)

The intent is to train the function to such an extent that whenever you have new input data (x), you can predict the output variable (Y) for it.

The training happens under the supervision of a teacher who already knows the correct answers: the algorithm iteratively makes predictions on the training data and is corrected by the supervisor. When the learning algorithm reaches an acceptable level of performance, we stop the learning process.
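To make the idea of learning Y = f(X) from labeled data concrete, here is a minimal sketch. The "hours studied" and "exam score" numbers are made up purely for illustration, and a straight line fitted with NumPy's `polyfit` is just one possible choice of f.

```python
import numpy as np

# Hypothetical labeled training data (illustrative values only):
# input x = hours studied, label y = exam score
X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# "Training": learn a straight-line mapping from the labeled pairs.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(X_train, y_train, deg=1)

def f(x_new):
    """Learned mapping: predict the label for a new, unseen input."""
    return intercept + slope * x_new

# "Prediction": apply the learned function to an input not in the training set
print(f(6.0))
```

The labels are what make this supervised: the fit is driven entirely by the known (x, y) pairs.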

Types Of Supervised ML:

The most fundamental way to categorize a supervised learning method is by the type of problem it is trying to solve. At a high level, this comes down to the kind of business problem you want to solve using supervised machine learning algorithms.

Within supervised machine learning, we categorize problems as follows:

1. Regression
2. Classification

1. Regression

Regression problems are problems where we try to make a prediction on a continuous scale. Examples include predicting the stock price of a company or predicting tomorrow's temperature based on historical data. Here the stock price and the temperature are continuous variables. Similarly, we might try to predict the change in a sales value based on given input variables such as man-hours used.

Regression is a method of modeling a target value based on independent predictors. It is mostly used for forecasting and for studying the relationship between variables. Regression techniques mostly differ in the number of independent variables and the type of relationship between the independent and dependent variables.

Regression Types :

• Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Decision Tree Regression
• Random Forest Regression

We will cover only linear regression today and the rest later.

What Is Linear Regression?

The name is made up of two words, linear & regression. Let’s understand both before we get into the definition of linear regression.

Linear: The word linear comes from the Latin word linearis, which means pertaining to or resembling a line.

Regression: a statistical technique for estimating the relationships among dependent & independent variables.

Let’s combine them and define:

Linear Regression:

It is a statistical approach for modeling the relationship between a dependent variable and one or more explanatory variables (or independent variables) by finding a best-fit straight line (a linear equation, estimated using the least squares approach). In its simplest form it is represented as:

Simple linear regression,

y = β_0 + β_1 X

y = dependent variable,

X = explanatory variable,

β_0 = y-intercept (constant term),

β_1 = slope coefficient for the explanatory variable.

We use linear regression to find the relationship between the dependent & independent variables, which also helps identify the best attributes (input variables) to use when building a model for regression-type problems.

Linear Regression is further classified as

• Simple linear regression: It has only one explanatory variable.
• Multiple linear regression: It has more than one explanatory variable. A single dependent variable is still predicted; predicting several correlated dependent variables at once is a different technique, called multivariate regression.

Multiple linear regression fits a linear relationship between multiple inputs and one output:

The formula for multiple linear regression is:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_p x_ip + ϵ_i

where, for i = 1, …, n observations:

y_i = dependent variable,

x_i1, …, x_ip = explanatory variables,

β_0 = y-intercept (constant term),

β_1, …, β_p = slope coefficients for each explanatory variable,

ϵ_i = the model’s error term (also known as the residuals)

In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that involves more than one explanatory variable.
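As a rough illustration of the multiple linear regression formula, the sketch below estimates β_0, β_1 and β_2 with NumPy's least-squares solver. The data is synthetic and noise-free (ϵ = 0), and the "true" coefficients 1, 2, 3 are made up for the example, so the solver should simply recover them.

```python
import numpy as np

# Synthetic data: 50 observations, p = 2 explanatory variables (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Assumed "true" coefficients for this sketch: beta_0 = 1, beta_1 = 2, beta_2 = 3
true_beta = np.array([1.0, 2.0, 3.0])
y = true_beta[0] + X @ true_beta[1:]  # no error term, to keep the example exact

# Add a column of ones so the intercept beta_0 is estimated with the slopes
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize the sum of squared residuals ||A @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("estimated [beta_0, beta_1, beta_2]:", beta_hat)
```

With real, noisy data the estimates would only approximate the underlying coefficients.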

In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data.

Linear predictor functions:

In statistics and in machine learning, a linear predictor function is a linear function (linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent variable.

You will often come across the terms dependent & independent variables in our discussion. Let’s understand them first, so that the rest of the discussion makes more sense.

Dependent Variables:

In mathematical /statistical modeling the values of dependent variables depend on the values of independent variables.

The dependent variables represent the output or outcome whose variation is being studied. So in simplified terms

Whenever you try to predict a change in an output variable based on a given input variable, that output variable is known as the dependent variable. We also call it the target variable when we analyze it using a linear regression model.

Independent variables:

Also known in a statistical context as regressors, these represent inputs or causes, that is, potential reasons for variation. The input variable that is plugged into the line equation to predict the possible outcome is known as the independent variable.

Example: Simple linear regression line (works for two variables)

y = mX + C

Here y = dependent variable (target variable),

X = independent (input) variable,

m = slope of the line and C = y-intercept.

By convention, the independent variable is often written as a capital letter (X) and the dependent variable as a small letter (y), though this is not enforced.

Key Concepts Of Linear Regression:

You need to understand a few important concepts to make better sense of the linear regression model, which we will study with an example going forward. So let’s cover them quickly.

Prerequisites:

To start with linear regression, you should be aware of a few basic concepts of statistics:

• Correlation (r): Explains the relationship between two variables, possible values -1 to +1
• Standard Deviation (σ) : Measure of spread in your data (Square root of Variance)
• Normal distribution: Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, normal distribution will appear as a bell curve.
• Residual (error term): Actual value(Which we have ) Minus Predicted value(Which came using linear regression )
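The prerequisites above can each be computed in a line or two of NumPy. This is only a quick sketch; the small x and y arrays are toy values (the same ones used in the hands-on part of this article), and `np.polyfit` is used here merely as a convenient way to get a fitted line for the residuals.

```python
import numpy as np

# Toy data (same values as in the hands-on section of this article)
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

# Correlation (r): always between -1 and +1
r = np.corrcoef(x, y)[0, 1]

# Standard deviation: square root of the variance
# (np.std uses the population formula by default; pass ddof=1 for the sample version)
sigma_y = np.std(y)

# Residuals: actual value minus predicted value from a fitted straight line
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

print("correlation r:", r)
print("standard deviation of y:", sigma_y)
print("residuals:", residuals)
```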

To understand variance, standard deviation, and normal distribution, please refer to my article below: https://www.mlanalytics.in/descriptive-statistics-fundamentals-for-data-science-aspirants/

Key Assumptions In Linear Regression Model:

When building a linear regression model, we need to take care of the following assumptions in order to build an effective model that works well.

• The dependent variable is continuous.
• There is a linear relationship between the dependent variable and the independent variables.
• There is no multicollinearity (no strong relationship among the independent variables).
• Residuals should follow a normal distribution.
• Residuals should have constant variance (homoscedasticity).
• Residuals should be independently distributed (no autocorrelation).

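To check relationships:

Between the variables you can,

1. Perform bivariate analysis (for example, scatter plots and correlations) to check the relationship between the dependent variable and each independent variable.
2. Calculate the Variance Inflation Factor (VIF) to check for multicollinearity among the independent variables: a value close to 1 indicates little multicollinearity, and values up to about 4 are generally treated as acceptable.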
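For the VIF check, a minimal NumPy-only sketch is shown below. It uses the standard definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing the j-th independent variable on all the others. The helper name `vif` and the synthetic data are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (hypothetical helper)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic example: x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vif_values = vif(X)
print("VIF per variable:", vif_values)
```

In this made-up data the first two variables should show a very large VIF (they carry almost the same information), while the third stays close to 1.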

To find whether residuals are normally distributed or not you can,

• Plot a histogram or boxplot of the residuals
• Perform the Kolmogorov-Smirnov (K-S) test

To check Homoscedasticity:

• You can Plot Residuals Vs. Predicted values and there should be no pattern in between them when you visualize them using data visualization tools.
• Perform the Non-Constant Variance Test.
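As a rough numeric stand-in for the residuals vs. predicted values plot, the sketch below compares the spread of the residuals for low and high predicted values. This is only a crude eyeball check under the assumption that a simple split into two halves is informative; it is not a formal non-constant variance test. The toy arrays are the same ones used in the hands-on section.

```python
import numpy as np

# Toy data (same values as in the hands-on section of this article)
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

# Fit a straight line and compute residuals = actual - predicted
b1, b0 = np.polyfit(x, y, deg=1)
y_pred = b0 + b1 * x
residuals = y - y_pred

# Split observations into lower and upper halves by predicted value
order = np.argsort(y_pred)
half = len(order) // 2
low_spread = residuals[order[:half]].std()
high_spread = residuals[order[half:]].std()

# With an intercept in the model, least-squares residuals average to zero
print("mean residual:", residuals.mean())
print("residual spread, low predicted values: ", low_spread)
print("residual spread, high predicted values:", high_spread)
```

If the two spreads differ a lot, that hints at a pattern in the residuals and is a reason to look at the plot and a proper test.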

Linear Regression Learning Model Type:

1. Simple Linear Regression:

In simple linear regression, where we have a single input, we can use statistics to estimate the coefficients. This requires that you calculate statistical properties from the data, such as means, standard deviations, correlations and covariance.

2. Ordinary Least Squares:

When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients. The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals.

3. Gradient Descent:

Gradient descent is an iterative alternative for estimating the coefficients. It works by starting with random values for each coefficient. The sum of the squared errors is calculated over the pairs of input and output values. A learning rate is used as a scale factor and the coefficients are updated in the direction that minimizes the error.

The process is repeated until a minimum sum of squared errors is achieved or no further improvement is possible. Here we select a learning rate (alpha) parameter that determines the size of the improvement step to take on each iteration of the procedure. We will look into it in detail later, as this is out of scope for today’s article.
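The iterative procedure described above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the learning rate (0.01) and the number of iterations (5000) are values assumed for this toy data, and the coefficients start at zero instead of random values so that the run is reproducible.

```python
import numpy as np

# Toy data (same values as in the hands-on section of this article)
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

b0, b1 = 0.0, 0.0   # starting coefficients (zeros here, for reproducibility)
alpha = 0.01        # learning rate: size of each update step (assumed value)

for _ in range(5000):
    error = (b0 + b1 * x) - y            # prediction error for every point
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    # Move each coefficient a small step in the direction that reduces the error
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print("intercept b0:", b0)
print("slope b1:", b1)
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more iterations, which is why alpha has to be chosen with some care.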

4. Regularization:

It is an extension of the linear model where we seek both to minimize the sum of the squared errors of the model on the training data (as in ordinary least squares) and to reduce the complexity of the model (for example, the number of coefficients or their overall size).

Two popular examples of regularization procedures for linear regression are:

Lasso Regression: where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients (called L1 regularization).

Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
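To see what the L2 penalty does, here is a small sketch for the one-variable case. With centered data, minimizing Σ(y - b·x)² + λ·b² has the closed-form solution b = Σxy / (Σx² + λ), so a larger λ shrinks the slope towards zero. The penalty value λ = 10 and the helper name `ridge_slope` are assumptions for illustration; Lasso has no such simple closed form and is not shown.

```python
import numpy as np

# Toy data (same values as in the hands-on section of this article)
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

# Center the data so the intercept drops out and is not penalized
xc = x - x.mean()
yc = y - y.mean()

def ridge_slope(lam):
    """Slope minimizing sum((yc - b*xc)^2) + lam * b^2 (hypothetical helper)."""
    return (xc @ yc) / (xc @ xc + lam)

ols_slope = ridge_slope(0.0)      # lam = 0 is plain ordinary least squares
shrunk_slope = ridge_slope(10.0)  # lam = 10 is an arbitrary illustrative penalty

print("OLS slope:  ", ols_slope)
print("Ridge slope:", shrunk_slope)
```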

Understanding Simple Linear Regression Using jupyter & Python :

We will use a Jupyter notebook and do all the calculations needed to plot the simple regression line, explaining each step along the way.

Let’s Get Started :

We will draw a scatter plot to visualize the given arrays x & y, and then we will look into plotting a regression line.

Execute the code below in your Jupyter notebook (I am assuming that you have already installed Anaconda, which comes pre-loaded with the required Python support & Jupyter IDE).

```python
# Suppose we have the given values x & y and there is a linear
# relationship between them.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# number of observations/points
n = np.size(x)

# Let's plot a scatter plot for the given values
colors = np.random.rand(n)  # one random color value per point
area = 100                  # marker size (in points squared)
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
```

When you run the above code you will see the output as shown below:

Best Fit Line :

Now we need to find the line that best fits the above scatter plot so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset). This line is called the regression line.

The equation of the regression line is represented as:

y_i = b0 + b1*x_i + ε_i

Here,

• b0 + b1*x_i is the predicted response value for the ith observation.
• b0 and b1 are the regression coefficients and represent the y-intercept and slope of the regression line respectively.
• ε_i is the residual error.

To build our simple linear regression model, we need to learn or estimate the values of regression coefficients b0 and b1. These coefficients will be used to build the model to predict responses.

We will make use of Least Squares technique to find the best fit line.

Least squares is a statistical method used to determine the best-fit line, or regression line, by minimizing the sum of squared residuals. The “square” here refers to squaring the vertical distance between a data point and the regression line. The line with the minimum sum of squares is the best-fit regression line.

Step 2: Calculating slope & intercept

Let’s do the required calculations in a Python notebook to find b0 and b1. Before that, it is important to understand the formulas we will compute in Python:

```
# b0 is the intercept
# b1 is the slope
b0 = [(Σy)(Σx²) - (Σx)(Σxy)] / [n(Σx²) - (Σx)²]
b1 = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]
```

Execute the code below in your Jupyter notebook to continue:

```python
# Step 2: calculating slope & intercept
# b0 = [(Σy)(Σx²) - (Σx)(Σxy)] / [n(Σx²) - (Σx)²]
# b1 = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]

# mean of x and y vector
m_x, m_y = np.mean(x), np.mean(y)

# calculating cross-deviation and deviation about x
SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x

# calculating regression coefficients
# (algebraically equivalent to the summation formulas above)
b1 = SS_xy / SS_xx
b0 = m_y - b1*m_x

print("Coefficient b1 is: ", b1)
print("Coefficient b0 is: ", b0)
```

Run this code and you will see the output as given below:

So we have the required coefficients: b0 ≈ 1.24 and b1 ≈ 1.17.
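The summation formulas from Step 2 can also be evaluated directly, as a cross-check against the mean-based calculation. The sketch below is self-contained and uses the same x and y arrays; the variable names for the sums are just illustrative.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
n = np.size(x)

# The four sums that appear in the formulas
sum_x, sum_y = np.sum(x), np.sum(y)
sum_xx, sum_xy = np.sum(x * x), np.sum(x * y)

# Both coefficients share the same denominator
denominator = n * sum_xx - sum_x ** 2

b0 = (sum_y * sum_xx - sum_x * sum_xy) / denominator  # intercept
b1 = (n * sum_xy - sum_x * sum_y) / denominator       # slope

print("Coefficient b0 is: ", b0)
print("Coefficient b1 is: ", b1)
```

Both routes give the same intercept and slope, since the mean-based expressions are just a rearrangement of these sums.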

Step 3: Plotting the line of regression:

```python
# Step 3: Let's plot the scatter plot along with the predicted y values
# based on our slope & intercept

# plotting the actual points as a scatter plot
plt.scatter(x, y, color="m", marker="o", s=100)

# predicted response vector
y_pred = b0 + b1*x

# plotting the regression line
plt.plot(x, y_pred, color="g")

# putting labels
plt.xlabel('x')
plt.ylabel('y')

# show plot
plt.show()
```

Run the above code and you will see the output as given below in fig:4,

Step 4: Evaluating the model: Using R-Squared

Once we have the simple linear regression line (model), we need to evaluate it to measure its fitness. We will evaluate the overall fit of a linear model using the R-squared value.

R-Squared:

• R-squared is the proportion of variance explained
• It is the proportion of variance in the observed data that is explained by the model or the reduction in error over the null model
• The null model just predicts the mean of the observed response, and thus it has an intercept and no slope
• R-squared is between 0 and 1
• Higher values are better because it means that more variance is explained by the model.

Calculating R-Squared Using Python:

Horizontal Y.mean() Line

Next, we will place another line on our data. This is a key step in calculating our r-squared, as you will see in a minute. Write the code below and run it:

```python
# plot a horizontal line along the mean of y
line2 = np.full(n, m_y)
plt.scatter(x, y)
plt.plot(x, line2, c='r')
plt.show()
```

Output is given below in fig:5

Write the code below and run it to get the r-squared value:

```python
# sum of squared differences between the regression line and the actual values
differences_line1 = y_pred - y
line1sum = 0
for i in differences_line1:
    line1sum = line1sum + (i*i)
print(line1sum)

# sum of squared differences between the mean line and the actual values
differences_line2 = line2 - y
line2sum = 0
for i in differences_line2:
    line2sum = line2sum + (i*i)
print(line2sum)

# Sum of squared errors of our linear model: 5.624
# Total sum of squares of the target variable: 118.5
diff = line2sum - line1sum
print(diff)
rsquared = diff / line2sum
print("R-Squared is : ", rsquared)

# Let's verify the r-squared we calculated by using the sklearn "r2_score" function:
from sklearn.metrics import r2_score
r2Score = r2_score(y, y_pred)
print("R-squared using sklearn: ", r2Score)

# Observation
print("\nAs the r-squared value is close to 1, we can say that our linear "
      "regression model, y_pred = b0 + b1*x, is a good fit linear regression line.")
```

R-Squared value comes out to be : 0.95

Measuring R-Squared:

Higher values are better because it means that more variance is explained by the model.

Observation:

In our case the r-squared value is quite high, about 0.95, which is very close to 1. So we can say that our model fits well, as it explains most of the variance in the data.

R-squared Limitations:

You cannot use R-squared to determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

Caution: R-squared does not indicate if a regression model provides an adequate fit to your data. A good model can have a low R2 value. On the other hand, a biased model can have a high R2 value!

There are two other types of R2: adjusted R-squared and predicted R-squared. These two statistics address particular problems with R-squared. They provide extra information by which you can assess your regression model’s goodness-of-fit. We will cover these later.
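As a small preview of adjusted R-squared, the sketch below applies the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The helper name `adjusted_r2` is made up for this example, and the inputs are the rounded values from our simple model (R² ≈ 0.9525, n = 10, p = 1).

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p explanatory variables
    (hypothetical helper; requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from the simple linear regression example in this article
adj = adjusted_r2(0.9525, n=10, p=1)
print("Adjusted R-squared:", adj)
```

The adjusted value is slightly lower than plain R-squared because it penalizes each additional explanatory variable.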

Let’s understand the same example:

Using Sklearn Python Library:

Here we can perform the same calculation to find the simple linear regression model using sklearn in just a few lines:

Here we go,

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# sklearn expects the inputs as a 2-D array: one row per observation
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# create the LinearRegression model and find the best fit on our given data
regression_model = LinearRegression()
regression_model.fit(x, y)  # this will give the best fit line

# Let us explore the coefficients for each of the independent attributes
b1 = regression_model.coef_
b0 = regression_model.intercept_
print("b1 is: {} and b0 is: {}".format(b1, b0))

plt.scatter(x, y, color="m", marker="o", s=100)
plt.plot(x, b1*x + b0)
plt.show()
```

When you run the above code snippet you will get the scatter plot with the line of regression, as shown below:

Let’s Calculate R2 Score:

```python
# sklearn has a function to calculate the R-squared value, as seen below
from sklearn.metrics import r2_score

# y_pred is the value our linear regression model predicts
# for each x on the best fit line
y_pred = regression_model.predict(x)
r2Score = r2_score(y, y_pred)  # here y is our original value
print(r2Score)
```

Output: When you run the above code you will get an R-squared value of about 0.95, which matches the value we calculated by hand earlier.

Summing Up:

We covered the basics of simple linear regression and understood how to find the linear regression model with one predictor variable X. But there is more to linear regression. We will not often be dealing with only one predictor; instead we will have large data sets with multiple independent variables, where you need multiple linear regression & polynomial regression.

What’s Next?

We will cover the following topics in the next part of Linear Regression:

• Multiple linear regression model with one case study
• Polynomial regression model
• Concept of underfitting & overfitting
• Various techniques of error minimization in linear regression with examples
• Linear regression learning models like gradient descent, OLS, and regularization

I would like to close this article with a very informative infographic that summarizes how to evaluate models, which we will cover in detail in the upcoming series.

Thanks for being with me all along, will be back soon, keep loving keep sharing…