Principal Component Analysis

Implementing Principal Component Analysis (PCA) Using Python

In our previous article on Principal Component Analysis, we covered the main idea behind PCA.

https://medium.com/swlh/intuition-behind-principal-component-analysis-you-ever-wanted-to-understand-af1b8c1ea801?source=friends_link&sk=1229e30362ea0a2cb34ee9b69330793d

There we learned:

  • What Is PCA?
  • Why PCA?
  • The intuition behind PCA?
  • Maths Behind PCA
  • Steps To Compute Principal Components and Get The Reduced Dimensions.

As promised in Part 1, it’s time to gain practical knowledge of how PCA is implemented in Python using Pandas and scikit-learn.

Objective:

We will understand how to deal with the curse of dimensionality using PCA, and we will also see how to implement Principal Component Analysis in Python.

Data Set:

We will make use of the vehicle-2.csv data set, sourced from the open UCI repository. The data contains features extracted from the silhouettes of vehicles viewed at different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000, and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van, and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Problem Statement:

The purpose is to classify a given silhouette as one of three types of vehicles, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

How We Will Meet Our Objective?

We will apply the dimensionality reduction technique PCA and train a model using the reduced set of principal components (attributes/dimensions).

Then we will build a Support Vector Classifier on the raw data and also on the PCA components to see how the model performs on the reduced set of dimensions. We will also print the confusion matrix for both scenarios and see how well our model classifies the various vehicle types from the given silhouettes.

Let’s Go Hands-On:

Import Python Libraries :

The most important import is PCA, a class available in the sklearn.decomposition package. It wraps the matrix decomposition routines that do all the linear algebra magic of extracting eigenvalues and eigenvectors for us.

#let us start by importing the relevant libraries
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report,roc_auc_score
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

Loading Data Set:

Using the pandas read_csv method we will read our CSV file. To see the columns and their associated values, call the head() method on the loaded data frame.

vehdf= pd.read_csv("../input/vehicle-2.csv")
vehdf.head(200)

Label Encoding The Class Variables :

from sklearn.preprocessing import LabelEncoder, OneHotEncoder 
le = LabelEncoder()
columns = vehdf.columns
#Let's Label Encode our class variable:
print(columns)
vehdf['class'] = le.fit_transform(vehdf['class'])
vehdf.shape

Output:

Performing EDA:

  • Finding any missing Value
  • Finding outliers
  • Understanding attributes using descriptive statistics
  • Visualizing attribute distribution using univariate and multivariate analysis
  • Finding attribute correlation and analyzing which attribute is more important

Quick Eyeballing :

To see if there are any missing values.

vehdf.info()

Quick Insights:

We can see that :

  • circularity, class, hollow_ratio, max.length_rectangularity, max.length_aspect_ratio, and compactness have no missing values. All the remaining features have some missing values
  • All attributes are of numerical type

Treating The Missing Value:

Let’s find the count of each attribute & treat the missing values.

We will make use of SimpleImputer, which identifies all missing values and replaces them using a median (or mode) strategy.

from sklearn.impute import SimpleImputer
newdf = vehdf.copy()
X = newdf.iloc[:,0:19] #separating the numerical attributes
imputer = SimpleImputer(missing_values=np.nan, strategy='median', verbose=1)
#fill missing values with the column median
transformed_values = imputer.fit_transform(X)
column = X.columns
print(column)
newdf = pd.DataFrame(transformed_values, columns = column )
newdf.describe()
Let’s compare the vehdf and newdf data frames:
print("Original null value count:", vehdf.isnull().sum())
print("\n\nCount after we imputed the NaN values: ", newdf.isnull().sum())

Quick Observation:

If you carefully compare the original data frame vehdf with the new data frame newdf above, you will find that, after imputing with SimpleImputer, the missing NaN values from the original vehdf columns have been replaced using the median strategy.

Understanding each attribute:

Univariate Analysis:

  • Quick descriptive statistics to make some meaningful sense of data
  • Plotting univariate distribution
  • Finding outliers & skewness in data series.
  • Treating outliers

Descriptive statistical summary

The describe() function gives the mean, std, and IQR (interquartile range) values. It excludes character columns and calculates summary statistics only for numeric columns.

newdf.describe().T

Output:

Quick Insights On Descriptive Stats:

  • compactness has nearly identical mean and median values, which signifies that it is roughly normally distributed with no noticeable skewness or outliers
  • circularity also seems to be normally distributed, as its mean and median are similar
  • scatter_ratio seems to have some skewness and outliers
  • scaled_variance and scaled_variance.1 appear to show a similar pattern

Let’s Plot our newdf using seaborn:

plt.style.use('seaborn-whitegrid')
newdf.hist(bins=20, figsize=(60,40), color='lightblue', edgecolor = 'red')
plt.show()

Quick Observation :

  • Most of the data attributes seem to be normally distributed
  • scaled_variance.1, the skewness_about attributes, and scatter_ratio seem to be right-skewed.
  • pr.axis_rectangularity seems to have outliers, as there are gaps visible in its histogram.

Let us use seaborn distplot to analyze the distribution of our columns and see the skewness in attributes

f, ax = plt.subplots(1, 6, figsize=(30,5))
vis1 = sns.distplot(newdf["scaled_variance.1"],bins=10, ax= ax[0])
vis2 = sns.distplot(newdf["scaled_variance"],bins=10, ax=ax[1])
vis3 = sns.distplot(newdf["skewness_about.1"],bins=10, ax= ax[2])
vis4 = sns.distplot(newdf["skewness_about"],bins=10, ax=ax[3])
vis6 = sns.distplot(newdf["scatter_ratio"],bins=10, ax=ax[5])
f.savefig('subplot.png')
skewValue = newdf.skew()
print("skewValue of dataframe attributes: ", skewValue)

Univariate Analysis Using Boxplot:

In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

#Summary view of all attributes first; then we will look at each boxplot individually to trace out outliers
ax = sns.boxplot(data=newdf, orient="h")
plt.figure(figsize= (20,15))
plt.subplot(3,3,1)
sns.boxplot(x= newdf['pr.axis_aspect_ratio'], color='orange')
plt.subplot(3,3,2)
sns.boxplot(x= newdf.skewness_about, color='purple')
plt.subplot(3,3,3)
sns.boxplot(x= newdf.scaled_variance, color='brown')
plt.show()
plt.figure(figsize= (20,15))
plt.subplot(3,3,1)
sns.boxplot(x= newdf['radius_ratio'], color='red')
plt.subplot(3,3,2)
sns.boxplot(x= newdf['scaled_radius_of_gyration.1'], color='lightblue')
plt.subplot(3,3,3)
sns.boxplot(x= newdf['scaled_variance.1'], color='yellow')
plt.show()
plt.figure(figsize= (20,15))
plt.subplot(3,3,1)
sns.boxplot(x= newdf['max.length_aspect_ratio'], color='green')
plt.subplot(3,3,2)
sns.boxplot(x= newdf['skewness_about.1'], color='grey')
plt.show()

Observation on boxplots:

  • pr.axis_aspect_ratio, skewness_about, max.length_aspect_ratio, skewness_about.1, scaled_radius_of_gyration.1, scaled_variance.1, and radius_ratio are some of the attributes with outliers, which are visible as the individual points plotted beyond the whiskers

Treating Outliers Using IQR: Upper whisker

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.
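
To make the rule concrete before applying it to our data frame, here is a tiny hedged sketch on a made-up series (the numbers are purely illustrative): values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers.

import pandas as pd
s = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])  # toy data; 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(lower, upper)                    # whisker bounds: 6.0 and 22.0
print(s[(s < lower) | (s > upper)])    # only 95 falls outside the bounds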

newdf.shape
(846, 19)

Once we have the IQR scores, it’s time to get hold of the outliers. The code below produces True/False values: False means the value is valid, whereas True indicates the presence of an outlier.

from scipy.stats import iqr
Q1 = newdf.quantile(0.25)
Q3 = newdf.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
cleandf = newdf[~((newdf < (Q1 - 1.5 * IQR)) |(newdf > (Q3 + 1.5 * IQR))).any(axis=1)]
cleandf.shape
(813, 19)

Let’s Plot The Box Plots Once Again To See If The Outliers Are Removed.

plt.figure(figsize= (20,15))
plt.subplot(8,8,1)
sns.boxplot(x= cleandf['pr.axis_aspect_ratio'], color='orange')
plt.subplot(8,8,2)
sns.boxplot(x= cleandf.skewness_about, color='purple')
plt.subplot(8,8,3)
sns.boxplot(x= cleandf.scaled_variance, color='brown')
plt.subplot(8,8,4)
sns.boxplot(x= cleandf['radius_ratio'], color='red')
plt.subplot(8,8,5)
sns.boxplot(x= cleandf['scaled_radius_of_gyration.1'], color='lightblue')
plt.subplot(8,8,6)
sns.boxplot(x= cleandf['scaled_variance.1'], color='yellow')
plt.subplot(8,8,7)
sns.boxplot(x= cleandf['max.length_aspect_ratio'], color='lightblue')
plt.subplot(8,8,8)
sns.boxplot(x= cleandf['skewness_about.1'], color='pink')
plt.show()

Quick Comment:

We can see that the boxplots for all the attributes which had outliers are now clean; the outliers have been treated and removed. Since the number of outliers was small, we opted to remove them. Generally, we avoid this because it can lead to information loss in large data sets with many outliers.

Understanding the relationship between all independent attribute:

We will be using data correlation:

Data Correlation: This is a way to understand the relationship between multiple variables and attributes in your dataset.

Using Correlation, you can get some insights such as:

  • One or multiple attributes depend on another attribute or a cause for another attribute.
  • One or multiple attributes are associated with other attributes.

Spearman and Pearson are two statistical methods to calculate the strength of the correlation between two variables or attributes. Pearson Correlation Coefficient can be used with continuous variables that have a linear relationship.
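
As a small hedged illustration (assuming the imputed newdf frame from earlier), scipy can compute both coefficients directly for a pair of columns; the column names below are ones we have already used in this notebook:

from scipy.stats import pearsonr, spearmanr
r, _ = pearsonr(newdf['scaled_variance'], newdf['scaled_variance.1'])
rho, _ = spearmanr(newdf['scaled_variance'], newdf['scaled_variance.1'])
print("Pearson r:", r)       # linear association
print("Spearman rho:", rho)  # rank (monotonic) association
# pandas can do the same across the whole frame: newdf.corr(method='pearson') or newdf.corr(method='spearman')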

Pearson Correlation Coefficient:

We will use the Pearson Correlation Coefficient to see what all attributes are linearly related and also visualize the same in the seaborn’s scatter plot.

def correlation_heatmap(dataframe, l, w):
    correlation = dataframe.corr()
    plt.figure(figsize=(l, w))
    sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='viridis')
    plt.title('Correlation between different features')
    plt.show()

# Let's drop the class column and look at the correlation matrix & pairplot, as PCA should only be performed on the independent attributes
cleandf= newdf.drop('class', axis=1)
#print("After Dropping: ", cleandf)
correlation_heatmap(cleandf, 30,15)

Quick Insights: From Correlation Heatmap:

Strong/fare Correlation:

  • scaled_variance & scaled_variance.1 seem to be strongly correlated, with a value of 0.98
  • skewness_about.2 and hollow_ratio seem to be strongly correlated, corr coeff: 0.89
  • distance_circularity and radius_ratio seem to have a high positive correlation, with a correlation coefficient of 0.81
  • compactness & circularity, and radius_ratio & pr.axis_aspect_ratio, seem moderately correlated, with a correlation coeff of about 0.67
  • scaled_variance & scaled_radius_of_gyration, and circularity & distance_circularity, also seem to be highly correlated, with a correlation coefficient of about 0.79
  • pr.axis_rectangularity and max.length_rectangularity also seem to be strongly correlated, with a correlation coefficient of 0.81
  • scatter_ratio and elongatedness seem to have a strong negative correlation, with a value of about -0.97
  • elongatedness and pr.axis_rectangularity seem to have a strong negative correlation, with a value of about -0.95

Little To No Correlation:

  • max.length_aspect_ratio & radius_ratio have an average correlation, coeff: 0.5
  • pr.axis_aspect_ratio & max.length_aspect_ratio seem to have very little correlation
  • scaled_radius_of_gyration & scaled_radius_of_gyration.1 seem to be only weakly correlated
  • scaled_radius_of_gyration.1 & skewness_about seem to be only weakly correlated
  • skewness_about & skewness_about.1 appear not to be correlated
  • skewness_about.1 and skewness_about.2 are not correlated.

Let’s visualize the same with a pair plot to see how it looks.

Pair plot Analysis:

sns.pairplot(cleandf, diag_kind="kde")

Output:

Quick Insights:

  • As observed in our correlation heatmap, the pairplot seems to validate the same findings. scaled_variance & scaled_variance.1 have a very strong positive correlation, with a value of 0.98. skewness_about.2 and hollow_ratio also seem to have a strong positive correlation, coeff: 0.89
  • scatter_ratio and elongatedness seem to have a very strong negative correlation. elongatedness and pr.axis_rectangularity also seem to have a strong negative correlation.
  • We found from our pair plot analysis that scaled_variance & scaled_variance.1, and elongatedness & pr.axis_rectangularity, are strongly correlated, so they need to be dropped or treated carefully before we go for model building.

Choosing the right attributes which can be the right choice for model building:

Since our objective is to recognize whether an object is a van, a bus, or a car based on the input features, our main assumption is that there is little or no multicollinearity between the features.

If our dataset has perfectly positively or negatively correlated attributes, as can be observed from our correlation analysis, there is a high chance that the performance of the model will be impacted by a problem called multicollinearity. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.

If two features are highly correlated, there is no point in using both; in that case, we can drop one of them. The seaborn heatmap gives us the correlation matrix, where we can see which features are highly correlated.

From the above correlation matrix, we can see that there are many features that are highly correlated. If we analyze carefully, we will find many features with a correlation above 0.9, so we can decide to get rid of those columns whose correlation is about ±0.9 or above (the sketch after the list below shows one way to find them programmatically). There are 8 such columns:

  • max.length_rectangularity
  • scaled_radius_of_gyration
  • skewness_about.2
  • scatter_ratio
  • elongatedness
  • pr.axis_rectangularity
  • scaled_variance
  • scaled_variance.1
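
Here is the hedged sketch referred to above: a programmatic way to list columns whose absolute pairwise correlation is about 0.9 or more, using the cleandf frame from the heatmap step. The exact threshold and the resulting list are a judgment call, so treat this as illustrative rather than definitive.

corr = cleandf.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
high_corr_cols = [col for col in upper.columns if (upper[col] >= 0.9).any()]
print(high_corr_cols)
# one option would be cleandf.drop(columns=high_corr_cols); instead, we let PCA handle the redundancy below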

Outcome:

  • Also, we observed that more than 50% of our attributes are highly correlated, so what can we best do to deal with this problem of multicollinearity?
  • Well, there are multiple ways to deal with this problem. The easiest way is to delete or eliminate one of the perfectly correlated features: we can pick one of the two highly correlated variables and drop the other. In our case, scaled_variance & scaled_variance.1 have a strong positive correlation, so we can keep one and drop the other, as together they only make our dimensions redundant.
  • Similarly, between elongatedness and pr.axis_rectangularity we can pick one, as they have a very strong negative correlation. This approach can be used to select the features we want to carry forward for model analysis. But there is another, better approach called PCA.

Another method is to use a dimension reduction algorithm such as Principal Component Analysis (PCA). We will go with PCA and analyze the data accordingly going forward:

Principal Component Analysis(PCA):

Basically, PCA is a dimension reduction methodology that aims to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which retain most of the relevant information.

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.

We will perform PCA in the following steps:

  • Split our data into train and test data set
  • Normalize the data using StandardScaler
  • Calculate the covariance matrix.
  • Calculate the eigenvectors and their eigenvalues.
  • Sort the eigenvectors according to their eigenvalues in descending order.
  • Choose the first K eigenvectors (where k is the dimension we’d like to end up with).
  • Build new features with reduced dimensionality.
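
For reference, the PCA class we imported from sklearn.decomposition performs the same sequence of operations under the hood (it uses an SVD, which is mathematically equivalent). A minimal hedged sketch of that pipeline, assuming X holds the independent columns (the variable names here are illustrative):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X_scaled = StandardScaler().fit_transform(X)           # normalize the features
pca_full = PCA(n_components=18).fit(X_scaled)          # decomposition happens under the hood
print(pca_full.explained_variance_ratio_)              # variance explained per component, already sorted
X_pca_8 = PCA(n_components=8).fit_transform(X_scaled)  # keep only the first 8 components

Below, we instead compute each step by hand with NumPy, which makes the underlying linear algebra explicit.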

Approach 1:

Separate The Data Into Independent & Dependent attribute

#now separate the dataframe into dependent and independent variables
#X1= newdf.drop('class',axis=1)
#y1 = newdf['class']
#print("shape of new_vehicle_df_independent_attr::",X.shape)
#print("shape of new_vehicle_df_dependent_attr::",y.shape)
X = newdf.iloc[:,0:18].values
y = newdf.iloc[:,18].values
X

Output:

array([[ 95.,  48.,  83., ...,  16., 187., 197.],
[ 91., 41., 84., ..., 14., 189., 199.],
[104., 50., 106., ..., 9., 188., 196.],
...,
[106., 54., 101., ..., 4., 187., 201.],
[ 86., 36., 78., ..., 25., 190., 195.],
[ 85., 36., 66., ..., 18., 186., 190.]])

Scaling The Independent Data Set:

We transform (center and scale) the entire X (independent variable data) using StandardScaler. This brings all our features onto the same scale, so that no single feature dominates the analysis when we apply PCA.

from sklearn.preprocessing import StandardScaler
# standardize the independent variables
sc = StandardScaler()
X_std = sc.fit_transform(X)
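
As a quick optional sanity check, each standardized column should now have a mean close to 0 and a standard deviation close to 1:

print(np.round(X_std.mean(axis=0), 3))  # all values should be ~0
print(np.round(X_std.std(axis=0), 3))   # all values should be ~1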

Calculating the covariance matrix:

The covariance matrix should be an 18 x 18 matrix.

cov_matrix = np.cov(X_std.T)
print("cov_matrix shape:",cov_matrix.shape)
print("Covariance_matrix",cov_matrix)

Calculating Eigen Vectors & Eigen Values: Using NumPy linear algebra function

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s' % eigenvectors)
print('\nEigen Values \n%s' % eigenvalues)

Sort eigenvalues in descending order:

# Make a set of (eigenvalue, eigenvector) pairs:
eig_pairs = [(eigenvalues[index], eigenvectors[:,index]) for index in range(len(eigenvalues))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest with respect to eigenvalue
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print(eig_pairs)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eigenvalues))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eigenvalues))]
# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)

Calculating variance explained in percentage:

tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)] # an array of variance explained by each
# eigen vector... there will be 18 entries as there are 18 eigen vectors)
cum_var_exp = np.cumsum(var_explained) # an array of cumulative variance. There will be 18 entries with 18 th entry
# cumulative reaching almost 100%

Plotting The Explained Variance and Principal Components:

plt.bar(range(1,19), var_explained, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(1,19),cum_var_exp, where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()

Quick Observation:

- From the plot above we can clearly observe that 8 dimensions are able to explain about 95% of the variance in the data.
- So we will use the first 8 principal components going forward and calculate the reduced dimensions.

Dimensionality Reduction

Now 8 dimensions seem very reasonable. With 8 variables we can explain over 95% of the variation in the original data!

# P_reduce represents reduced mathematical space....
P_reduce = np.array(eigvectors_sorted[0:8])   # keep the top 8 eigenvectors, reducing from 18 to 8 dimensions
X_std_8D = np.dot(X_std,P_reduce.T)   # projecting original data into principal component dimensions
reduced_pca = pd.DataFrame(X_std_8D)  # converting array to dataframe for pairplot
reduced_pca

Output:

          0         1         2         3         4         5         6         7
0  0.334162  0.219026 -1.001584 -0.176612 -0.079301 -0.757447 -0.901124  0.381106
1 -1.591711  0.420603  0.369034 -0.233234 -0.693949 -0.517162  0.378637 -0.247059
2  3.769324 -0.195283 -0.087859 -1.202212 -0.731732  0.705041 -0.034584 -0.482772
3 -1.738598  2.829692 -0.109456 -0.376685  0.362897 -0.484431  0.470753  0.023086
4  0.558103 -4.758422  ...
...
846 rows × 8 columns
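
As a side note, the same 8-dimensional projection can be obtained with the PCA class imported at the top; a minimal hedged sketch of the cross-check (individual columns may differ from X_std_8D only by sign, because each eigenvector is defined only up to a sign flip):

pca8 = PCA(n_components=8)
X_pca_8 = pca8.fit_transform(X_std)
print(X_pca_8.shape)                                # (846, 8), same shape as the manual projection
print(np.round(pca8.explained_variance_ratio_, 3))  # should closely track var_explained[:8] computed above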

Let us check The Pairplot Of Reduced Dimension After PCA:

sns.pairplot(reduced_pca, diag_kind='kde') 
#sns.pairplot(reduced_pca1, diag_kind='kde')

Output:

<seaborn.axisgrid.PairGrid at 0x7f5109c32668>

It is clearly visible from the pair plot analysis above that:

After dimensionality reduction using PCA, our attributes have become uncorrelated with one another: most of the panels show a cloud of data points with no linear relationship.
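
A quick numerical confirmation of the same point, as a small sketch on the frame built above: the correlation matrix of the principal components should be essentially the identity matrix.

print(reduced_pca.corr().round(3))  # ~1 on the diagonal, ~0 everywhere else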

Fitting The Model And Measuring The Score On The Original Data:

Fit SVC Model On Train-test Data:

Let’s build two Support Vector Classifier Model one with 18 original independent variables and the second one with only the 8 new reduced variables constructed using PCA.

#now split the data into 70:30 ratio
#orginal Data
Orig_X_train,Orig_X_test,Orig_y_train,Orig_y_test = train_test_split(X_std,y,test_size=0.30,random_state=1)
#PCA Data
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(reduced_pca,y,test_size=0.30,random_state=1)
#pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(reduced_pca1,y,test_size=0.30,random_state=1)

Let’s train the model on both the original data and the PCA data with the new, reduced dimensions.

Fitting SVC model On Original Data

svc = SVC() #instantiate the object
#fit the model on the original raw data
svc.fit(Orig_X_train,Orig_y_train)

Output:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)

Calculating Y Value:

#predict the y value
Orig_y_predict = svc.predict(Orig_X_test)

Fitting SVC ON PCA Data:

#now fit the model on pca data with new dimension
svc1 = SVC() #instantiate the object
svc1.fit(pca_X_train,pca_y_train)
#predict the y value
pca_y_predict = svc1.predict(pca_X_test)
#display accuracy score of both models
print("Model Score On Original Data ",svc.score(Orig_X_test, Orig_y_test))
print("Model Score On Reduced PCA Dimension ",svc1.score(pca_X_test, pca_y_test))
print("Before PCA On Original 18 Dimension",accuracy_score(Orig_y_test,Orig_y_predict))
print("After PCA(On 8 dimension)",accuracy_score(pca_y_test,pca_y_predict))

Model Scores:

  • Model Score On Original Data 0.952755905511811
  • Model Score On Reduced PCA Dimension 0.9330708661417323
  • Before PCA On Original 18 Dimension 0.952755905511811
  • After PCA(On 8 dimension) 0.9330708661417323

Quick Observation:

  • On the test data set, our Support Vector Classifier trained without PCA has an accuracy score of about 95%.
  • When we applied the SVC model to the PCA components (reduced dimensions), it scored about 93%.
  • Considering that the original dataframe had 18 dimensions and after PCA the dimension was reduced to 8, our model has fared well in terms of accuracy score (a small exploratory sketch of this trade-off follows).
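
As mentioned above, if you want to see how accuracy trades off against the number of components kept, here is a small exploratory sketch along the same lines (reusing the sorted eigenvectors and split settings defined earlier; the chosen values of k are arbitrary):

for k in (2, 4, 6, 8, 10, 12):
    X_k = np.dot(X_std, np.array(eigvectors_sorted[0:k]).T)  # project onto the top-k components
    Xtr, Xte, ytr, yte = train_test_split(X_k, y, test_size=0.30, random_state=1)
    score = SVC().fit(Xtr, ytr).score(Xte, yte)
    print(k, "components -> accuracy", round(score, 3))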

Confusion Matrix:

# Calculate the confusion matrix & plot it to visualize the results
def draw_confmatrix(y_test, yhat, str1, str2, str3, datatype):
    # build the confusion matrix for the three encoded classes and plot it as a heatmap
    cm = confusion_matrix(y_test, yhat, labels=[0, 1, 2])
    print("Confusion Matrix For :", "\n", datatype, cm)
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=[str1, str2, str3], yticklabels=[str1, str2, str3])
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
draw_confmatrix(Orig_y_test, Orig_y_predict,"Van ", "Car ", "Bus", "Original Data Set" )
draw_confmatrix(pca_y_test, pca_y_predict,"Van ", "Car ", "Bus", "For Reduced Dimensions Using PCA ")
#Classification Report Of Model built on Raw Data
print("Classification Report For Raw Data:", "\n", classification_report(Orig_y_test,Orig_y_predict))
#Classification Report Of Model built on Principal Components:
print("Classification Report For PCA:","\n", classification_report(pca_y_test,pca_y_predict))
Confusion Matrix For :
 Original Data Set
[[ 58   0   1]
 [  1 129   3]
 [  6   1  55]]

Confusion Matrix For :
 For Reduced Dimensions Using PCA
[[ 57   2   0]
 [  2 126   5]
 [  1   7  54]]

Classification Report For Raw Data:
              precision    recall  f1-score   support
         0.0       0.89      0.98      0.94        59
         1.0       0.99      0.97      0.98       133
         2.0       0.93      0.89      0.91        62
    accuracy                           0.95       254
   macro avg       0.94      0.95      0.94       254
weighted avg       0.95      0.95      0.95       254

Classification Report For PCA:
              precision    recall  f1-score   support
         0.0       0.95      0.97      0.96        59
         1.0       0.93      0.95      0.94       133
         2.0       0.92      0.87      0.89        62
    accuracy                           0.93       254
   macro avg       0.93      0.93      0.93       254
weighted avg       0.93      0.93      0.93       254

Quick Comments: Confusion Metric Analysis ON Original Data:

Confusion Matrix For : Original Data Set [[ 58 0 1] [ 1 129 3] [ 6 1 55]]:

– Our model on the original data set has correctly classified 58 vans out of 59 actual vans, and has erred in only one case, where it wrongly predicted a van to be a bus.
– In the case of the 133 actual cars, our SVM model has correctly classified 129 cars; it has wrongly classified 3 cars as buses and 1 car as a van.
– In the case of the 62 actual buses, our model has correctly classified 55 buses; it has faltered by wrongly classifying 6 buses as vans and 1 bus as a car.

Confusion Metric Analysis ON Reduced Dimension After PCA :

For Reduced Dimensions Using PCA: [[ 57 2 0] [ 2 126 5] [ 1 7 54]]

– Out of the 59 actual vans, our model has correctly predicted 57 and erred in 2 instances, wrongly classifying those vans as cars.
– Out of the 133 actual cars, our model has correctly classified 126 as cars and faltered in 7 cases, wrongly classifying 5 cars as buses and 2 cars as vans.
– Out of the 62 actual buses, our model has correctly classified 54 as buses. It has faltered in 8 cases, wrongly classifying 7 buses as cars and 1 bus as a van.
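
To read these matrices more quickly, you can normalize each row so the diagonal shows per-class recall directly; a small sketch using the original-data matrix printed above:

cm_orig = np.array([[58, 0, 1], [1, 129, 3], [6, 1, 55]])
recall_per_class = cm_orig.diagonal() / cm_orig.sum(axis=1)
print(np.round(recall_per_class, 2))  # ~[0.98, 0.97, 0.89], matching the recall column of the raw-data report below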

Let’s See The Classification Report Metrics:

Classification Report For Raw Data:

              precision    recall  f1-score   support
         0.0       0.89      0.98      0.94        59
         1.0       0.99      0.97      0.98       133
         2.0       0.93      0.89      0.91        62
   micro avg       0.95      0.95      0.95       254
   macro avg       0.94      0.95      0.94       254
weighted avg       0.95      0.95      0.95       254

Classification Report For PCA:

              precision    recall  f1-score   support
         0.0       0.95      0.97      0.96        59
         1.0       0.93      0.95      0.94       133
         2.0       0.92      0.87      0.89        62
   micro avg       0.93      0.93      0.93       254
   macro avg       0.93      0.93      0.93       254
weighted avg       0.93      0.93      0.93       254

Insights On Classification Reports: On original data:

  • Our model has a 99 % precision score when it comes to classifying cars from the given set of silhouette parameters.
  • It has 89 % precision when it comes to classifying the input as a van, while it has 93 % precision when it comes to predicting data as a bus.
  • In terms of recall score, our model has a recall score of 98 % for van classification, 97 % for car, and 89 % for the bus.
  • Our model has a weighted average of 95 % for all classification metrics.

On Reduced Dimensions After PCA:

  • Our model has the highest precision score of 95 % when it comes to predicting van type, which is better as compared to prediction done on the original data set, which came out with a precision score of 89 % for the van.
  • The recall score is almost neck to neck with what our model scored on the original data set. It showed the highest recall score of 97 % in classifying data as a car.

Summary :

We learned how PCA is performed and also saw that, even though we reduced the dimensions using PCA, our supervised model achieved a score on par with the score achieved on the feature set with all the original dimensions. This demonstrates the value Principal Component Analysis has to offer data scientists as a tool.

Food for thought:

“ When great teamwork happens you end up achieving the impossible. Here PCA as an unsupervised learning tool can greatly team up with our Supervised learning models to solve complex problems with utter ease. ”

Hope you all enjoyed this hands-on Python lab to learn PCA.
