When it comes to understanding data, it is important not only to extract the obvious insights but also to identify hidden patterns that may not be easy to find through exploratory data analysis alone. To make intelligent predictions, identify patterns, and make effective recommendations, our data needs to be segregated into meaningful clusters. The branch of machine learning in which we do not rely on a labeled data set with a predefined target variable, but instead cluster the data into groups and try to make predictions about their behavior, is called unsupervised learning.

Unsupervised learning works alongside supervised machine learning to make our models robust and reliable. Today we will look into unsupervised learning techniques, covering the following in detail:

- What is Unsupervised Learning?
- Types Of Unsupervised Learning
- Understanding clustering & its types
- Hands-on with K-Means & hierarchical clustering

Let’s start this journey of learning by understanding unsupervised learning.

#### What Is Unsupervised Learning?

It is a learning process in which the machine is given an unlabeled data set and has to find unknown or unseen patterns in it.

*It is a kind of self-organized **Hebbian learning** that helps the machine find previously unknown patterns in a data set without pre-existing labels.*

In the picture, the first robot has been given a data set in which the result to be tested and predicted is already known. This is what we call **supervised learning** with labeled data. In the second case, the robot says it has no clue what outcome it is trying to achieve with the given data set. No pre-labeled data set is fed to the robot; this is what we call **unsupervised learning.**

Unlike supervised learning, where we have labeled data to classify or predict, here we have no target variable against which to measure the model's outcome. The machine learns by itself from the unlabeled data set, which also allows it to model the probability densities of the given input.

But how does the machine learn in this scenario? We group the given data so that the members of every group share some kind of pattern or common behavior. Let’s go deeper by understanding the types of unsupervised learning and what clustering is.

### Types Of Unsupervised Learning:

The two major types of unsupervised learning methods are:

1. Principal Component Analysis
2. Cluster Analysis (Clustering)

#### A. Principal Component Analysis:

I have covered PCA in depth in a previous article; please refer to it to understand the concept.

#### B. Cluster Analysis (Clustering):

Clustering is a method in which we group or segment a data set so that the data points in each cluster share common attributes, from which relationships can then be extrapolated.

In this machine learning technique, the clusters are not labeled, classified, or categorized in advance. The data is clustered based on the similarity of its features.

*Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.*

#### As per Wiki,

Clustering divides objects based on unknown features. The machine chooses the best way.

Going further, we will mainly look into clustering techniques and their associated algorithms; then we will look at K-Means in detail and go hands-on with one problem statement using Python & Jupyter notebook.

#### Some Typical Use Cases To Make Things Clearer:

Nowadays clustering is used:

- For market segmentation (types of customers, loyalty)
- To merge close points on a map
- For image compression
- To analyze and label new data
- To detect abnormal behavior


**For example,** you have probably used Apple Photos or Google Photos on your phone, where a more complex level of clustering is involved. Your pictures are grouped by the faces in them to create albums of your friends. The app doesn’t know how many friends you have or what they look like, but it tries to find common facial features. This is a typical case of unsupervised learning that involves clustering.

Similarly, e-commerce giants like Amazon use customer segmentation to offer product recommendations based on group behavior and the similarity of the attributes of the consumers in that particular group.

#### Another visual example below to help you understand the clustering mechanism:

### Clustering Techniques:

The three major types of clustering techniques are:

- **Hierarchical Clustering**
- **Partitional Clustering**
- **Bayesian Clustering**

The below diagram gives the details of the type of clustering mechanism employed in unsupervised learning.

We will mainly look into two of them in today’s journey.

- Hierarchical Clustering
- K-Means Clustering

### A. Hierarchical Clustering:

Hierarchical clustering is an algorithm that groups similar objects into clusters, where each successive cluster is formed from previously established clusters. The endpoint is a set of clusters in which each cluster is distinct from the others and the objects within each cluster are broadly similar to each other.

Strategies for hierarchical clustering generally fall into two types:

- **Agglomerative**: a bottom-up approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- **Divisive**: a top-down approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

### B. Partitional Clustering:

This clustering method partitions the objects into k clusters, with each partition forming one cluster. It optimizes an objective criterion, such as a similarity function based on distance; examples are K-means and CLARANS (Clustering Large Applications based upon RANdomized Search). This mechanism determines all the clusters at once.

Partitional clustering decomposes a data set into a set of disjoint clusters. Given a data set of N points, a partitioning method constructs K (N ≥ K) partitions of the data, with each partition representing a cluster. That is, it classifies the data into K groups by satisfying the following requirements: (1) each group contains at least one point, and (2) each point belongs to exactly one group. Notice that for fuzzy partitioning, a point can belong to more than one group.

One of the most popular partitional clustering methods is:

#### K-Means Clustering:

K-means clustering is a method of partitioning *n* observations into *k* clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.

*The outcome of the K-means clustering algorithm is:*

- The centroids of the *K* clusters, which can be used to label new data
- Labels for the training data (each data point is assigned to a single cluster)
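As a rough illustration of these two outputs, the sketch below fits scikit-learn's `KMeans` on a small made-up 2-D array. The data, `n_clusters=2`, `n_init=10`, and `random_state=0` are assumptions chosen for this example and do not come from the article's data set.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two visibly separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

centroids = kmeans.cluster_centers_   # one centroid per cluster, shape (2, 2)
train_labels = kmeans.labels_         # cluster index for every training point

# The fitted centroids can label unseen points by nearest-centroid lookup
new_points = np.array([[0.9, 1.0], [8.1, 8.1]])
new_labels = kmeans.predict(new_points)

print(centroids)
print(train_labels, new_labels)
```

Which group ends up as cluster 0 or cluster 1 is arbitrary; only the grouping itself is meaningful.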

#### How The K-Means Algorithm Works:

It aims to partition a set of observations into several clusters (k), resulting in the partitioning of the data into Voronoi cells. It can be considered a method of finding out which group a certain object belongs to.

*This algorithm works in the following manner:*

1. K points are placed into the object data space, representing the initial group of centroids.
2. Each object or data point is assigned to the closest centroid.
3. After all objects are assigned, the positions of the k centroids are recalculated.
4. Steps 2 and 3 are repeated until the positions of the centroids no longer move.
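The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `k_means`, the choice of random data points as initial centroids, the iteration cap, and the toy data are all assumptions made for the example.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (hypothetical): two separated groups of three points each
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = k_means(X, k=2)
print(centroids)
print(labels)
```

On real data the result depends on the initial centroids, which is why library implementations usually run several initializations and keep the best one.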

### Understanding Hierarchical Clustering In Detail With Example:

#### A. Agglomerative Clustering :

It is a bottom-up hierarchical clustering algorithm in which we treat each data point as a single cluster to start with and then successively merge (or *agglomerate*) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. As we merge clusters based on similarity, we end up with a tree called a *dendrogram*.

*The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.*

Agglomerative Hierarchical Clustering

To understand how the above merging happens we need to understand the following key concepts:

#### Measures of distance (similarity)

The distance between two given clusters is computed using a distance metric such as Euclidean distance. There are other distance metrics too, such as squared Euclidean and Manhattan; the choice of metric depends on the domain and the business problem one is trying to solve.
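To make the three metrics named above concrete, here is a small sketch that computes them for two made-up feature vectors. The vectors `a` and `b` are hypothetical values picked so the arithmetic is easy to follow.

```python
import numpy as np

# Two hypothetical customers described by three spending features
a = np.array([2.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 6.0])

diff = a - b                               # [-3, -4, 0]
euclidean = np.sqrt(np.sum(diff ** 2))     # straight-line distance
sq_euclidean = np.sum(diff ** 2)           # Euclidean distance without the square root
manhattan = np.sum(np.abs(diff))           # sum of absolute coordinate differences

print(euclidean, sq_euclidean, manhattan)  # 5.0 25.0 7.0
```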

#### Linkage Criteria:

Once the distance metric is selected, one needs to decide between which points the distance is calculated. This is done through the linkage criterion. *How* pairs merge in agglomerative clustering depends on computing a dissimilarity between each newly merged pair and the other samples, which is what the linkage criterion defines.

**Single Linkage:**

Here the distance between two clusters is the distance between the nearest neighbors in those clusters.

**Example:** The distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two closest points.

In single-link clustering, we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

**Complete Linkage:**

**Example:** the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest points

In this type of clustering (it is also called the *diameter* or *maximum* method), we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

Here the distance between two clusters is the distance between the furthest points in those clusters.

**Average Linkage:**

In *average-link* clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from every member of one cluster to every member of the other cluster.

**Example**: the distance between clusters “r” and “s” to the left is equal to the average length of the arrows connecting the points of one cluster to the other.

Here the distance between two clusters is the average of the distances between all the points in those clusters.

**Centroid-linkage**: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.
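The four linkage criteria above can be compared side by side on a tiny example. The sketch below uses two made-up clusters named `r` and `s` after the examples; the coordinates are assumptions chosen so the distances come out as round numbers.

```python
import numpy as np

# Two small hypothetical clusters of 2-D points
r = np.array([[0.0, 0.0], [1.0, 0.0]])
s = np.array([[4.0, 0.0], [6.0, 0.0]])

# Matrix of Euclidean distances between every point in r and every point in s
pairwise = np.linalg.norm(r[:, None, :] - s[None, :, :], axis=2)

single = pairwise.min()      # nearest pair of points
complete = pairwise.max()    # furthest pair of points
average = pairwise.mean()    # mean over all cross-cluster pairs
centroid = np.linalg.norm(r.mean(axis=0) - s.mean(axis=0))  # distance between cluster means

print(single, complete, average, centroid)  # 3.0 6.0 4.5 4.5
```

Average and centroid linkage happen to coincide here because all points lie on one line; in general they give different values.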

#### Divisive Clustering:

It is a top-down clustering method and is less commonly used. It works similarly to agglomerative clustering, but in the opposite direction.

*Here, all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.*

**To summarize:**

- Data starts as one single combined cluster.
- The cluster splits into two distinct parts, based on some degree of similarity.
- Clusters split into two again and again until the clusters only contain a single data point.

Like agglomerative clustering, the divisive method is rigid: once a merge or split is done, it can never be undone.
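The top-down procedure can be sketched as follows. Note the assumptions: the article does not say how a cluster should be split, so this example uses a 2-means split (the idea behind bisecting k-means) as one possible rule, always splits the largest remaining cluster, and stops at a requested number of clusters instead of going down to single points. The function name `divisive_clustering` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters, seed=0):
    """Top-down clustering: repeatedly split the largest cluster in two."""
    labels = np.zeros(len(X), dtype=int)          # everything starts in cluster 0
    for new_label in range(1, n_clusters):
        # Pick the cluster with the most points as the next one to split
        target = np.bincount(labels).argmax()
        idx = np.flatnonzero(labels == target)
        # Split it into two parts with 2-means (assumed split rule)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label      # one half gets a fresh label
    return labels

# Three hypothetical groups of points placed along the x-axis
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 0.0], [4.1, 0.2], [3.9, 0.1],
              [14.0, 0.0], [14.2, 0.1], [13.9, 0.2]])
labels = divisive_clustering(X, n_clusters=3)
print(labels)
```

Because a split is never revisited, an early poor split carries through to the final result, which is the rigidity described above.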

### Hands-on With Hierarchical Clustering Using Python:

We will make use of the Wholesale customers data set. This file has been sourced from UCI: https://archive.ics.uci.edu/ml/datasets/wholesale+customers

The data set refers to the clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

*Note: We expect that you have installed Anaconda so that you can use Jupyter as your code editor. We will be using Python 3, which comes preloaded with the Anaconda installation.*

### Attribute Information:

- FRESH: annual spending (m.u.) on fresh products (Continuous)
- MILK: annual spending (m.u.) on milk products (Continuous)
- GROCERY: annual spending (m.u.) on grocery products (Continuous)
- FROZEN: annual spending (m.u.) on frozen products (Continuous)
- DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
- DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
- CHANNEL: customers Channel — Horeca (1: Hotel/Restaurant/Cafe) or Retail channel (Nominal).

### Objective:

We will try to cluster the given data set into customer segments based on the purchases made by the various consumers. The intent is to group similar customers together using a hierarchical clustering technique.

#### Importing Packages:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

import warnings

import itertools

warnings.filterwarnings("ignore")

warnings.simplefilter(action='ignore', category=FutureWarning)

### Loading The Data Set into Pandas Dataframe:

#loading the data set

ws_df = pd.read_csv('Wholesalecustomers_data.csv')

ws_df.head(100)

Here is what the output will look like:

### Dropping Unused Columns:

As we will not need the Channel & Region attributes in our cluster analysis, let’s drop them from our data frame using the pandas drop method.

ws_df.drop(labels=(['Channel','Region']),axis=1,inplace=True)

ws_df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 440 entries, 0 to 439

Data columns (total 6 columns):

Fresh 440 non-null int64

Milk 440 non-null int64

Grocery 440 non-null int64

Frozen 440 non-null int64

Detergents_Paper 440 non-null int64

Delicassen 440 non-null int64

dtypes: int64(6)

memory usage: 20.7 KB

#### Let’s See The Shape Of Our Dataframe:

ws_df.shape

output:

(440, 6)

#Our Dataframe is of size 440 rows and 6 columns.

### Quick Comments:

As all the attributes above are non-null and have an equal number of rows, there seem to be no missing values. All the attributes are numerical.

### EDA: Exploratory Data Analysis:

Let’s eyeball all the attributes and analyze:

- How are they distributed?
- What is the statistical five-point summary?
- Are there any outliers in the given data set?

### Descriptive Statistics:

ws_df.describe().T

Output:

### Quick Observation On Descriptive Statistics Output:

- Fresh, Milk, Grocery, and Delicassen all have mean and median values that differ a lot, which means the data is skewed and not normally distributed
- None of the attributes seems to have a normal distribution
- There seems to be skewness in the data attributes
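The link between a mean/median gap and skewness can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed spending column; the series name `spend` and the distribution parameters are assumptions, not values from the wholesale data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed spending values drawn from a log-normal distribution
rng = np.random.default_rng(0)
spend = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000), name="spend")

mean, median, skew = spend.mean(), spend.median(), spend.skew()
print(f"mean={mean:.0f}  median={median:.0f}  skew={skew:.2f}")
# For a right-skewed column the mean sits above the median and skew is positive
```

The same `.skew()` call can be applied to a whole data frame to get one skewness value per column.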

### Univariate Data visualization of attributes:

import itertools

# Plot every column of the data frame

attr_col = list(ws_df.columns)

length = len(attr_col)

cs = ["b","r","g","c","m","k"]

fig = plt.figure(figsize=(13,25))

for i, j, k in itertools.zip_longest(attr_col, range(length), cs):

    plt.subplot(4,2,j+1)

    # distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is its successor

    ax = sns.distplot(ws_df[i],color=k,rug=True)

    ax.set_facecolor("w")

    plt.axvline(ws_df[i].mean(),linestyle="dashed",label="mean",color="k")

    plt.legend(loc="best")

    plt.title(i,color="navy")

    plt.xlabel("")

Output:

### Quick Observation:

As can be seen above almost all the attributes are highly skewed with long positive tails on the right side.

### Identifying Outliers: Using Box Plot Analysis

ax = sns.boxplot(data=ws_df, orient="h")

### Quick Observation On Outlier:

As can be seen from the above, there seem to be loads of outliers in almost every attribute. Let’s normalize the data set using **sklearn’s** normalize method, which rescales each sample (row) to unit norm.
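As a quick illustration of what that call does, the sketch below applies `normalize` to a tiny made-up array (the values are assumptions chosen for easy arithmetic). With its default settings it divides every row by that row's L2 norm, so customers with the same spending mix but different volumes end up with the same vector.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two hypothetical customers: same spending mix, very different volumes
X = np.array([[3.0, 4.0],
              [30.0, 40.0]])

X_norm = normalize(X)                   # default: L2 norm, applied per row
print(X_norm)                           # both rows become [0.6, 0.8]
print(np.linalg.norm(X_norm, axis=1))   # every row now has length 1
```

This is different from standardizing each column to zero mean and unit variance, which is what `StandardScaler` does.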

from sklearn.preprocessing import normalize

X_std = normalize(ws_df)

X_std = pd.DataFrame(X_std, columns=ws_df.columns)

X_std.head()

Output:

### Hierarchical Clustering Using SciPy Package:

### Plotting Dendrograms:

import scipy.cluster.hierarchy as shc

plt.figure(figsize=(15, 10))

plt.title("Dendrograms")

dend = shc.dendrogram(shc.linkage(X_std, method='ward'))

Output:

### Plotting Dendrogram Cut-Off

The x-axis shows the data points and the y-axis shows the distance between them. The vertical line with the maximum distance is the blue line, so we choose a threshold of 6 and cut the dendrogram there:

plt.figure(figsize=(15, 10))

plt.title("Dendrograms")

dend = shc.dendrogram(shc.linkage(X_std, method='ward'))

plt.axhline(y=6, color='y', linestyle='-')

#### Output:

<matplotlib.lines.Line2D at 0x1a2065d978>
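Drawing the cut-off line only visualizes the threshold. To turn such a cut into actual cluster labels, SciPy offers `fcluster`. The sketch below shows the idea on made-up points; the toy array and the threshold `t=2.0` are assumptions for this example (the wholesale data would use the threshold chosen from its own dendrogram).

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Hypothetical 2-D points forming two well-separated groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

Z = shc.linkage(X, method='ward')     # same kind of linkage call as used for the dendrogram

# Cut the tree at height 2.0 (assumed threshold for this toy data):
# merges above that height are undone, leaving flat clusters
labels = shc.fcluster(Z, t=2.0, criterion='distance')
print(labels)
```

`fcluster` numbers clusters starting from 1, unlike scikit-learn, which starts from 0.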

### Let’s Cluster Our Data Using Agglomerative Clustering:

from sklearn.cluster import AgglomerativeClustering

# ward linkage always uses Euclidean distance; the affinity argument was removed in scikit-learn 1.4

agg_clu = AgglomerativeClustering(n_clusters=2, linkage='ward')

agg_clu.fit_predict(X_std)

Output:

array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,

0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,

1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,

1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,

0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,

0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,

0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,

0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,

0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,

0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0,

0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,

0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,

1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,

0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,

0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,

1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,

0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,

1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,

1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1])

### Cluster Visualization:

Let’s visualize the Milk & Grocery data clusters using a scatter plot:

plt.figure(figsize=(15, 10))

plt.scatter(X_std['Milk'], X_std['Grocery'], c=agg_clu.labels_)

#### Output:

<matplotlib.collections.PathCollection at 0x1a200162e8>

As the plot above shows, we have managed to segregate the Milk and Grocery data into two separate clusters using an unsupervised machine learning technique called hierarchical clustering.