Posted on: July 21, 2020 Posted by: admin
Data Visualization Using Python

I Feel:

In today’s digital world, data has become as important as air. Machines and humans alike are literally breathing data in and out.

People consume and generate huge volumes of data, knowingly and unknowingly, on a daily basis. It is this bombardment of digital information that businesses are trying to tap and harness to sell more and engage their customers better. Industries of all types are bringing a personal touch to their services and offerings to give their customers a great user experience. All of this has become possible thanks to powerful data-science-enabled AI/ML techniques, which empower our machines to take analytical decisions based on the sea of data accessible to them.

To analyze these huge data sets, our machines make use of some really powerful data visualization packages built in Python. So we will try to cover:

1. What Is Data Visualization?

2. What are the Data Visualization Packages?

3. How To Use Them?

4. Why You Should Learn Them?

We will cover these in this series on data visualization using Python, which we will break into several parts.

Data Visualization In Data Science:

As we know, the human mind is trained to understand more through images; as the saying goes, “A picture is worth a thousand words”. This is especially relevant when you are learning data science. You will be dealing with large data sets that need visual expression before you can deduce the valuable patterns hidden in them.

Data visualization is a technique in the field of data science that allows you to tell a compelling story by presenting data and findings in an approachable and stimulating way. It makes complex data look simple and easy to understand.

Data Visualization Tools:

We will try to cover some of the popular data visualization tools given below:

  1. Matplotlib
  2. Seaborn
  3. Plotly
  4. Pandas

Learning how to leverage these software tools to visualize data will help you make sense of data, extract meaningful information, and plot it visually to make more effective data-driven decisions.

So let’s get started with Matplotlib, which we will cover in today’s article; the rest we will cover in upcoming parts of this data visualization series.

A: Matplotlib:

As per the official Matplotlib portal:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Matplotlib is a widely used data visualization tool. It works at a low level with a MATLAB-like interface and offers a lot of flexibility in how you write your code. Yes, it can sometimes be tedious to write more code, but it is worth it for the freedom it gives.

Installing Matplotlib:

  1. Using PIP:
python -m pip install -U pip
python -m pip install -U matplotlib

2. Using Scientific Python Distribution :

There are many third-party scientific Python distributions.

Anaconda is my personal favorite. It is one of the most popular Python data science distributions; it gives you a hassle-free installation of data science packages and comes pre-loaded with NumPy, SciPy, Pandas, Matplotlib, Plotly, etc. I recommend installing it, and you will be all set in no time.

You can install any package with the conda command from the Anaconda prompt or a terminal, though you may need to visit the package’s official site to get the exact command format.

conda install PackageName

For Matplotlib:

conda install matplotlib
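
Whichever installation route you choose, a quick sanity check is to import the package and print its version. This is a minimal sketch; the variable name version is only for illustration, and the number you see depends on what was installed.

```python
import matplotlib

# __version__ is a plain string such as "3.1.0"; the exact value depends on your install.
version = matplotlib.__version__
print("Matplotlib version:", version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running this code.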

The various types of data visualization Matplotlib provides are:

  1. Lines, bars and markers
  2. Images, contours & fields
  3. Pie & polar charts
  4. Statistical level Plotting

& many more..

They are widely used for line charts, bar charts, histograms, pie charts, etc.

For details, visit the gallery section of the Matplotlib 3.1.0 documentation, which contains examples of the many things you can do with Matplotlib.

Plotting With MatplotLib: Let’s Learn By Examples:

As discussed, Matplotlib supports various kinds of plots, ranging from scatter plots to bar charts to histograms. The choice is contextual and depends on our data visualization requirements, such as comparing groups, comparing two quantitative variables to each other, or understanding a data distribution.

We will cover a few popular plotting techniques here:

Basic Requirements :

Before we start getting our hands dirty with some real examples, we need to have a few installations ready:

Install Anaconda Distribution:

1. First, you need to ensure Anaconda is installed:

Use the installation page of the Anaconda 2.0 documentation to learn the installation process. It is easy and you can get started quickly.
On Windows, macOS, and Linux, it is best to install Anaconda for the local user, which does not require administrator…

Launch Jupyter Notebook:

Once you are done installing the Anaconda distribution, open Anaconda Navigator on your computer and launch Jupyter Notebook as shown in the image below. We will be using Jupyter Notebook to code our examples.

Check for the Pre-requisite Package Installation:

Refer to the image below: go to the Environments menu option and you will see the pre-installed packages on the right. For example, search for pandas and you will see that it is pre-installed. Similarly, you can type in any required package and install it through Anaconda Navigator if it is not already installed. Check that matplotlib, numpy, pandas, seaborn, etc. are installed, and install any that are missing.

Once you are done with the required package installation, let’s get started with our first plot, the bar chart:

Some Key Points About Matplotlib To Remember:

Matplotlib has an important module called pyplot, which provides the plotting functions. A Jupyter notebook is a convenient place to run the plots: it gives a hassle-free experience and is easy to get started with. We import matplotlib.pyplot as plt so that we can call the module’s functions through the short alias plt.

  • You can import the required libraries and load the dataset to plot using Pandas pd.read_csv()
  • Use plt.plot() for plotting a line chart; other functions are used in place of plot for other chart types. All plotting functions require data, which is passed to the function through parameters.
  • Use plt.xlabel() and plt.ylabel() for labeling the x and y-axis respectively.
  • Use plt.xticks() and plt.yticks() for labeling the tick points on the x and y-axis respectively.
  • Use plt.legend() for identifying the plotted variables.
  • Use plt.title() for setting the title of the plot.
  • Use plt.show() for displaying the plot.
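
To see how the functions listed above fit together, here is a minimal sketch of a line chart. The month numbers and sales figures are made up purely for illustration; only the plt calls come from the list above.

```python
import matplotlib.pyplot as plt

# Hypothetical data: monthly sales for two made-up products.
months = [1, 2, 3, 4, 5, 6]
product_a = [10, 12, 9, 14, 17, 15]
product_b = [8, 9, 11, 10, 13, 16]

# plt.plot() draws a line chart; label= is the text plt.legend() will show.
plt.plot(months, product_a, label="Product A")
plt.plot(months, product_b, label="Product B")

# Axis labels, then tick positions (and, for the x-axis, tick labels).
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.xticks(months, ["Jan", "Feb", "Mar", "Apr", "May", "Jun"])
plt.yticks(range(0, 21, 5))

plt.legend()
plt.title("Monthly sales (made-up data)")
plt.show()
```

The order of the labeling calls does not matter much, as long as they all come before plt.show().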

1. Bar Chart Plotting:

Bar Plotting Example :

# Here we import the pyplot module of Matplotlib under the alias plt
import matplotlib.pyplot as plt

# Two bar series; label= names each one in the legend
plt.bar([1, 3, 5, 7, 9], [5, 2, 7, 8, 2], label="Example one")
plt.bar([2, 4, 6, 8, 10], [8, 6, 2, 5, 6], label="Example two", color='g')
plt.legend()
plt.xlabel('bar number')
plt.ylabel('bar height')
plt.title('Wow! We Got Our First Bar Graph')
plt.show()

Copy the above code and paste it in your Jupyter notebook, run it and you will be able to see the bar plot visuals as shown below:


After we import Matplotlib, its submodule pyplot provides the bar method, which helps you plot a basic bar graph.

The plt.bar method can be better understood from its signature and the explanation given below:
matplotlib.pyplot.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
So, to make a bar plot:
The bars are positioned at x with the given alignment. Their dimensions are given by width and height. The vertical baseline is bottom (default 0).
Each of x, height, width, and bottom may either be a scalar applying to all bars, or it may be a sequence of length N providing a separate value for each bar.
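
The scalar-versus-sequence behaviour described above can be sketched as follows. The numbers are arbitrary, and the names uniform and varied are only for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
height = [5, 2, 7, 8]

# A scalar width applies to every bar; bottom keeps its default of 0.
uniform = plt.bar(x, height, width=0.5)

# Sequences of length N give each bar its own width and its own baseline.
varied = plt.bar(x, height, width=[0.2, 0.4, 0.6, 0.8],
                 bottom=[10, 11, 12, 13], align="center")

# plt.bar returns a container of rectangles, one per bar, which we can inspect.
widths_uniform = [bar.get_width() for bar in uniform]
widths_varied = [bar.get_width() for bar in varied]
bottoms_varied = [bar.get_y() for bar in varied]
print(widths_uniform)
print(widths_varied)
print(bottoms_varied)
```

Inspecting the returned rectangles is just a convenient way to confirm how the arguments were applied; in everyday plotting you would usually ignore the return value.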

For more detail, see the Matplotlib 3.1.0 documentation:
The optional arguments color, edgecolor, linewidth, xerr, and yerr can be either scalars or sequences of length equal…

2. Histogram:

A histogram is a plot of the frequency distribution of a numeric array, obtained by splitting it into small equal-sized bins.

Histograms are used to estimate the distribution of the data, with the frequency of values assigned to a value range called a bin.

If you want to mathematically split a given array into bins and frequencies, use NumPy’s histogram() function. If you want to visualize the distribution of numeric values, use the .hist() plotting method to create a simple histogram.

Matplotlib provides the functionality to visualize Python histograms out of the box with a versatile wrapper around NumPy’s histogram():


# Histogram code
import matplotlib.pyplot as plt
import numpy as np  # NumPy generates the sample array

# Draw 500 samples from a Laplace distribution (location 15, scale 3)
d = np.random.laplace(loc=15, scale=3, size=500)
print(d[:5])  # peek at the first five values

# An "interface" to the matplotlib.axes.Axes.hist() method
n, bins, patches = plt.hist(x=d, bins='auto', color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.title('My First Histogram Ever')
plt.text(23, 45, r'$\mu=15, b=3$')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(top=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()


The pyplot.hist() function in Matplotlib lets you draw a histogram. It requires an array as input, and you can optionally specify the number of bins. A histogram plot uses the bin edges on the x-axis and the corresponding frequencies on the y-axis. In the code above, passing bins='auto' chooses between two algorithms to estimate the “ideal” number of bins. At a high level, the goal of the algorithm is to choose a bin width that generates the most faithful representation of the data.

Output of the histogram code above:
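
If you only want the numbers behind the chart, NumPy’s histogram() returns the frequencies and bin edges without drawing anything. The sketch below uses a seeded random generator (an assumption made here so that reruns give the same sample) and prints the first few bins.

```python
import numpy as np

# A fixed seed so that reruns produce the same sample (illustrative choice).
rng = np.random.default_rng(seed=0)
d = rng.laplace(loc=15, scale=3, size=500)

# counts[i] is the number of values that fall between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges = np.histogram(d, bins="auto")

# Show the first five bins as "left edge, right edge -> frequency".
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts[:5]):
    print(f"[{left:6.2f}, {right:6.2f}) -> {count}")

print("number of bins:", len(counts))
print("total values counted:", counts.sum())
```

There is always one more bin edge than there are bins, and the frequencies add up to the size of the input array.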

3. Scatter Plot:

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, for weight and height, weight would be on the y axis and height would be on the x axis. Correlations may be positive (rising), negative (falling), or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation.
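
The three cases can also be checked numerically. The sketch below builds made-up height and weight data (the formulas and noise levels are assumptions chosen only to produce rising, falling, and unrelated patterns) and computes the Pearson correlation coefficient with np.corrcoef.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
height_cm = rng.uniform(150, 190, size=200)

# Hypothetical relationship: weight rises with height, plus some noise.
weight_kg = 0.9 * height_cm - 90 + rng.normal(0, 5, size=200)
# A made-up variable that falls as height rises.
falling = -0.5 * height_cm + rng.normal(0, 5, size=200)
# A made-up variable with no relationship to height.
unrelated = rng.normal(0, 1, size=200)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_pos = np.corrcoef(height_cm, weight_kg)[0, 1]
r_neg = np.corrcoef(height_cm, falling)[0, 1]
r_null = np.corrcoef(height_cm, unrelated)[0, 1]

print(f"rising pattern:    r = {r_pos:+.2f}")
print(f"falling pattern:   r = {r_neg:+.2f}")
print(f"no clear pattern:  r = {r_null:+.2f}")
```

A coefficient near +1 goes with dots sloping from lower left to upper right, a coefficient near -1 with dots sloping from upper left to lower right, and a coefficient near 0 with no visible slope.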

Scatter plot Method format :

matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, *, plotnonfinite=False, data=None, **kwargs)

x, y : array_like, shape (n, )

The data positions.

s : scalar or array_like, shape (n, ), optional

The marker size in points**2. Default is rcParams['lines.markersize'] ** 2.

c : color, sequence, or sequence of color, optional

For more detail about using the scatter plot method, please refer to the matplotlib.pyplot.scatter page of the Matplotlib 3.1.0 documentation.
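
As a sketch of how the s and c parameters can carry an additional variable, the example below maps a made-up third variable to both marker size and marker color. The variable names and the scaling of the sizes are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
n = 50
x = rng.random(n)
y = rng.random(n)
third = rng.random(n)       # hypothetical third variable in [0, 1)
sizes = 20 + 200 * third    # marker area in points**2, one value per point

# s takes one size per point; c takes one number per point, mapped through cmap.
points = plt.scatter(x, y, s=sizes, c=third, cmap="viridis", alpha=0.6)
plt.colorbar(points, label="third variable")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Size and color encode a third variable")
```

In a notebook the figure appears when the cell finishes; in a plain script, add plt.show() at the end.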

Scatter Plot Example:

# Scatter plot example using Matplotlib
import numpy as np
import matplotlib.pyplot as plt

# Create data
N = 100
x = np.random.rand(N)
y = np.random.rand(N)
color = (0.0, 0.4, 1.0)  # one RGB color; each component must lie in [0, 1]
area = np.pi * 3         # marker area in points**2

# Plot
plt.scatter(x, y, s=area, color=color, alpha=0.5)
plt.title('Scatter plot example using matplotlib')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Run the code in your Jupyter notebook and you will see the output shown below:

Understanding Data Visualization Through Real Data Sets :

We will be using the Automobile Dataset, which we have downloaded from Kaggle, to understand data visualization using Matplotlib.

Always Remember:

  1. Download the Automobile.csv file from the above link
  2. Upload the file into the Jupyter working directory where your current code files live.

Plotting Histogram: Grouping Data by Category:

We can have multiple histogram plots in the same plot. This helps you to compare the distribution of a continuous variable grouped by different categories.

To understand it, we will be using the Automobile.csv data set:

Reading Data Sets:

import pandas as pd

# Read the automobile data set using the pandas read_csv() function
df = pd.read_csv('Automobile.csv')
# In Jupyter, a bare expression on the last line displays its value,
# so you will see the first rows of the data, column by column.
df.head()
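
Before plotting, it is worth checking that the numeric columns were actually read as numbers. The sketch below does not use the real file: it reads a tiny made-up CSV from a string, and the "?" placeholder for a missing value is an assumption about how such data sets sometimes mark gaps, not something confirmed for Automobile.csv.

```python
import io
import pandas as pd

# Made-up rows standing in for Automobile.csv; "?" is an assumed missing-value marker.
csv_text = """make,body_style,horsepower,price
audi,sedan,102,13950
bmw,sedan,?,16430
toyota,hatchback,62,5348
"""
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.dtypes)  # horsepower is read as text because of the "?"

# Convert to numbers; anything that cannot be parsed becomes NaN.
df_demo["horsepower"] = pd.to_numeric(df_demo["horsepower"], errors="coerce")
missing = int(df_demo["horsepower"].isna().sum())
print("missing horsepower values:", missing)
```

If df.dtypes already shows a numeric type for the column you want to plot, no conversion is needed.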

Let’s compare the distribution of car horsepower for different car makes in the Automobile.csv data set above.

Write or copy-paste the code below into your Jupyter notebook file:

import matplotlib.pyplot as plt
# Use this line if you don't want to call plt.show() after every plot
%matplotlib inline

# Horsepower values for each car make
x1 = df.loc[df.make=='alfa-romero', 'horsepower']
x2 = df.loc[df.make=='audi', 'horsepower']
x3 = df.loc[df.make=='bmw', 'horsepower']
x4 = df.loc[df.make=='toyota', 'horsepower']
x5 = df.loc[df.make=='volvo', 'horsepower']

kwargs = dict(alpha=0.9, bins=100)
plt.hist(x1, **kwargs, color='g', label='alfa-romero')
plt.hist(x2, **kwargs, color='b', label='audi')
plt.hist(x3, **kwargs, color='r', label='bmw')
plt.hist(x4, **kwargs, color='y', label='toyota')
plt.hist(x5, **kwargs, color='c', label='volvo')  # distinct color so volvo is not confused with toyota
plt.gca().set(title='Horsepower variation for various makes of car', xlabel='Horsepower', ylabel='Frequency')
plt.legend()  # show which color belongs to which make

Below is a histogram Plot plotted against the given set of values using

You can clearly make out that the largest concentration of horsepower lies between 110–120 hp.
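
When each group is binned separately, the bars of different groups do not line up, which makes them harder to compare. One possible variation, sketched below on a small made-up table rather than the real file, is to compute one set of bin edges from the whole column and loop over the groups with groupby.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for Automobile.csv; the values are made up.
df_demo = pd.DataFrame({
    "make": ["audi"] * 4 + ["bmw"] * 4 + ["toyota"] * 4,
    "horsepower": [102, 115, 110, 140, 101, 121, 182, 121, 62, 70, 92, 116],
})

# One shared set of 10 bins spanning the whole column, so all groups align.
bin_edges = np.linspace(df_demo["horsepower"].min(),
                        df_demo["horsepower"].max(), 11)

counts_by_make = {}
for make, group in df_demo.groupby("make"):
    counts, _, _ = plt.hist(group["horsepower"], bins=bin_edges,
                            alpha=0.6, label=make)
    counts_by_make[make] = counts

plt.xlabel("Horsepower")
plt.ylabel("Frequency")
plt.title("Horsepower by make (made-up data)")
plt.legend()
```

The loop also avoids repeating one plt.hist line per make, which helps when the number of categories grows.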

Scatter Plot :

Let’s plot a data distribution using a scatter plot. Here we will try to see the price distribution based on the body_style of the car.

Copy and paste the code below into your Jupyter notebook and run it:

# Scatter plot
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

df = pd.read_csv('Automobile.csv')
bodystyle = df['body_style']  # body style of each car
price = df['price']           # price of each car
plt.scatter(bodystyle, price, edgecolors='r')
plt.xlabel('body_style')
plt.ylabel('price ($)')
plt.title('Price variation based on car body type')


Observation :

You can see that there is a lot of data density around sedan-type cars, and their price mostly falls in the budget range of $10K to $15K. The second most common car body type comes out to be the hatchback. Wagon-type cars mostly fall in the low-budget range.
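
Observations read off a chart can be cross-checked with a quick summary table. The sketch below uses a few made-up rows instead of the real file and counts the cars and the median price per body style with groupby.

```python
import pandas as pd

# Made-up rows for illustration; the real numbers come from Automobile.csv.
df_demo = pd.DataFrame({
    "body_style": ["sedan", "sedan", "sedan", "hatchback", "hatchback", "wagon"],
    "price": [13950, 17450, 9988, 5151, 6295, 7349],
})

# Number of cars and median price for each body style.
summary = df_demo.groupby("body_style")["price"].agg(["count", "median"])
print(summary)
```

Running the same two lines on the real df gives the counts and typical prices behind the dense and sparse regions of the scatter plot.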

What’s Next :

There are more plots which we have not covered yet, like:

  1. Violin plot
  2. Stacked plot
  3. Stem Plot
  4. Line Plot
  5. Box Plot

We will cover these in part 2 of this series on data visualization. We will also cover data visualization using the Seaborn package in detail.

When to Use What Type Of Data Visualization Plots/Charts?

I leave you with this wonderful pictorial guide to data visualization graph types, which explains what type of graph you can choose based on your data analysis requirements:

Summing Up:

An understanding of data science is strongly recommended for all software engineers who want to take advantage of the amazing opportunities the field of data engineering is poised to offer. With data engineering augmented by AI/ML techniques, you can grow fast and become an instrument of change for your organization or your own startup.

The future will be all about data analysis, data prediction, product recommendations, and process automation. All of these will require a lot of data engineers who can help organizations make accurate, fast, and intelligent decisions about their services and product offerings.
