Data Visualization using Python

Tavish Aggarwal
It is essential to visualize our data before going a step further and making predictions. Data visualization is the process of extracting information about data using various visualization techniques.

 

In this post, I will explain various plots that are essential elements of your toolbox.

I will be using an example dataset from a Kaggle competition. I am also sharing my Jupyter notebook in case you get stuck anywhere while going through the post.

Let's get started.

Importing Packages

The first step is to import the packages that we need to create visualizations.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 

Reading Dataset 

Once the packages are imported, we next load the CSV file into a pandas DataFrame.

In [2]:
housePropertyDataset = pd.read_csv('house_property_sales.csv')
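Before plotting, it can help to confirm that the columns used later in the post are present. The following is a minimal sketch, assuming a tiny in-memory CSV with made-up values in place of house_property_sales.csv so that it runs on its own; the names sample, required and missing are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for house_property_sales.csv (values are made up)
csv_text = """Id,GarageArea,SalePrice,Street
1,548,208500,Pave
2,460,181500,Pave
3,608,223500,Grvl
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Check that the columns we plan to plot exist
required = ['GarageArea', 'SalePrice', 'Street']
missing = [col for col in required if col not in sample.columns]

print(sample.shape)    # number of rows and columns
print(sample.dtypes)   # type of each column
print(missing)         # columns that are absent, if any
```

With the real file, the same checks can be run on housePropertyDataset directly.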

Visualization

Once we have the data loaded into a pandas DataFrame, we are ready to generate visualizations. There are countless visualizations that we can generate from a dataset. Here we will focus on the following techniques:

  1. Seaborn Library
  2. Zooming axes
  3. Generating subplots
  4. Two-dimensional plots
  5. Generating regression plots
  6. Mesh grid
  7. Strip plot
  8. Swarm plot
  9. Violin Plot
  10. Joint Plot
  11. Heatmap Plot
  12. Pair Plot

Let's get started.

Seaborn Library

The seaborn library builds on matplotlib and lets us generate more detailed, better-styled charts. Calling sns.set() applies seaborn's default theme to the plots that follow.

Here we have also used plt.legend() to display a legend, positioned at the top right.

In [3]:
sns.set()

# The label is what plt.legend() displays; without it the legend is empty
plt.plot(housePropertyDataset['SalePrice'], label='SalePrice')

plt.legend(loc='upper right')
plt.show()
 

 

Zoom Axes

We now create the same figure as in the previous example using plt.plot(), this time setting the axis extents using plt.xlim() and plt.ylim(). These commands let us zoom in on or expand the plot, or set the axis ranges to include important values (such as the origin).
 

NOTE: After creating the plot, we can use plt.savefig() to export the image produced to a file.

In [4]:
sns.set()

# We can use the styles for the plot
# print(plt.style.available) used to print all available styles
plt.style.use('ggplot')

plt.plot(housePropertyDataset['SalePrice'])

plt.xlabel('Property ID')
plt.ylabel('Sales Price')

plt.xlim(1100,1200)
plt.ylim(100000,800000)

plt.title('Sale price of properties')

plt.show()
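The note about plt.savefig() can be sketched as follows. This is only an illustration under assumptions: the prices are made-up values standing in for housePropertyDataset['SalePrice'], and the output file name and location are arbitrary choices.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

# Made-up prices standing in for housePropertyDataset['SalePrice']
prices = np.linspace(100000, 300000, 50)

fig, ax = plt.subplots()
ax.plot(prices)
ax.set_xlim(10, 20)             # zoom in on rows 10 to 20
ax.set_ylim(100000, 200000)     # and on this price range

# The file name and location are arbitrary choices
out_path = os.path.join(tempfile.gettempdir(), 'sale_price_zoom.png')
fig.savefig(out_path, dpi=100)
plt.close(fig)
```

The file extension decides the output format, so a name ending in .pdf or .svg would produce a vector image instead.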
 

 

Using subplot() function

The subplot() function lets us place more than one plot in the same figure. It is a more convenient alternative to positioning each plot manually with plt.axes().

subplot() accepts three arguments:
subplot(n_rows, n_columns, n_subplot)

Subplots are numbered row-wise, starting from 1 at the top-left corner.

In [5]:
plt.style.use('ggplot')

plt.subplot(1,2,1)
plt.plot(housePropertyDataset['SalePrice'], color='red')
plt.title('Property Sale Price variation')
plt.xlabel('Property ID')
plt.ylabel('Sales Price')

plt.subplot(1,2,2)
plt.scatter(housePropertyDataset['GarageArea'],housePropertyDataset['SalePrice'], color='green')
plt.title('Sale Price vs Garage Area')
plt.xlabel('Garage Area')
plt.ylabel('Sales Price')


# Improve the spacing between subplots and display them
plt.tight_layout()
plt.show()
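The same layout can also be built with plt.subplots(), which creates the figure and all of its axes in one call. This is a sketch with made-up data standing in for the GarageArea and SalePrice columns, not the code from the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values standing in for the GarageArea and SalePrice columns
rng = np.random.default_rng(0)
garage_area = rng.uniform(0, 1000, 100)
sale_price = 50000 + 200 * garage_area + rng.normal(0, 20000, 100)

# One row, two columns; axes is an array with one entry per subplot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(sale_price, color='red')
axes[0].set_title('Sale price by row')

axes[1].scatter(garage_area, sale_price, color='green')
axes[1].set_title('Sale price vs garage area')

# Improve the spacing between the two subplots
fig.tight_layout()
```

Each subplot is addressed through its own axes object, so there is no need to keep track of which subplot is currently active.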
 

 

2-Dimension Histogram

We can also generate a 2D histogram using the hist2d() function. Let's look at the example shown below:

In [6]:
x = housePropertyDataset['GarageArea']
y = housePropertyDataset['SalePrice']
plt.hist2d(x,y, bins=(10,20), range=((0, 1500), (0, 700000)), cmap='viridis')

plt.colorbar()

plt.xlabel('Garage Area')
plt.ylabel('Property Sales Price')

plt.tight_layout()
plt.show()
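To see what a 2D histogram counts, the bins can be computed without plotting using NumPy's histogram2d(). The points below are made up, and the bin counts (3 by 2) are chosen small so that the result is easy to read.

```python
import numpy as np

# Made-up points standing in for GarageArea (x) and SalePrice (y)
x = np.array([100, 200, 250, 800, 900, 1400])
y = np.array([50000, 120000, 130000, 300000, 310000, 650000])

# 3 bins along x and 2 bins along y, over the same ranges used in the plot
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=(3, 2), range=((0, 1500), (0, 700000)))

print(x_edges)   # boundaries of the x bins
print(y_edges)   # boundaries of the y bins
print(counts)    # counts[i, j] = number of points in x-bin i and y-bin j
```

Each cell of counts corresponds to one rectangle in the plot, and the color bar maps the count to a color.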
 

As shown above, we have grouped the two-dimensional data into rectangular bins. In a similar way, we can also generate hexagonal bins.

In [8]:
x = housePropertyDataset['GarageArea']
y = housePropertyDataset['SalePrice']
plt.hexbin(x,y, gridsize=(15,10),extent=(0, 1500, 0, 700000), cmap='winter')

plt.colorbar()

plt.xlabel('Garage Area')
plt.ylabel('Sales Price')
plt.show()

 

Linear Regression

We can also generate a regression plot using the seaborn library. We will look at regression plots in more detail later. Seaborn provides the lmplot() function to generate regression plots.

In [9]:
# Plot a linear regression between 'GarageArea' and 'SalePrice'
sns.lmplot(x='GarageArea', y='SalePrice', data=housePropertyDataset, 
           col='Street') # We can also use the 'hue' parameter instead of 'col' to plot on the same graph

# Display the plot
plt.show()
 

NOTE: There are more options that can be configured. Refer to the Seaborn lmplot documentation.

Residual Plot

As we can see from the above plot, not all points lie on the fitted straight line. The vertical distance between each point and the line is its residual. Seaborn provides the residplot() function, which we can use to generate a residual plot:

In [10]:
sns.residplot(x='GarageArea', y='SalePrice', data=housePropertyDataset, color='blue')

# Display the plot
plt.show()
 

NOTE: There are more options that can be configured. Refer to the Seaborn residplot documentation.
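The residuals themselves can be computed by hand. The sketch below assumes made-up values for GarageArea and SalePrice and uses np.polyfit() for an ordinary least-squares line; it illustrates the idea rather than seaborn's internal code.

```python
import numpy as np

# Made-up values standing in for GarageArea (x) and SalePrice (y)
x = np.array([0.0, 200.0, 400.0, 600.0, 800.0])
y = np.array([60000.0, 95000.0, 150000.0, 170000.0, 225000.0])

# Fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed value minus the value predicted by the line
residuals = y - (slope * x + intercept)
print(residuals)
```

For a least-squares line with an intercept the residuals sum to (approximately) zero, so a residual plot should scatter around the horizontal zero line when a straight line is a reasonable model.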

Plotting second order regression plots

It is also possible to generate higher-order regression plots using the order argument of the regplot() function provided by seaborn.

In [7]:
plt.scatter(housePropertyDataset['GarageArea'],housePropertyDataset['SalePrice'], label='data', 
            color='red', marker='o')

# scatter=False draws only the fitted curve, since the points are already plotted above
sns.regplot(x='GarageArea', y='SalePrice', data=housePropertyDataset,
            scatter=False, color='blue', label='order 1', order=1)

sns.regplot(x='GarageArea', y='SalePrice', data=housePropertyDataset,
            scatter=False, color='green', label='order 2', order=2)
plt.legend(loc='lower right')
plt.show()
 

NOTE: There are more options that can be configured. Refer to the Seaborn regplot documentation.
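To compare a first-order and a second-order fit numerically, we can fit both polynomials with np.polyfit() and compare the sum of squared residuals. This is a sketch on made-up, slightly curved data; the helper sum_squared_residuals is a hypothetical name introduced for illustration.

```python
import numpy as np

# Made-up, slightly curved data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.1, 5.2, 9.8, 17.1, 25.9])

def sum_squared_residuals(order):
    """Fit a polynomial of the given order and return its squared error."""
    coeffs = np.polyfit(x, y, deg=order)
    fitted = np.polyval(coeffs, x)
    return float(np.sum((y - fitted) ** 2))

sse_order_1 = sum_squared_residuals(1)
sse_order_2 = sum_squared_residuals(2)
print(sse_order_1, sse_order_2)
```

A higher order never fits the training points worse, so a lower error alone does not prove that the higher-order model is the better choice.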

Mesh grid

Meshgrid transforms the domain specified by vectors x and y into arrays X and Y, which can be used to evaluate functions of two variables.

In the example shown below, we evaluate 3 * sqrt(x^2 + y^2) at every point of the grid.

In [11]:
u = list(range(1, 10))
v = list(range(11, 20))
X,Y = np.meshgrid(u,v)
Z  = 3*np.sqrt(X**2 + Y**2)

plt.subplot(2,1,1)
plt.contour(X, Y, Z)

plt.subplot(2,1,2)
plt.contour(X, Y, Z, 20) # draw 20 contour levels


plt.show()
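A smaller grid makes it easier to see what np.meshgrid() returns. The vectors below are chosen only for illustration.

```python
import numpy as np

u = [1, 2, 3]        # x values
v = [10, 20]         # y values
X, Y = np.meshgrid(u, v)

print(X)   # every row repeats u: [[1 2 3], [1 2 3]]
print(Y)   # every column repeats v: [[10 10 10], [20 20 20]]

# A function of two variables is evaluated element-wise on the grid
Z = 3 * np.sqrt(X**2 + Y**2)
print(Z.shape)   # (2, 3): one value per (x, y) pair
```

X and Y have one row per y value and one column per x value, which is the layout that contour() expects.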
 

As we have seen above, the contour() function draws contour lines. To fill the regions between the lines with color, we can use the contourf() function.

In [12]:
u = list(range(1, 10))
v = list(range(11, 20))
X,Y = np.meshgrid(u,v)
Z  = 3*np.sqrt(X**2 + Y**2)

plt.subplot(2,1,1)
plt.contourf(X, Y, Z)

plt.subplot(2,1,2)
plt.contourf(X, Y, Z, 20, cmap='winter') # 20 filled contour levels


plt.show()
 

Strip Plot

Seaborn provides the stripplot() function, which lets us visualize how a numeric variable is distributed within each category. Refer to the example shown below:

In [13]:
plt.subplot(1,2,1)
sns.stripplot(x='Street', y='SalePrice', data=housePropertyDataset, jitter=True, size=3)
plt.xticks(rotation=90)

plt.subplot(1,2,2)
sns.stripplot(x='Neighborhood', y='SalePrice', data=housePropertyDataset, jitter=True, size=3)
plt.xticks(rotation=90)

plt.subplots_adjust(right=3)
plt.show()
 

NOTE: In the example shown above, we have used the jitter flag to reduce the overlap of points with similar values. There are more options that can be configured. Refer to the Seaborn stripplot documentation.
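The idea behind jitter can be sketched in a few lines: each point is shifted sideways by a small random amount so that equal values no longer sit on top of each other. The jitter_width value below is a hypothetical spread chosen for illustration, not seaborn's default.

```python
import numpy as np

rng = np.random.default_rng(42)

# Five made-up observations that all belong to the category at position 0
positions = np.zeros(5)
jitter_width = 0.1   # hypothetical spread, not seaborn's default

# Shift each point sideways by a small random amount
jittered = positions + rng.uniform(-jitter_width, jitter_width, size=5)
print(jittered)
```

Only the position along the categorical axis changes; the values on the numeric axis stay exactly as they are in the data.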

Swarm Plot

A strip plot is a great way to visualize data. But when we have a large amount of data, the data points tend to overlap each other. An alternative is to use a swarm plot, where the data points are arranged so that they don't overlap.

In [14]:
plt.subplot(1,2,1)
sns.swarmplot(x='Street', y='SalePrice', data=housePropertyDataset, hue='SaleCondition')
plt.xticks(rotation=90)

plt.subplot(1,2,2)
sns.stripplot(x='Neighborhood', y='SalePrice', data=housePropertyDataset, hue='SaleCondition')
plt.xticks(rotation=90)

plt.subplots_adjust(right=3)
plt.show()
 

NOTE: There are more options that can be configured. Refer to the Seaborn swarmplot documentation.

Violin Plot

Violin plots are similar to box plots in that they also summarize the range and the median of the data. In addition, they show the estimated density of the data: the distribution is denser where the violin is thicker.

In [27]:
plt.subplot(2,1,1)
sns.violinplot(x='SaleType', y='SalePrice', data=housePropertyDataset)

plt.subplot(2,1,2)
sns.violinplot(x='SaleType', y='SalePrice', data=housePropertyDataset, color='lightgray', inner=None)

sns.stripplot(x='SaleType', y='SalePrice', data=housePropertyDataset, jitter=True, size=1.5)

plt.show()
 

NOTE: We can also generate a combined plot. In the example shown above, the second plot is a combination of a strip plot and a violin plot. There are more options that can be configured. Refer to the Seaborn violinplot documentation.
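The summary values that a box plot or violin plot conveys for each category can also be computed directly with pandas. The rows below are made up and stand in for the SaleType and SalePrice columns.

```python
import pandas as pd

# Made-up rows standing in for the SaleType and SalePrice columns
df = pd.DataFrame({
    'SaleType': ['WD', 'WD', 'WD', 'New', 'New', 'New'],
    'SalePrice': [100000, 150000, 200000, 250000, 300000, 410000],
})

# Minimum, median and maximum sale price for each sale type
summary = df.groupby('SaleType')['SalePrice'].agg(['min', 'median', 'max'])
print(summary)
```

Comparing such a table with the plot is a quick way to check that the violins are being read correctly.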

Joint Plot

As we have seen above, the strip plot, swarm plot and violin plot show the distribution of a single numeric variable within each category. A joint plot is used to show the joint distribution of two numeric variables.

The joint plot shows how the two variables vary together, with a scatter plot in the center and the distribution of each variable along the margins. Older versions of seaborn also annotated the joint plot with the Pearson coefficient and the p-value; recent versions no longer do this by default. You will learn more about the Pearson coefficient and the p-value in the next post.

In [28]:
# jointplot() creates its own figure, so no separate plt.figure() call is needed
sns.jointplot(x='GarageArea', y='SalePrice', data=housePropertyDataset)
plt.xticks(rotation=90)
plt.show()
 
 

NOTE: We can also pass the kind argument to the joint plot to change the type of plot drawn. To learn more about the kind argument, refer to the Seaborn jointplot documentation.
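The Pearson coefficient mentioned above can be computed directly. The sketch below uses made-up values for GarageArea and SalePrice and checks a manual calculation against np.corrcoef(); it does not compute the p-value.

```python
import numpy as np

# Made-up values standing in for GarageArea (x) and SalePrice (y)
x = np.array([200.0, 400.0, 500.0, 700.0, 900.0])
y = np.array([110000.0, 150000.0, 180000.0, 210000.0, 300000.0])

# Pearson r: co-variation of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# np.corrcoef returns the full 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

A value close to 1 means the points lie close to an upward-sloping straight line, and a value close to -1 means a downward-sloping one.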

Heatmaps

A heatmap is useful when we need to find out how one feature depends on another.

In the example shown below, we have plotted a correlation heatmap of the numeric features of the dataset. With the default color map, lighter colors indicate a stronger positive correlation.

In [29]:
numeric_features = housePropertyDataset.select_dtypes(include=[np.number])
sns.heatmap(numeric_features.corr())
plt.title('Correlation heatmap')
plt.show()
 

NOTE: There are more options that can be configured. Refer to the Seaborn heatmap documentation.
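The numbers behind the heatmap are the entries of the correlation matrix, which can be inspected directly. The sketch below uses a small made-up DataFrame; the Age column is a hypothetical feature added only to show a negative correlation.

```python
import numpy as np
import pandas as pd

# Made-up data; 'Age' is a hypothetical column used for illustration
df = pd.DataFrame({
    'GarageArea': [200, 400, 500, 700, 900],
    'Age': [60, 40, 35, 20, 5],
    'SalePrice': [110000, 150000, 180000, 210000, 300000],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave', 'Pave'],  # non-numeric
})

# Keep only the numeric columns, as in the heatmap example
numeric = df.select_dtypes(include=[np.number])
corr = numeric.corr()

# Features ranked by their correlation with SalePrice
ranked = corr['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(ranked)
```

Ranking one row of the matrix like this is often easier to read than the full heatmap when we care about a single target column.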

Pair Plot

A pair plot gives us a quick glimpse of the data by plotting every pairwise combination of the features (columns) in our dataset. Consider the example shown below:

In [30]:
data = housePropertyDataset[['SaleCondition', 'SalePrice','OverallQual',
             'TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']]
sns.pairplot(data, hue='SaleCondition')
plt.show()
 

NOTE: A pair plot only plots the numeric columns of the dataset; a categorical column such as SaleCondition can be used for the hue. There are more options that can be configured. Refer to the Seaborn pairplot documentation.

In this post, we have seen the advantages of using the seaborn library along with the matplotlib library. Seaborn offers many more advanced visualizations that can help us understand our dataset.

I hope you have learned something new from this post.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Tavish Aggarwal is a front-end developer working in Hyderabad. He is very passionate about technology and loves to work in a team.