Pandas: Merging datasets

 

T

he industry is dealing with chunks of data gathered from various sources. To perform analysis of data there is a huge possibility to gather data from various sources and convert into one dataset.

 

To do this Pandas makes our task very easy. Pandas provide various functions using which we can merge datasets seamlessly.

Here, In this post Merging datasets, we will learn how to merge datasets based on various conditions. To merge datasets we will be using an example from Kaggle.

We will be doing analysis on restaurants data based on rating and will find out which restaurant is best. There are few additional datasets which we have used to explain concepts. Please find them on GIT.

I am also providing Python Notebook to follow along with the post and practise of your own.

Merging multiple datasets

There are multiple ways available to merge datasets. Below are ways which pandas provide to merge datasets:

  1. Append
  2. Concat
  3. Join
  4. Merge

We should be very cautious while choosing the way that we are going to use to merge multiple datasets into a single dataset.

Let us start with exploring each of the methods and see the advantages. After exploring each method individually we will focus on difference and will understand when to use which method.

Append

Pandas provide append function which can be used to merge multiple datasets. The example demonstrated below shows the syntax to use the append method:

import pandas as pd

englishScoreData = pd.read_csv('merge_concat_join/english_marks.csv', 
                               index_col='Student')
geographyScoreData = pd.read_csv('merge_concat_join/geography_marks.csv', 
                                 index_col='Student')
mathsScoreData = pd.read_csv('merge_concat_join/maths_marks.csv', 
                             index_col='Student')

display(englishScoreData.append(geographyScoreData).append(mathsScoreData))

To ignore the index:

display(englishScoreData.append(geographyScoreData, ignore_index=True).
        append(mathsScoreData, ignore_index=True))

As simple as it looks. We can merge two datasets easily with using append function. If we need to merge more than two datasets we use a chain of append functions.

Concat

Pandas provide concat function as well to merge multiple datasets. There are ways/types using which we can concat our datasets as listed below:

  1. Using basic concat function
  2. Column-wise concatenation
  3. Concatenating to get the multilevel row index
  4. Concatenating to get multilevel column index

Using basic concat function

Before talking about concat function, let's look at an example demonstrated below:

import pandas as pd

englishScoreData = pd.read_csv('merge_concat_join/english_marks.csv', 
                               index_col='Student')
geographyScoreData = pd.read_csv('merge_concat_join/geography_marks.csv', 
                                 index_col='Student')
mathsScoreData = pd.read_csv('merge_concat_join/maths_marks.csv', 
                             index_col='Student')

## Row wise concatination
display(pd.concat([englishScoreData, geographyScoreData], axis= 0).reset_index())

The concat function accepts an array of datasets that we need to merge into single datasets. Unlike the append method, we can pass as much as datasets that we need to concat to single concat function.

Using concat function, we can concat our datasets row wise or column wise using 'axis' parameter. In the example shown above, we are concatenating data row-wise as we are passing 0 to 'axis' parameter.

Column-wise concatenation

To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 or axis='columns'.

import pandas as pd

englishScoreData = pd.read_csv('merge_concat_join/english_marks.csv', 
                               index_col='Student')
geographyScoreData = pd.read_csv('merge_concat_join/geography_marks.csv', 
                                 index_col='Student')
mathsScoreData = pd.read_csv('merge_concat_join/maths_marks.csv', 
                             index_col='Student')

## Row wise concatination
display(pd.concat([englishScoreData, geographyScoreData], axis= 1))

Earlier we learned how to concat data row-wise. Here in this example, we are concatenating datasets column-wise by providing axis value as 1.

Concatenating to get the multilevel row index

We may have a requirement where we need to have two or more row-level index while concatenating datasets. The concat function accepts 'keys' as a parameter which is used to create multiple row level indexes. Consider an example as shown below:

import pandas as pd

englishScoreData = pd.read_csv('merge_concat_join/english_marks.csv', 
                               index_col='Student')
geographyScoreData = pd.read_csv('merge_concat_join/geography_marks.csv', 
                                 index_col='Student')

## Multi-level row index
display(pd.concat([englishScoreData, geographyScoreData],keys=[
    'English', 'Geography']))

NOTE: Make sure keys length is same as the number of datasets we are concatenating.

In an example shown above, we need to concatEnglishh and Geography marks dataset and we also need to have both of the indexes which will help to slice the dataset.

I have already explained Indexing in Pandas. Check it out if you have missed the post.

If you are following along you know to slice the dataset based on the inner index or outer index, we can do it as shown in below example:

# Get list of selected subjects
display(studentResult.loc[['English', 'Geography']])

## Get all marks of particular student
idx = pd.IndexSlice
display(studentResult.loc[idx[:,'Joy'], :])

NOTE: Create an alias for pd.IndexSlice called idx. A slicer pd.IndexSlice is required when slicing on the inner level of a MultiIndex.

Concatenating to get multilevel column index

We have seen how to concat datasets and have multiple row level indexes. Now let's see how to have multiple column level indexes:

display(pd.concat([englishScoreData, geographyScoreData, mathsScoreData], keys=[
    'English', 'Geography', 'Maths'], axis= 1, join='inner'))

NOTE: An inner join selects the rows which have common indexes.

To get multilevel column indexes in addition to 'keys' parameter we need to pass the 'axis' parameter as well to concat function.

The default value of 'axis' parameter is 0.

That's all I have for usage of append and concat function. It's time for you to practice now and master merging datasets using append and concat function.

It's time to see when to use append and concat functions?

Difference between append and concat

  1. We can merge data vertically using the append method but to merge data horizontally we need to use the concat method.
  2. In order to merge more than two datasets, we need to use a chain of append method. But using a concat method we can pass an array of datasets.

Join

We can also use pandas join method to merge datasets. Please find an example shown below explaining the use of the join keyword:

import pandas as pd

englishScoreData = pd.read_csv('merge_concat_join/english_marks.csv', 
                               index_col='Student')
geographyScoreData = pd.read_csv('merge_concat_join/geography_marks.csv', 
                                 index_col='Student')

display(geographyScoreData.join(englishScoreData, lsuffix='_geography', rsuffix='_english'))

You might have noticed that we are using 'lsuffix' and 'rsuffix' as a parameter in join function. The usage of this is to add a suffix to column header if there are two columns with the same name in both of the datasets.

Sometimes we need to join our datasets on the specific column. In that case, we can pass additional parameter 'on' to join function. Consider the example shown below explaining the joining of datasets on the specific column:

display(userPaymentData.join(userRatingData.set_index('userID'), lsuffix='_userPayment', 
                             rsuffix='_userRating', on='userID', how='inner').dropna())

NOTE: The default value of how is left while merging dataset using join function.

Merge

Merge extends concat and gives us more functionality which merging datasets. The default strategy for pd.merge() is an inner join. The example shown below merge two datasets based on the specified column name:

userBy_payment_rating_merged = pd.merge(userPaymentData, 
                                                userRatingData, on='userID')
print(userBy_payment_rating_merged.head())

Merging on non-matching column label

Consider the scenario where we have two datasets with different column name but has the same data. And we need to merge data on the column.

In order to tackle the above problem, merge function accepts 'left_on' and 'right_on' as a parameter. And we can specify a different column name from each dataset on which we need to merge the datasets.

userBy_cuisine_payment_rating_merged = pd.merge(userCuisineData, userBy_payment_rating_merged, 
                                                left_on='user_ID', right_on='userID')
userBy_cuisine_payment_rating_merged.sort_values('userID', inplace=True)
userBy_cuisine_payment_rating_merged.drop('userID', axis=1, inplace=True)

top_rated_places_users = userBy_cuisine_payment_rating_merged[
    (userBy_cuisine_payment_rating_merged.rating == 2) &
    (userBy_cuisine_payment_rating_merged.food_rating == 2) & 
    (userBy_cuisine_payment_rating_merged.service_rating == 2)]

print(top_rated_places_users.head())

The example demonstrated above merge's the dataset on the user_id column from the userCuisineData dataset and userId column from the userBy_payment_rating _merged dataset.

Merging on multiple columns

We can merge dataset on multiple column names. In order to do so, we can pass an array as a value to right_on and left_on or on the parameter. Let's look at an example as shown below:

# Default merge is inner merge
combined_user_place_data = pd.merge(top_rated_places_users, cuisineData, 
         left_on=['placeID', 'Rcuisine'], right_on=['placeID', 'Rcuisine'])
print(combined_user_place_data.head())
print(combined_user_place_data.info())

Right join

The merge function also accepts 'how' parameter using which we can specify the type of join that we need to perform while merging the datasets. Let's look at an example shown below:

# We can even perform outer, left or right merge as well
# Let's perform right merge now
combined_user_place_left_data = pd.merge(top_rated_places_users, cuisineData, 
         left_on=['placeID', 'Rcuisine'], right_on=['placeID', 'Rcuisine'], how='right')
print(combined_user_place_left_data.head())
print(combined_user_place_left_data.info())

NOTE: The default join for merge function is inner join.

Using on instead of left_on and right_on

If we know the columns names on which we are joining are same from both of the datasets, then we can pass 'on' parameter instead of passing 'left_on' and 'right_on' parameter. Let's look at an example shown below:

# If column names are duplicate we can avoid right_on and left_on
# Instead we can directly use on attribute
combined_user_place_on_data = pd.merge(top_rated_places_users, cuisineData, 
                                       on=['placeID', 'Rcuisine'], how='right')
print(combined_user_place_on_data.head())
print(combined_user_place_on_data.info())

Merge ordered

Before using the merge_ordered function we have to make sure that over two datasets which we want to merge are ordered dataset. The merge_ordered function also behaves same like merge function, but the difference is with using the merge_ordered function we get ordered results.

combined_user_place_prefix_data = pd.merge_ordered(top_rated_places_users, cuisineData, 
         on=['placeID'],suffixes=['_places','_cuisine'], fill_method='ffill')
print(combined_user_place_prefix_data.tail())
print(combined_user_place_prefix_data.info())

NOTE: The default join for the merge_ordered function is outer join contrasting default join of merge function which is inner join.

merge_asof

Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on' column values are less than the left value will be kept. Consider an example shown below:

top_rated_places_users.is_copy = False
top_rated_places_users.sort_values('placeID', inplace=True)

combined_user_place_prefix_data = pd.merge_asof(top_rated_places_users, cuisineData, 
         on=['placeID'],suffixes=['_places','_cuisine'])
print(combined_user_place_prefix_data.head())
print(combined_user_place_prefix_data.info())

Let's conclude the post with the difference between append, concat, join and merge function provided by pandas.

  1. append: We use append function when we have to merge datasets only vertically
  2. concat: We use concat function when we have to merge datasets vertically and horizontally. We can also perform inner and outer join on the indexes.
  3. join: We use join function when we have to perform inner, outer, left and right join on indexes.
  4. merge: It is the most advance versioning of joining datasets. Support multiple join on multiple columns.

Here in this post we have mastered techniques used to merge out data into single dataset. Please share your reviews on post in comment section below.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Tavish Aggarwal is a front-end Developer working in a Hyderabad. He is very passionate about technology and loves to work in a team.