Advanced Concepts of Pandas

Earlier, in the course Python Packages for Data Science, I covered the basics of the pandas package. Here I will be covering some of the advanced concepts that pandas has to offer.

The list of the concepts that I will be covering in this post:

  1. Creating a dataset from lists
  2. Using the plot function of pandas
  3. Pandas with time series
  4. Reindexing the index in pandas
  5. Resampling of data
  6. Chaining and filtering
  7. Grouping of data
  8. Transforming data
    1. Using apply and transform functions
    2. Using filter and map operations

To demonstrate the examples, we have taken the TMDB 5000 Movies dataset from Kaggle.

Before we dive into the above-mentioned concepts, let's see how the numpy library can be used alongside pandas. Below is an example showcasing how pandas interoperates with numpy.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.values)
print(type(tmdbDataSet.values)) # numpy.ndarray

We will learn a lot more about how pandas can interoperate with numpy in this post.
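As a quick taste of that interoperability, numpy ufuncs can be applied directly to pandas columns. Below is a minimal sketch with a toy DataFrame (hypothetical data, since the real columns come from the CSV):

```python
import numpy as np
import pandas as pd

# A tiny stand-in DataFrame for illustration
df = pd.DataFrame({'budget': [100, 400, 900]})

# numpy ufuncs operate element-wise on pandas columns
print(np.sqrt(df['budget']).tolist())  # [10.0, 20.0, 30.0]

# .values exposes the underlying numpy ndarray, as shown above
print(type(df.values))  # <class 'numpy.ndarray'>
```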

Creating a dataset from lists

We can generate a pandas dataset from lists as well. Let's look at an example of how to create a dataset from lists.

import pandas as pd

index = ['Name', 'Salary', 'Age']
details = [['Ashok', 'Mike', 'Arun'], [1200, 1400, 2500], [23, 28, 30]]
zippedList = list(zip(index, details))
dictObject = dict(zippedList)
df = pd.DataFrame(dictObject)
print(df)

We can also update the labels of the data columns. To update the labels, refer to the code shown below:

# Let's rename the column of the dataset which we have generated from the list

df = pd.DataFrame(dictObject)

list_labels = ['Name', 'Salary (In $)', 'Age (In Years)'] # Must match the existing column order
df.columns = list_labels
print(df)

We can also change the column headers at the time of reading the dataset, using the names argument.

import pandas as pd

tmdbDataSet_sub = pd.read_csv('tmdb_5000_movies.csv', header=0, 
names=['budget', 'genres', 'home_page', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'movie_status', 'tagline', 'title', 'vote_average',
       'vote_count'], comment='#')
print(tmdbDataSet_sub.head())

Note: We have to pass the header argument along with the names argument to rename the column labels.

The comment argument tells pandas to ignore the rest of any line after the '#' symbol.

After reading the file we can export the file to csv or to excel. Also, we can specify if we need indexes to be exported to file.

tmdbDataSet.to_csv('tmdb_movies.csv', index=False)

tmdbDataSet.to_excel('tmdb_movies.xlsx', index=False)

Using the plot function of pandas

Pandas provides data visualization by interoperating with the matplotlib library. We can plot all the columns of the dataset using the .plot function. Let's look at the example shown below:

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

tmdbDataSet.plot()
plt.show()

Using the above code we can plot all columns on the same graph. But the output looks a bit cluttered, and it is also hard to interpret if the columns we are plotting are on different scales. Let's look at an alternative approach to plotting the columns:

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

tmdbDataSet.plot(subplots=True)
plt.show()

You would have noticed that this time we are passing the subplots argument to the plot function. It plots each column on a separate graph, making the output much easier to understand.

We can also plot selected columns from the dataset.

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

columnList = ['vote_count','budget']
tmdbDataSet[columnList].plot(subplots=True)
plt.show()

From the output of the above graph, one thing to observe is how the pandas plot function puts the index values of the DataFrame on the x-axis. Now let's see how we can manually set the values plotted on the x-axis.

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

tmdbDataSet.plot(x='production_companies', y=['budget'])
plt.show()

We can even generate a box plot or a scatter plot using the pandas plot function.

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

cols = ['vote_count','budget']

tmdbDataSet[cols].plot(kind='box', subplots=True)

plt.show()

tmdbDataSet.plot(kind='scatter', x='budget', y='vote_count', s=tmdbDataSet.popularity)
plt.show()

The scatter plot accepts the argument s for the size of each circle in the plot.

We can also plot a histogram using pandas. A normalized histogram approximates the PDF (probability density function); its output shows us how likely the values we are plotting on the x-axis are to occur.

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

tmdbDataSet.plot(kind='hist', y='vote_average')
plt.show()

There is an additional parameter called cumulative that we can pass to tell the histogram whether to plot the PDF or the CDF (cumulative distribution function).

The output of a CDF plot depicts the probability of observing a value less than or equal to the value on the x-axis.

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

tmdbDataSet.plot(kind='hist', y='vote_average', density=True) # density replaces the removed normed argument
plt.show()


# Observations
# The output depicts probability of having different vote_averages

NOTE: We can plot charts in different rows and columns using the code below:

figure, axes = plt.subplots(nrows=2, ncols=1)

tmdbDataSet.plot(ax=axes[0], kind='hist', y='vote_average')
tmdbDataSet.plot(ax=axes[1], kind='hist', y='vote_average', density=True, 
cumulative=True)

plt.show()

# Observations
# The 1st plot is the PDF, as we have seen earlier.
# The 2nd plot is the CDF.

Here we have declared the axes, and we then use the ax argument to plot the various graphs together.

Pandas with time series

If we have a date column in our dataset and we want to use it as the index, pandas provides the index_col and parse_dates parameters to do so.

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv', index_col='release_date', parse_dates=True)

print(tmdbDataSet.head())
print(tmdbDataSet.loc['2010-Aug-01'])
print(tmdbDataSet.loc['2010-01-01 21:00:00':'2010-05-11 22:00:00'])

The advantage of parsing the date column and making it the index is that we can filter results based on the date. As an example, we are filtering results by 1st August 2010.

Also, we can leverage the loc function to select the rows within a date range.

We can also parse a list of date strings. Pandas provides the to_datetime function to do so.

import pandas as pd

print(pd.to_datetime(['2010-01-01 21:00:00','2010-05-11 22:00:00'], format='%Y-%m-%d %H:%M:%S'))

Plotting charts with time series column

If the index is a date column, pandas will automatically adjust the plot when we call the plot function.

import pandas as pd
import matplotlib.pyplot as plt

tmdbDataSet_date_index = pd.read_csv('tmdb_5000_movies.csv', index_col='release_date', 
                                           parse_dates=True)

tmdbDataSet_date_index.vote_count.plot()
plt.show()

Reindexing the index in pandas

Reindexing of the dataset is required when we are combining multiple datasets. We can also tell pandas how we want to fill in the missing values, using the method argument of the reindex function.

# Without using any fill technique
dataSet = dataSet2.reindex(dataSet1.index)

# Using forward fill
dataSet = dataSet2.reindex(dataSet1.index, method="ffill")

# Using backward fill
dataSet = dataSet2.reindex(dataSet1.index, method="bfill")
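Since dataSet1 and dataSet2 above are placeholders, here is a self-contained sketch with a toy Series showing what each fill method does:

```python
import pandas as pd

# Toy Series with a sparse integer index
s = pd.Series([10, 20, 30], index=[0, 2, 4])
new_index = [0, 1, 2, 3, 4]

# Without any fill technique: missing labels become NaN
print(s.reindex(new_index).tolist())

# Forward fill: propagate the last valid value forward
print(s.reindex(new_index, method='ffill').tolist())  # [10, 10, 20, 20, 30]

# Backward fill: take the next valid value
print(s.reindex(new_index, method='bfill').tolist())  # [10, 20, 20, 30, 30]
```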

Indexes are immutable; we cannot update an individual index value of the dataset. If we try to execute the command below, we will get an error:

tmdbDataSet.index[0] = 24654

# Output
# TypeError: Index does not support mutable operations

There is an alternative way: we can replace the index of the whole dataset:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

tmdbDataSet.index = [x * 2 for x in range(0, 4803)] # Using a list comprehension

The above code will execute successfully as the tmdbDataSet dataset has 4803 rows and we are updating an index of all the rows.

We can also give a name to index using the command shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

print(tmdbDataSet.index.name) # None

tmdbDataSet.index.name = 'movie_index'
print(tmdbDataSet.index.name) # movie_index

Hierarchical Indexes

We can even declare hierarchical indexes on the dataset. A hierarchical index means more than one index level on the given dataset.

To set multiple indexes on the dataset, we need to use the set_index function. We can sort the created indexes using the sort_index function, as shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

# Adding an index on the release_date and status columns
tmdbDataSet_multi_index = tmdbDataSet.set_index(['release_date', 'status'])

# Sorting the index columns
tmdbDataSet_multi_index = tmdbDataSet_multi_index.sort_index()

The benefit of adding multiple indexes is that we can slice the data easily. Let's see an example shown below:

# To get list of all movies released on 1916-09-04
print(tmdbDataSet_multi_index.loc[('1916-09-04','Released')])

# To get list of all movies released on 1916-09-04 and 2010-03-03
print(tmdbDataSet_multi_index.loc[(['1916-09-04', '2010-03-03'],'Released'), :])

# To get list of all rumored movies
print(tmdbDataSet_multi_index.loc[(slice(None), 'Rumored'), :])

NOTE: While creating hierarchical (multiple) indexes, we need to mind the order of the index levels. In the example shown above, if we need the list of rumored movies, which is matched on the second level, we still need to pass something for the first level. To do so, we use slice(None).

We can also swap the indexes in our dataset. And it is also possible to reset the indexing of the dataSet to the original.

# swapping index of multilevel index
tmdbDataSet_multi_index_swap = tmdbDataSet_multi_index.swaplevel(0,1)
print(tmdbDataSet_multi_index_swap.head())

# Resetting Index
tmdbDataSet_original = tmdbDataSet_multi_index_swap.reset_index()
print(tmdbDataSet_original.head())

If our dataset has multiple indexes, we can't pivot it directly. But we can pivot after stacking or unstacking the dataset. We will learn more about this in future posts.
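As a brief taste in the meantime, here is a toy sketch (hypothetical data, not the movie dataset) showing how unstack moves an inner index level into the columns, giving a pivot-like shape:

```python
import pandas as pd

# Toy multi-indexed data for illustration
df = pd.DataFrame({
    'date': ['2010-01-01', '2010-01-01', '2010-01-02'],
    'status': ['Released', 'Rumored', 'Released'],
    'budget': [100, 50, 200],
}).set_index(['date', 'status'])

# unstack pivots the inner index level ('status') into columns;
# missing combinations become NaN
print(df['budget'].unstack())
```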

Resampling of data

We can also resample our dataset in pandas. Resampling means changing the frequency of time-series observations, combined with a statistical method to calculate the results. There are two types of resampling:

Downsampling: Reduce the number of rows in our dataset. For example, consider a dataset that records the humidity every day. You can downsample it to a dataset that records the humidity every month. Here, you are reducing the number of rows in the dataset.
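The humidity example can be sketched with hypothetical toy data before we turn to the movie dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical daily humidity readings over two months (toy data)
days = pd.date_range('2020-01-01', periods=60, freq='D')
humidity = pd.Series(np.linspace(40, 60, 60), index=days)

# Downsample the daily readings to one (mean) value per month
monthly = humidity.resample('M').mean()
print(monthly)  # two rows: one for January, one for February
```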

import pandas as pd

tmdbDataSet_date_index = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv', 
                                     parse_dates=True, index_col='release_date')

print(tmdbDataSet_date_index.resample('A').sum()) # Downsampling data yearly

print(tmdbDataSet_date_index.resample('A').sum().count()) 
# Downsampling data yearly with chaining

Upsampling: Increase the number of rows in our dataset. For example, again consider a dataset that records the humidity every day. You can upsample it to every hour. Here, you are increasing the number of rows in the dataset.

NOTE: It is always advisable to combine resampling with a statistical method like mean, count, sum, etc.

import pandas as pd

tmdbDataSet_date_index = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv', 
                                     parse_dates=True, index_col='release_date')

print(tmdbDataSet_date_index.resample('H').sum()) # Upsampling data hourly
print(tmdbDataSet_date_index.resample('4H').sum()) # Upsampling data every 4 hours

Chaining and filtering

With pandas, while resampling or filtering data, we can also chain methods to get the desired results. Consider the example shown below:

import pandas as pd

tmdbDataSet_date_index = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv', 
                                     parse_dates=True, index_col='release_date')

# Downsampling data yearly with chaining
print(tmdbDataSet_date_index.resample('A').sum().count()) 

Grouping of data

Pandas provides a groupby function with which we can group our dataset. We can do several operations with the groupby function, like selecting a particular column, grouping by multiple columns, performing multiple aggregations, and a lot more. Consider the example shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

# Count of dataset on each day
print(tmdbDataSet.groupby('release_date').count())

# Sum of budget made on each day by 5000 movies
print(tmdbDataSet.groupby('release_date')['budget'].sum())

# Grouping by multiple columns and selecting certain columns
print(tmdbDataSet.groupby(['release_date', 'runtime'])[['popularity', 'budget']].sum())

# Multiple aggregations
print(tmdbDataSet.groupby(['release_date', 'runtime'])[['popularity', 'budget']]
.agg(['sum', 'count']))

Transforming data

We can even transform our dataset based on the desired output that we want. We can transform data by dividing, multiplying, converting a string to upper case, etc., using pandas. Let's look at the example shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

# Integer-divide vote_average by 2
print(tmdbDataSet.vote_average.floordiv(2))

# Convert column to upper case
print(tmdbDataSet.status.str.upper()) 

# Drop column from the dataset
tmdbDataSet_dropped = tmdbDataSet.drop(['production_countries'], axis='columns')
print(tmdbDataSet_dropped.head())

To perform more advanced transformations on our datasets, we can use the apply and transform functions. Below are examples describing the transform and apply functions:

Using transform Function

We can transform our dataset using the transform function. Let's look at an example to understand the transform function better:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

grouped_dataset = tmdbDataSet.groupby(['original_language'])

def find_en_lang(series):
    if series.values[0] == 'en':
        return 'english'
    else:
        return 'non-english'

tmdbDataSet.loc[:,'original_language_en'] = grouped_dataset.original_language.transform(
    find_en_lang)
display(tmdbDataSet.head())

In the example shown above, we are using the original_language column to create a new column in the dataset with the values 'english' or 'non-english'.

Using apply Function

Very similar to the transform function, apply lets us run any function that transforms the data. Let's look at an example to understand the apply function better:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

## Converting runtime to hours
tmdbDataSet.loc[:,'runtime_hours'] = tmdbDataSet.apply(lambda x:x['runtime']/60, axis=1)
display(tmdbDataSet.head())

In the example shown above, we are converting the runtime of the movie, which is in minutes, to hours. We are using a lambda function to do so.

Difference between transform and apply function

The key difference between transform and apply function is:

  1. Apply function passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function.
  2. The custom function passed to apply can return a scalar, or a Series or DataFrame. The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.
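A toy groupby (hypothetical data, not the movie dataset) makes this difference concrete:

```python
import pandas as pd

# Toy grouped data for illustration
df = pd.DataFrame({'group': ['a', 'a', 'b'],
                   'x': [1, 2, 3],
                   'y': [10, 20, 40]})
g = df.groupby('group')

# apply: the custom function sees each group as a whole DataFrame,
# so it can combine columns and return a single scalar per group
totals = g.apply(lambda frame: frame['x'].sum() + frame['y'].sum())
print(totals.tolist())  # [33, 43]

# transform: the custom function sees one column (a Series) at a time
# and must return values with the same length as the group
centered = g.transform(lambda col: col - col.mean())
print(centered['x'].tolist())  # [-0.5, 0.5, 0.0]
```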

Using Filter function

We can also filter down our results and perform transformations on the filtered results. Let us look at the example shown below to understand it:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

grouped_dataset = tmdbDataSet.groupby(['original_language'])

# Movies in languages whose total budget is greater than 25cr
display(grouped_dataset.filter(lambda x: x['budget'].sum() > 250000000))

In the example shown above, we are filtering the results based on language. We only want results where the sum of the budget for a specific original_language is more than 25cr.

Using map function

Till now we have seen groupby operations performed on the columns of the dataset. What if we want to group by our very own condition? We can use the map function in such a scenario. Consider the example shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

language = (tmdbDataSet['original_language'] == 'en').map({True:'English', 
                                                           False:'Non-English'})
display(tmdbDataSet.groupby(language)[['budget']].mean())

In the example shown above, we are performing the groupby based on our own condition. We are grouping our dataset into 'English' and 'Non-English' groups.

Aggregating dataset

Aggregating based on multiple measures

We can also aggregate column values in the dataset. Pandas provides the agg function, with which we can perform aggregations. Consider the example shown below:

import pandas as pd

tmdbDataSet_date_index = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv', 
                                     parse_dates=True, index_col='release_date')

# Popularity of a movie depends on vote_average, revenue - budget, and popularity
tmdbDataSet_date_index['profit'] = (tmdbDataSet_date_index['revenue']
                                    - tmdbDataSet_date_index['budget'])
tmdbDataSet_date_index_grouped = tmdbDataSet_date_index.groupby('title')
tmdbDataSet_date_index_grouped_sub = tmdbDataSet_date_index_grouped[['vote_average',
                                                                     'profit', 'popularity']]

# Max and min value in columns
agg_results = tmdbDataSet_date_index_grouped_sub.agg(['max', 'min'])
print(agg_results)

# Observation
# The result depicts the max and min value for each movie title
# Also, you might have noticed the min and max values are the same.
# This is because each movie title has only one row in the dataset.
# Can you claim which movie is the most popular and profitable?

Aggregating based on dictionary object

Using the agg function, we can also aggregate our results based on a dictionary object. This means we can have a different aggregation measure for each column in the dataset. Let's see an example:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

# Extract year of release_date and set release_date column as index
tmdbDataSet_agg = tmdbDataSet.copy()
tmdbDataSet_agg['year'] = pd.to_datetime(tmdbDataSet_agg['release_date']).dt.year
tmdbDataSet_agg['year']= tmdbDataSet_agg['year'].fillna(0.0).astype(int)
tmdbDataSet_agg.set_index('release_date', inplace=True)
tmdbDataSet_agg['profit'] = tmdbDataSet_agg['revenue'] - tmdbDataSet_agg['budget']

def countMovies(series):
    return series.count()
aggObject = {'profit':'sum', 'title': countMovies }
tmdbDataSet_grouped = tmdbDataSet_agg.groupby(['year']).agg(aggObject)

# Observations
# Here we are trying to calculate the number of movies released in each year
# and to find the year which has the most profit
# Do you know in which year the most movies were released?

Let's plot the chart and see the trend in movie profit and the number of releases over the years.

# Cleaning up the data
tmdbDataSet_grouped = tmdbDataSet_grouped.drop([0], axis='rows')

# Adding a new column, same as the index column, to plot the scatter chart
tmdbDataSet_grouped['y'] = list(tmdbDataSet_grouped.index)

# Plotting scatter chart
tmdbDataSet_grouped.plot(kind='scatter', y='y', x='profit', s=tmdbDataSet_grouped.title)
plt.show()


#Observations
# Do you see any trend in profit and release count over the years?
# Do share your observations in comment section

In the above post, we have extensively explored the pandas Python package with the help of an example dataset from Kaggle. I would advise you to explore the dataset further and share your opinions in the comments section.

Also, for your practice, I am sharing my Jupyter notebook link, which you can refer to and follow along with the post.

Happy Learning!

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Tavish Aggarwal is a front-end developer working in Hyderabad. He is very passionate about technology and loves working in a team.