Python packages for Data Science

It's really important to know which packages are useful when we are solving a data science problem. In this post I will talk about packages that are available in Python and are used to solve data-related problems.

What are packages and why are they needed?

Adding everything to the Python core would make a mess of it, so packages were introduced.

A package is nothing but a bunch of script files that together try to solve a specific problem.

There are thousands of packages written by developers like us. Here we will only talk about the packages that are used for data science.

If you have been following along, I have already introduced some of the packages that we have for data science: link.

For a quick recap, some of the packages that we will be talking about in this post are:

  1. Numpy
  2. Matplotlib
  3. Pandas
  4. scikit-learn

To use the above packages we need to install them on our system. To install packages we need pip, which is the package installer for Python.

To install pip on your system follow the link.

Once you have pip installed, run the command below to install a package:

pip3 install numpy

The above command installs the numpy package on our system.

Now that we have the packages installed on our system, it's time to use them in our Python scripts. To use them in a script we first need to import them.

The way we import packages in our script is:

import numpy 

Now to use a numpy function we can write the following code:

numpy.array([1,2])

Another way is:

import numpy as ny

ny.array([1,2])

But if we want to use the array() function directly, we can use the following script:

from numpy import array

array([1,2])

Now that we are clear about packages and how to use them in our code, it's time to explore what functionality the Python packages mentioned above have to offer.

Introduction to Numpy

The numpy package provides an alternative to lists: the numpy array. A numpy array is more powerful and has a lot more features to offer than a Python list.

Suppose we have two numpy arrays as shown below:

import numpy as ny

a1 = ny.array([1,2,3])
a2 = ny.array([4,5,6])

a1 * a2

#Output:
array([ 4, 10, 18])

But if you try to do it with a normal list:

a1 = [1,2,3]
a2 = [4,5,6]

a1 * a2

Output:
TypeError: can't multiply sequence by non-int of type 'list'

This shows the power of the numpy package: arithmetic on numpy arrays works element by element, which plain lists do not support.
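
To make the contrast concrete, here is a small sketch of what plain lists actually do with the * operator, and how the element-wise product would have to be written by hand. The variable names (repeated, products, np_products) are only illustrative, and numpy is imported under the common alias np.

```python
import numpy as np

a1 = [1, 2, 3]
a2 = [4, 5, 6]

# Multiplying a list by an integer repeats the list; it does not scale the values.
repeated = a1 * 2

# Element-wise multiplication of two plain lists needs an explicit loop.
products = [x * y for x, y in zip(a1, a2)]

# With numpy arrays the same result is a single expression.
np_products = np.array(a1) * np.array(a2)

print(repeated)     # [1, 2, 3, 1, 2, 3]
print(products)     # [4, 10, 18]
print(np_products)  # [ 4 10 18]
```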

NOTE: A numpy array assumes that all of its elements have the same data type.

We can also subset a numpy array:
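
One consequence of this rule is that numpy converts mixed input to a single common type. The sketch below is only an illustration of that behaviour; the variable names are made up for the example.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # integers and a float mixed
mixed_text = np.array([1, "two", 3])    # numbers and a string mixed

# The integers are promoted to floats so every element shares one type.
print(mixed_numbers.dtype)  # float64
print(mixed_numbers)        # [1.  2.5 3. ]

# With a string in the input, every element is stored as text.
print(mixed_text.dtype.kind)  # U (unicode string)
print(mixed_text)             # ['1' 'two' '3']
```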

import numpy as ny

a1 = ny.array([1,2,3])
print(a1[0])

#Output:
1

We can also create a 2D array with numpy. The syntax to create a 2D numpy array is:

import numpy as ny
a1 = ny.array([[1,2,4], [2,3,5]])

We can also see the number of rows and columns in our array.

a1.shape

Output:
(2, 3) # where 2 is the number of rows and 3 is the number of columns

We can perform operations like addition and multiplication in the same way as with a 1D numpy array.

import numpy as ny

a1 = ny.array([[1,2,4], [2,3,5]])
a2 = a1 + a1

#Output:
array([[ 2,  4,  8],
       [ 4,  6, 10]])
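
Multiplication follows the same element-by-element rule. As a minimal sketch (the names squared and doubled are only illustrative), the code below multiplies the 2D array by itself and by a single number:

```python
import numpy as np

a1 = np.array([[1, 2, 4], [2, 3, 5]])

# Array times array: each element is multiplied by the element in the same position.
squared = a1 * a1

# Array times a single number: every element is multiplied by that number.
doubled = a1 * 2

print(squared)
# [[ 1  4 16]
#  [ 4  9 25]]
print(doubled)
# [[ 2  4  8]
#  [ 4  6 10]]
```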

We can also subset the 2D array in the same way as we do with lists or 1D numpy arrays.

import numpy as ny

a1 = ny.array([[1,2,4], [2,3,5]])

print(a1[0][2]) # where 0 is the row and 2 is the column

#Output:
4

Another syntax that we can use is:

print(a1[0, 2])

The above code will also result in the same output.

If we want to check the first row of the 2D array we can do it as:

print(a1[0])

Output:
[1 2 4]
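
The comma syntax also accepts slices, which is how a column (rather than a row) can be selected. This is a small sketch with illustrative variable names:

```python
import numpy as np

a1 = np.array([[1, 2, 4], [2, 3, 5]])

# ':' means "every row"; 0 picks the first column.
first_column = a1[:, 0]

# Every row, columns from index 1 to the end.
last_two_columns = a1[:, 1:]

# Second row only, first two columns.
second_row_start = a1[1, :2]

print(first_column)      # [1 2]
print(last_two_columns)
# [[2 4]
#  [3 5]]
print(second_row_start)  # [2 3]
```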

Basic Statistics using numpy

We can easily perform basic statistical operations like mean, median, standard deviation and a lot more.

The first step to perform a numpy operation is to import the numpy package:

import numpy as ny

person = ny.array([20,50,60,40,50]) # weight of person

#To find out mean
person_mean = ny.mean(person)

#To find median
person_median = ny.median(person)

# To find standard deviation
person_sd = ny.std(person)

#To find correlation
person = ny.array([[20,2.6], [50,6], [60, 5.5], [40, 4], [50, 5]]) # Where values depict weight and height respectively

ny.corrcoef(person[:,0],person[:,1])

There are a lot more operations, like sum() and sort(), that are available in numpy.

import numpy as ny

person = ny.array([20,50,60,40,50]) # weight of person

person_sum = ny.sum(person)

person_sort = ny.sort(person)

There is a lot more that numpy can do. The above demonstration is meant to get you started with numpy operations. To know more about it you can visit a link.

Logical Operators with numpy

Numpy offers logical operators like logical_or and logical_and. Refer to the example shown below:
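
For reference, here is a sketch that prints what these functions return for the sample weights used above. One detail worth knowing is that ny.std divides by the number of values by default; passing ddof=1 gives the sample standard deviation instead. The variable names are illustrative.

```python
import numpy as np

person = np.array([20, 50, 60, 40, 50])  # weights

print(np.mean(person))    # 44.0
print(np.median(person))  # 50.0
print(np.sum(person))     # 220
print(np.sort(person))    # [20 40 50 50 60]

# Population standard deviation (default) versus sample standard deviation.
population_sd = np.std(person)
sample_sd = np.std(person, ddof=1)
print(population_sd, sample_sd)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two inputs.
weight_height = np.array([[20, 2.6], [50, 6], [60, 5.5], [40, 4], [50, 5]])
corr = np.corrcoef(weight_height[:, 0], weight_height[:, 1])
print(corr[0, 1])
```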

import numpy as np

science_score = np.array([18.0, 20.0, 11.75, 19.50])
math_score = np.array([14.6, 14.0, 18.25, 19.0])

# Which science scores are greater than 13 or less than 10
print(np.logical_or(science_score > 13, science_score < 10))

# output:
[ True  True False  True]

# Which math scores are between 13 and 15
print(np.logical_and(math_score > 13 , math_score < 15))

# output
[ True  True False False]
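
The True/False arrays above become more useful when they are used as a filter. The sketch below, with illustrative variable names, uses such an array (often called a mask) to pick out the matching scores, and shows that the & operator builds the same mask as logical_and:

```python
import numpy as np

science_score = np.array([18.0, 20.0, 11.75, 19.50])
math_score = np.array([14.6, 14.0, 18.25, 19.0])

# Boolean mask: True where the math score is between 13 and 15.
mask = np.logical_and(math_score > 13, math_score < 15)

# Indexing with the mask keeps only the elements where it is True.
print(math_score[mask])     # [14.6 14. ]

# The same mask can filter another array of the same length.
print(science_score[mask])  # [18. 20.]

# The & operator is an equivalent spelling; the parentheses are required.
same_mask = (math_score > 13) & (math_score < 15)
```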

Looping over numpy array

We can even loop over a 1D or 2D numpy array. Looping over a 1D numpy array is as simple as looping over a list. But if we want to visit every element of a 2D numpy array one by one, we can use the nditer function provided by numpy, because a plain for loop over a 2D array yields whole rows rather than single elements. Consider the example shown below:

# Looping over 1D array
import numpy as np

population = np.array(['500','600','550','700'])

for x in population :
    print(str(x) + ' billion')

# output
500 billion
600 billion
550 billion
700 billion


# Looping over 2D array
import numpy as np

population = np.array([['india','500'], ['china','600'], ['australia','550'], ['france','700']])

for x in np.nditer(population):
    print(x)

# output
india
500
china
600
australia
550
france
700
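
When the rows themselves matter, a plain for loop is an alternative to nditer, since it hands back one row at a time. This is a minimal sketch; the names country, count and lines are only illustrative.

```python
import numpy as np

population = np.array([['india', '500'], ['china', '600'],
                       ['australia', '550'], ['france', '700']])

lines = []
# Each iteration yields one row, which is unpacked into its two values.
for country, count in population:
    lines.append(f"{country}: {count}")

for line in lines:
    print(line)
# india: 500
# china: 600
# australia: 550
# france: 700
```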

Introduction to Matplotlib

Now we know how to explore a dataset and get the required information from it using numpy. But is it possible to spot the trend or pattern that your data follows without visualization? In my opinion, the answer is definitely no; it is really challenging to do that without visualization.

We have matplotlib to help us visualize our data.

To install matplotlib we need to use pip again:

pip3 install matplotlib

After installing we need to import it before using it:

import matplotlib.pyplot as pyt

population = [123, 243, 456, 690] # in million
year = [1995, 1996, 1997, 1998]

pyt.plot(year, population) # year on x-axis and population on y-axis
pyt.show() # To show the plot

NOTE: The plot() function is used to create a line chart.

In a similar way, there are other types of charts that we can create. For example, let's try a scatter chart.

pyt.scatter(year, population)

pyt.grid(True) # To show grid in graphs

pyt.show()

It will show the scatter chart.

Histogram

What about the histogram? I think this type of chart is one of the best ways to visualize how the data is distributed. Let's get started and create a histogram:

import matplotlib.pyplot as pyt

age = [20, 56, 67, 40, 89, 90, 23, 45, 68, 23, 11, 18]

pyt.hist(age, bins=3)
pyt.show()

A histogram works on the basis of bins: the number of bins we define is the number of bars we get.

With bins=3, three bars are generated. That is how a histogram works in matplotlib.

We can pass additional arguments to histograms. To know more try:

help(pyt.hist)

NOTE: If we do not mention bins, matplotlib will use 10 by default.

Now we know how to create a visualization of the data. Let's see how we can make our charts more expressive.

We can add labels and a title to our chart before showing it, as shown below:
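
To see which values end up in which bar, the binning can be reproduced without drawing anything. The sketch below uses numpy's histogram function, which the matplotlib documentation says hist relies on for binning; the variable names counts and edges are illustrative.

```python
import numpy as np

age = [20, 56, 67, 40, 89, 90, 23, 45, 68, 23, 11, 18]

# Split the range from the smallest to the largest age into 3 equal-width bins
# and count how many values fall into each one.
counts, edges = np.histogram(age, bins=3)

print(counts)  # [5 3 4]
print(edges)   # bin boundaries, from 11 up to 90
```

The three counts are the heights of the three bars, and together they add up to the 12 ages in the list.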

import matplotlib.pyplot as pyt

population = [123, 243, 456, 690] # in million
year = [1995, 1996, 1997, 1998]

pyt.plot(year, population)
pyt.xlabel('year')
pyt.ylabel('population')
pyt.title('Graph showing population over year')

pyt.show()
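
If the chart needs to end up in a file rather than in a window, savefig can be used in place of show. The following is only a sketch: the output path is a hypothetical location in the system's temporary folder, and the Agg backend is selected on the assumption that the script may run without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as pyt

population = [123, 243, 456, 690]  # in million
year = [1995, 1996, 1997, 1998]

pyt.plot(year, population)
pyt.xlabel('year')
pyt.ylabel('population')
pyt.title('Graph showing population over year')

# Hypothetical output location; any writable path works.
output_path = os.path.join(tempfile.gettempdir(), 'population.png')
pyt.savefig(output_path)  # write the chart to disk as a PNG image
pyt.close()               # release the figure once it has been saved
```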

There are some other customizations as well that we can do to our visualizations. To know more about matplotlib: link.

Introduction to Pandas

We generally have a huge amount of data, and to handle it we have already explored numpy. But is numpy capable of handling all of it? The answer is yes, but with a limitation: numpy handles data well only if all of it is of the same type. In the real world, datasets usually mix types, and that is where pandas is needed.

Pandas stores data in the form of data frames. We can even import a CSV file using pandas. Let's explore data frames further and see how to store data in them.

How to create a data frame out of a dictionary?

I have a dictionary as shown in the code below, and I have created a data frame out of it using pandas.

import pandas as ps

countryData = {
       'country': ['india', 'australia'],
       'population': [500, 600]
}

ps.DataFrame(data=countryData)

# output
     country  population
0      india         500
1  australia         600

To change the index of the above data frame:

data = ps.DataFrame(data=countryData)
data.index = ['IN', 'AUS']
print(data)

# output
      country  population
IN       india         500
AUS  australia         600

To select columns of the data frame:

data = ps.DataFrame(data=countryData)
data.index = ['IN', 'AUS']

data['country'] # It will output the pandas series
data[['country']] # It will output pandas dataFrame
data[['country', 'population']] # It is used to select more than 1 column

data[0:2] # It will output first 2 observations (rows)
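
The difference between single and double square brackets is easy to check by looking at the type of the result. A minimal sketch, with pandas imported under the common alias pd and illustrative variable names:

```python
import pandas as pd

countryData = {
    'country': ['india', 'australia'],
    'population': [500, 600]
}

data = pd.DataFrame(data=countryData)
data.index = ['IN', 'AUS']

series = data['country']    # single brackets: one column as a Series
frame = data[['country']]   # double brackets: a DataFrame with one column

print(type(series).__name__)  # Series
print(type(frame).__name__)   # DataFrame
print(frame.shape)            # (2, 1): two rows, one column
```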

To select a row in a DataFrame we use loc or iloc. The difference between the two is: iloc selects by integer position, whereas loc selects by index label. In the example below the index labels are strings, so loc expects those strings:

data = ps.DataFrame(data=countryData)
data.index = ['IN', 'AUS']

data.iloc[1] # will return 2nd row
data.iloc['IN'] # error

data.loc[1] # error
data.loc['IN'] # return 1st row

data.loc[['IN','AUS']] # similar to column selection
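
Both loc and iloc also accept a row and a column together, which returns a single cell. The sketch below illustrates this, and also shows that loc works with integers when the index labels happen to be integers (the default index). Variable names are illustrative.

```python
import pandas as pd

countryData = {
    'country': ['india', 'australia'],
    'population': [500, 600]
}

data = pd.DataFrame(data=countryData)
data.index = ['IN', 'AUS']

# loc: row label and column label.
print(data.loc['IN', 'population'])  # 500

# iloc: row position and column position (both start at 0).
print(data.iloc[1, 0])               # australia

# With string labels, an integer passed to loc is not a known label.
try:
    data.loc[1]
except KeyError:
    print('1 is not a label in this index')

# With the default index the labels are 0, 1, ... so loc accepts integers.
default_index = pd.DataFrame(data=countryData)
print(default_index.loc[1, 'country'])  # australia
```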

To import a CSV file as a data frame we use the read_csv() function provided by pandas:

import pandas as pds

population = pds.read_csv('./population.csv', index_col=0)

NOTE: index_col is set to 0 to tell the read_csv function to use the first column of the CSV file as the row index (row labels). The first row of the file is treated as the header by default.

Most of the time we will be dealing with large datasets, and it is not good practice to read all the data in one go, because the whole file then has to fit in memory. With pandas we can read data in chunks using the chunksize parameter, as demonstrated below:
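
To see what index_col=0 changes, the sketch below reads the same CSV text twice, once with and once without it. The CSV content is made up for the example and is read from a string through io.StringIO so that no file is needed.

```python
import io

import pandas as pd

# Made-up CSV content: a header row followed by two data rows.
csv_text = "code,country,population\nIN,india,500\nAUS,australia,600\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row labels.
print(with_index.index.tolist())       # ['IN', 'AUS']
print(with_index.columns.tolist())     # ['country', 'population']

# Without it, rows are labelled 0, 1, ... and 'code' stays a normal column.
print(without_index.index.tolist())    # [0, 1]
print(without_index.columns.tolist())  # ['code', 'country', 'population']
```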

import pandas as pd

sub_dataset = pd.read_csv('test.csv', chunksize=10)

# on demand data
print(next(sub_dataset))
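
The object returned with chunksize can also be used directly in a for loop, which is handy for computing a result piece by piece. Here is a sketch with made-up data (25 rows in a single column named value), read from a string so that no file is needed:

```python
import io

import pandas as pd

# Made-up CSV content: a header and the numbers 1 to 25, one per row.
csv_text = "value\n" + "\n".join(str(n) for n in range(1, 26))

total = 0
chunk_sizes = []

# Each chunk is a normal DataFrame holding at most 10 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    chunk_sizes.append(len(chunk))
    total += chunk['value'].sum()

print(chunk_sizes)  # [10, 10, 5]
print(total)        # 325
```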

Logical operations on pandas

Just as we can perform logical operations on a numpy array, we can perform logical operations on a pandas data frame:

import pandas as ps

countryData = { 
              'country': ['India', 'China', 'France'], 
              'population': [560, 880, 990], 
              'averageRate': [15, 37, 18] 
}

dataSet = ps.DataFrame(data=countryData)

Suppose I want to keep only the rows that have an averageRate above 16.

The first step is to select the column averageRate:

dataSet['averageRate']

The next step is to apply the condition:

dataSet['averageRate'] > 16

The last step is to use that condition to select the matching rows of the data frame:

dataSet[dataSet['averageRate'] > 16]

# output:
   country  population  averageRate
1    China         880           37
2   France         990           18

We can also combine conditions with logical and/or in pandas, using numpy. Suppose I want the rows that have an averageRate above 16 and below 30:

import pandas as ps
import numpy as np

countryData = { 
              'country': ['India', 'China', 'France'], 
              'population': [560, 880, 990], 
              'averageRate': [15, 37, 18] 
}

dataSet = ps.DataFrame(data=countryData)
dataSet[np.logical_and(dataSet['averageRate'] > 16, dataSet['averageRate'] < 30)]

# output
   country  population  averageRate
2   France         990           18
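
Pandas also supports the & (and) and | (or) operators directly on conditions, so numpy is not strictly required here. In this sketch each condition must be wrapped in parentheses; the names in_range and either are illustrative.

```python
import pandas as pd

countryData = {
    'country': ['India', 'China', 'France'],
    'population': [560, 880, 990],
    'averageRate': [15, 37, 18]
}

dataSet = pd.DataFrame(data=countryData)

# Rows where averageRate is above 16 AND below 30.
in_range = dataSet[(dataSet['averageRate'] > 16) & (dataSet['averageRate'] < 30)]
print(in_range['country'].tolist())  # ['France']

# Rows where averageRate is below 16 OR population is above 900.
either = dataSet[(dataSet['averageRate'] < 16) | (dataSet['population'] > 900)]
print(either['country'].tolist())    # ['India', 'France']
```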

Looping over pandas dataFrames

Iteration over a Pandas DataFrame is typically done with the iterrows() method. Consider the example shown below:

import pandas as ps

countryData = { 
              'country': ['India', 'China', 'France'], 
              'population': [560, 880, 990], 
              'averageRate': [15, 37, 18] 
}

dataSet = ps.DataFrame(data=countryData)

for index, row in dataSet.iterrows():
    print(index)
    print(row)

If we want to get the data of a particular column:

for index, row in dataSet.iterrows():
    print(str(index) + ': ' + str(row['country']))

# output:
0: India
1: China
2: France

What if we want to add a new column to the above dataSet that we have created? Yes, we can do that as well in pandas:

for index, row in dataSet.iterrows():
    dataSet.loc[index, "new_averageRate"] = row["averageRate"] * 1.5

print(dataSet)

# output:
   country  population  averageRate  new_averageRate
0    India         560           15             22.5
1    China         880           37             55.5
2   France         990           18             27.0

In the above example, you can see that we are using loc to set the value of the new column for each row. The new column is calculated as 1.5 times the average rate.
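
For a calculation like this one, the loop is not strictly necessary: arithmetic on a whole column works element by element, just as it does with numpy arrays. The sketch below is an alternative to the loop above and produces the same column.

```python
import pandas as pd

countryData = {
    'country': ['India', 'China', 'France'],
    'population': [560, 880, 990],
    'averageRate': [15, 37, 18]
}

dataSet = pd.DataFrame(data=countryData)

# Multiply the whole column at once and store the result as a new column.
dataSet['new_averageRate'] = dataSet['averageRate'] * 1.5

print(dataSet['new_averageRate'].tolist())  # [22.5, 55.5, 27.0]
```

On large data frames this form is usually much faster than iterating with iterrows(), because the work is done on the whole column in one step.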

To know more about pandas visit: link

In this post, we have learned how to use various Python packages for data science. I hope it gives you a great kick start on your path to becoming a data scientist. I will cover the scikit-learn package in my next tutorial. Stay tuned and happy learning.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Tavish Aggarwal is a front-end developer working in Hyderabad. He is very passionate about technology and loves to work in a team.