Importing data using Python

In today's world, a huge amount of data is generated by all kinds of devices, and it comes in formats ranging from flat files to tabular structures. In this post, we will look at the Python packages used to import the following file types:

  1. Flat files - .txt, .csv files
  2. Pickled file
  3. Excel files
  4. SAS files
  5. STATA file
  6. HDF5 files
  7. mat file
  8. Relational database
  9. Reading data from web

If you are not comfortable with Python packages like pandas and numpy, please go through the post Python packages for Data Science first.

And if you want to learn or refresh the basics of Python before diving into importing files, I would recommend going through the posts below:

  1. Python for Data Science - part 1

  2. Python for Data Science - part 2

  3. Python for Data Science - part 3

Let's get started with flat files.

Flat Files

A flat file is a file that stores data in plain text format. Common flat file formats are .txt (text files) and .csv (comma-separated values files). Let's see how to read data from a text file.

Reading data from a text file involves three basic steps:

  1. Open the file
  2. Read data from the file
  3. Close the file

The code below demonstrates reading data from a text file:

file = open('sample_file.txt', mode='r') # r stands for read mode

print(file.readline()) # Reading 1st line of file

file.close()

There is a better approach: reading the file with a context manager. With this approach we don't have to worry about closing the file after reading data from it.

with open('sample_file.txt') as file:
    print(file.readline()) # Reading 1st line of file

You can try it with the file sample_file.txt.
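If you want to read the whole file rather than a single line, you can also iterate over the file object; a minimal sketch using the same sample_file.txt:

with open('sample_file.txt') as file:
    for line in file:
        print(line.rstrip())  # rstrip() removes the trailing newline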

We can read a .csv file using either numpy or pandas. Here I will demonstrate both ways to read a .csv file.

Using numpy:

NumPy provides two functions for reading .csv files. These are:

  1. loadtxt()
  2. recfromcsv()

Let's see how to read data from a file using these two functions. Refer to the code shown below:

import numpy as np

file = 'population.csv'

# recfromcsv infers column names and data types from the file
population = np.recfromcsv(file, delimiter=',')

# loadtxt expects purely numeric data, so we skip the header row
population_data = np.loadtxt(file, delimiter=',', dtype=float, skiprows=1)

Here we use the skiprows option because the first row of the dataset contains the headers.

You can try with the population.csv file
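As a quick sanity check (assuming the file loaded cleanly), you can inspect the shape of the structured array and the first few rows of the numeric array:

print(population.shape)    # recfromcsv returns a structured (record) array
print(population_data[:5]) # First five rows of the plain numeric array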

Using pandas:

If you have been following along, we have already imported a .csv file using pandas. If not, don't worry, I will demonstrate it one more time. Here we will go one step further and convert the resulting pandas DataFrame to a numpy array. Let's see this in action.

import pandas as pd

data = pd.read_csv(file, nrows=5) # Reading 5 rows

# Convert the DataFrame to a numpy array
data_array = data.values

print(type(data_array))

# Output: <class 'numpy.ndarray'>

Pickled file 

The .pickle file extension is associated with Python's pickle module, which implements the algorithms for serializing and de-serializing a Python object structure.

A pickled file stores a byte stream representing objects. Let's first understand how to create a pickled file, and then we will see how to import it.

import pickle

# Writing to a pickled file
a = ['Python', 'Pickled', 'file']

file_Name = "pickledDemo"
fileObject = open(file_Name, 'wb')
pickle.dump(a, fileObject)
fileObject.close()

# Reading from the pickled file
fileObject = open(file_Name, 'rb')
b = pickle.load(fileObject)
print(b)
fileObject.close()

Now let's see how to import a pickled file using the context manager approach:

import pickle

file_Name = "pickledDemo"

with open(file_Name, 'rb') as file:
    text = pickle.load(file)

print(text)

You can try with the pickledDemo file.

Excel files

If you have used Microsoft Office you must be familiar with Excel files. An Excel file looks very similar to a table. Refer to the example shown below to import an Excel file using the pandas package:

import pandas as pd

file = 'sampleFile.xlsx'

xl = pd.ExcelFile(file)

# Sheet names in the Excel file
print(xl.sheet_names)

# Load a sheet into a DataFrame by name
df = xl.parse('Sheet1')


print(df.head())

# Load a sheet into a DataFrame by index
df_index = xl.parse(0)

# Print the head of the DataFrame df_index
print(df_index.head())

You can test with sampleFile.xlsx
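pandas can also load a sheet in a single call using the read_excel function; a minimal sketch with the same file:

import pandas as pd

# read_excel loads one sheet directly into a DataFrame
df = pd.read_excel('sampleFile.xlsx', sheet_name='Sheet1')

print(df.head())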

SAS files

SAS files are generated by SAS (Statistical Analysis System) software.

This software is widely used for data modeling and analytics, which makes it popular among data scientists. We can import a SAS file in Python using the sas7bdat package. Let's look at the code to import a SAS file:

from sas7bdat import SAS7BDAT

with SAS7BDAT('test.sas7bdat') as file:
    df_sas = file.to_data_frame()

print(df_sas.head())

You can test with test.sas7bdat
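pandas also provides a built-in reader for SAS files, so the sas7bdat package is not strictly required; a minimal sketch:

import pandas as pd

# read_sas understands both .sas7bdat and .xport files
df_sas = pd.read_sas('test.sas7bdat')

print(df_sas.head())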

STATA files

There are several ways to produce a .dta file, but the most common is the Stata data analysis software, which stores its datasets in .dta files.

The way to import a .dta file using Python is shown below:

import pandas as pd

df = pd.read_stata('profits.dta')

print(df.head())

You can test with profits.dta file.

HDF5 files

HDF5 is a technology for managing extremely large and complex data files. An HDF5 file contains multidimensional arrays of scientific data. HDF5 files are commonly used in aerospace, physics, engineering, finance, etc. Let's see how we can import an HDF5 file using the h5py package:

import numpy as np
import h5py

file = 'test_data.hdf5'

data = h5py.File(file, 'r')

# Print the top-level groups in the file
for key in data.keys():
    print(key)

# Access one group and list the datasets it contains
group = data['testing']

for key in group.keys():
    print(key)

You can try with test_data.hdf5.
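Once you have reached a dataset inside a group, slicing it returns its contents as a numpy array. The group and dataset names below ('testing' and 'values') are only placeholders for whatever your file actually contains:

# 'values' is a hypothetical dataset name inside the 'testing' group
dataset = data['testing']['values']

values = dataset[:]  # Slicing a dataset returns a numpy array
print(values.shape)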

mat file

This type of file is generated by the MATLAB program.

To import a .mat file using the Python package scipy.io, refer to the code shown below:

import scipy.io

mat = scipy.io.loadmat('test_file.mat')

# Print the keys of the MATLAB file
print(mat.keys())

You can try with test_file.mat.
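Each variable in the dictionary returned by loadmat maps to a numpy array holding the corresponding MATLAB value. The variable name 'x' below is only a hypothetical example:

x = mat['x']  # 'x' is a hypothetical MATLAB variable name

print(type(x))  # <class 'numpy.ndarray'>
print(x.shape)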

Relational database

A relational database consists of a set of tables that store data in rows and columns, and relations can be established between the columns of different tables.

Popular relational database:

  1. PostgreSQL
  2. MySQL
  3. SQLite

I won't be covering the basics of how relational databases work; the assumption is that you have some knowledge of at least one relational database.

Let's work with a SQLite database and import a table into a dataset.

from sqlalchemy import create_engine

engine = create_engine('sqlite:///localhost.database') # Connection string to the database

table_names = engine.table_names()

print(table_names)

Here we are using the sqlalchemy package to connect to the SQLite database. The code above lists all the tables in our database.

Let's see how we can import data using SQL queries:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3)) # Fetch only the first 3 rows
    df.columns = rs.keys()                  # Use the column names from the result set

with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    dfAll = pd.DataFrame(rs.fetchall())     # Fetch all the results
    dfAll.columns = rs.keys()

print(df.head())

It is recommended to use a context manager to connect to the database; doing so saves us the effort of closing the database connection ourselves.

In the approach shown above we query the database by creating an engine, connecting to it, executing the query, and storing the result in pandas. pandas provides a more concise way to query the database:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

df = pd.read_sql_query("select * from Album", engine)

print(df.head())

pandas provides the read_sql_query function, with which we can query the database and store the result in a DataFrame in one step.

Reading data from web

Python packages make it possible to read a file directly from the web.

There are two ways to create a dataset from a file on the web:

  1. Download the file from the web to the local system and create the dataset from it
  2. Create the dataset directly, without storing the file on the local system

First, let's see how we can create a dataset by downloading a file to our local system:

from urllib.request import urlretrieve
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'test.csv')

df = pd.read_csv('test.csv', sep=';')
print(df.head())

We can also avoid saving the file to the local system and read it directly from the URL. Let's see how we can create a dataset without downloading the file:

import pandas as pd

url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

df = pd.read_csv(url, sep=';')
print(df.head())

NOTE: We can also read an Excel file from the web by passing its URL to the pd.read_excel() function. With sheet_name=None it returns a dictionary of DataFrames keyed by sheet name.
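A minimal sketch (the URL below is only a placeholder for any .xlsx file reachable over HTTP):

import pandas as pd

# Placeholder URL: substitute a real .xlsx file reachable over HTTP
url = 'https://example.com/sample.xlsx'

# sheet_name=None loads every sheet into a dict of DataFrames keyed by sheet name
sheets = pd.read_excel(url, sheet_name=None)

print(sheets.keys())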

The requests and urllib packages:

Using the urllib and requests packages we can read files directly from the web.

First, let's see how to fetch a page from the web using the urllib package:

from urllib.request import urlopen, Request

url = "https://www.techladder.in"

request = Request(url, headers={'User-Agent' : "Magic Browser"})
response = urlopen(request)
html = response.read()

print(html)

response.close()

We can perform the same operations with the requests package that we performed with urllib. requests is one of the most downloaded Python packages. Let's see the requests package in action:

import requests

url = "https://www.techladder.in"

r = requests.get(url)
text = r.text

print(text)

BeautifulSoup

The raw HTML we retrieved above is not very useful on its own. To make the most of it we need to extract structured data from the HTML. To do so we use the BeautifulSoup package:

import requests
from bs4 import BeautifulSoup

url = 'https://www.techladder.in'

r = requests.get(url)

html_doc = r.text

soup = BeautifulSoup(html_doc, "lxml")

# Prettify the HTML DOM
pretty_soup = soup.prettify()

print(pretty_soup)

# Extract page title
page_title = soup.title

print(page_title)

#Extract all the text from the page
page_text = soup.get_text()

print(page_text.replace('\n\n','\n'))

# Find all a tags on page
a_tags = soup.find_all('a')

for link in a_tags:
    print(link.get('href'))

Read API response as JSON

JSON is commonly used to exchange data between a server and a browser. A JSON document is a collection of key-value pairs. Let's look at a sample API which returns its response as JSON.

import requests

url = 'https://reqres.in/api/users?page=1'

r = requests.get(url)

json_data = r.json()
print(json_data)

# Iterate over the keys and values of the JSON response
for k in json_data.keys():
    print(k + ': ', json_data[k])

NOTE: We use the json.load() function to read a JSON file stored on the local system.
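A minimal sketch, assuming a local file named data.json:

import json

# 'data.json' is a placeholder for any JSON file on the local system
with open('data.json') as file:
    json_data = json.load(file)

print(json_data)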

Now we are familiar with the techniques for loading various types of files into a dataset. In the next post, I will explain the techniques that data scientists use to clean the data.