Importing data using Python

Tavish Aggarwal

December 13, 2023

In today's world, a lot of data is being generated by various devices, and its format varies from flat files to tabular structures. In this post, we will look at the Python packages used for importing data. We will cover techniques to import the following file types:

  1. Flat files - .txt, .csv files
  2. Pickled file
  3. Excel files
  4. SAS files
  5. STATA file
  6. HDF5 files
  7. mat file
  8. Relational database
  9. Reading data from the web

Let's get started with flat files.

Flat Files

A flat file is a file where data is stored in plain text format. Some of the formats of flat files are .txt (text file), .csv (comma-separated file), etc. Let's see how to read data from a text file.

To read data from a text file there are three steps:

  1. Open the file
  2. Read data from the file
  3. Close the file

Code demonstrating reading data from the text file:

file = open('sample_file.txt', mode='r') # r stands for read mode

print(file.readline()) # Reading 1st line of file

file.close()

A better approach is to read the file using a context manager. With this approach, we don't have to worry about closing the file after reading data from it:

with open('sample_file.txt') as file:
    print(file.readline()) # Reading 1st line of file

You can try this with the sample_file.txt file.
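
Besides readline(), which reads a single line, we can also read the entire file at once or iterate over it line by line. A short sketch using the same sample_file.txt:

with open('sample_file.txt') as file:
    print(file.read()) # Read the whole file into a single string

with open('sample_file.txt') as file:
    for line in file: # Iterate over the file line by line
        print(line, end='')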

We can read a .csv file using NumPy or pandas. Here I will demonstrate both ways to read .csv files.

Using NumPy:

NumPy provides two different functions that we can use to read .csv files:

  1. loadtxt()
  2. recfromcsv()

Let's see how to read data from a file using both functions. Refer to the code shown below:

import numpy as np

file = 'population.csv'

population = np.recfromcsv(file, delimiter=',')

population_data = np.loadtxt(file, delimiter=',', dtype=float, skiprows=1)

Here we have used the skiprows option because the first row of the dataset contains headers.

You can try with the population.csv file

Using pandas:

If you have been following along, we have already imported a .csv file using pandas. If not, don't worry, I will demonstrate it one more time. Here we will go one step further and convert the resulting pandas DataFrame to a NumPy array. Let's see this in action.

import pandas as pd

file = 'population.csv'

data = pd.read_csv(file, nrows=5) # Read the first 5 rows

data_array = data.values # Convert the DataFrame to a NumPy array

print(type(data_array))

# Output: <class 'numpy.ndarray'>

Pickled file 

The pickle file extension is associated with Python's pickle module, which implements algorithms for serializing and de-serializing a Python object structure.

The pickle file stores the byte stream representing objects. Let's first understand how to create a pickled file and then we will see how to import a pickled file.

import pickle
# Writing to a pickled file
a = ['Python','Pickled','file']

file_Name = "pickledDemo"
fileObject = open(file_Name,'wb')
pickle.dump(a,fileObject)
fileObject.close()

# Reading from pickled file
fileObject = open(file_Name,'rb')

b = pickle.load(fileObject)
print(b)

Now let's see how to import pickle files using a context manager approach:

import pickle

file_Name = "pickledDemo"

with open(file_Name, 'rb') as file:
    text = pickle.load(file)

print(text)

You can try with the pickledDemo file.

Excel files

If you have used Microsoft Office, you must be familiar with Excel files. An Excel file stores data in tables spread across one or more sheets. Refer to the example shown below to import an Excel file using the pandas package:

import pandas as pd

file = 'sampleFile.xlsx'

xl = pd.ExcelFile(file)

# Sheet names in the Excel file
print(xl.sheet_names)

# Load a sheet into a DataFrame by name
df = xl.parse('Sheet1')


print(df.head())

# Load a sheet into a DataFrame by index
df_index = xl.parse(0)

# Print the head of the DataFrame df2
print(df_index.head())

You can test with sampleFile.xlsx

SAS files

SAS files are generated by SAS (Statistical Analysis System) software.

This software is widely used for data modeling and analytics, which makes it popular among data scientists. We can import SAS files using the sas7bdat package in Python. Let's look at the code to import a SAS file:

from sas7bdat import SAS7BDAT

with SAS7BDAT('test.sas7bdat') as file:
    df_sas = file.to_data_frame()

print(df_sas.head())

You can test with the test.sas7bdat file.
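
pandas also provides a read_sas() function that can load a SAS file directly into a DataFrame. A minimal sketch, assuming the same test.sas7bdat file as above:

import pandas as pd

# pandas can read .sas7bdat (and .xport) files directly
df_sas = pd.read_sas('test.sas7bdat')

print(df_sas.head())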

STATA files

There can be multiple sources of .dta files, but the most common way to generate one is with the Stata data analysis software, which stores its datasets in .dta files.

The way to import a .dta file using Python is as shown below:

import pandas as pd

df = pd.read_stata('profits.dta')

print(df.head())

You can test with profits.dta file.

HDF5 files

HDF5 is a file format designed to manage extremely large and complex data. It holds multidimensional arrays of scientific data, and HDF5 files are commonly used in aerospace, physics, engineering, finance, etc. Let's see how we can import an HDF5 file using the h5py package:

import numpy as np
import h5py

file = 'test_data.hdf5'

data = h5py.File(file, 'r')

for key in data.keys():
    print(key)

group = data['testing']

for key in group.keys():
    print(key)

You can try with test_data.hdf5.
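
Once we know the keys, we can pull an individual dataset out of a group and work with it as a NumPy array. Here is a minimal sketch; 'values' is a hypothetical dataset name, so replace it with one of the keys printed above:

# 'values' is a hypothetical dataset name inside the 'testing' group
values = np.array(group['values'])

print(values.shape)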

mat file

This is the type of file that is generated by the MATLAB program.

To import a .mat file using the Python package scipy.io, refer to the code shown below:

import scipy.io

mat = scipy.io.loadmat('test_file.mat')

# Print the keys of the MATLAB file
print(mat.keys())

You can try with test_file.mat.
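
scipy.io.loadmat() returns a dictionary in which each variable saved in the MATLAB workspace is a key mapping to a NumPy array. A minimal sketch, assuming the file contains a hypothetical variable named 'x':

# 'x' is a hypothetical variable name stored in the .mat file
x = mat['x']

print(type(x)) # <class 'numpy.ndarray'>
print(x.shape)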

Relational database

A relational database consists of a set of tables that store data in the form of rows and columns, and relationships can be established between the columns of different tables.

Popular relational databases:

  1. PostgreSQL
  2. MySQL
  3. SQLite

I won't be covering the basics of how a relational database works; the assumption is that you have some knowledge of at least one relational database.

Let's use a SQLite database and import a table into a DataFrame.

from sqlalchemy import create_engine

engine = create_engine('sqlite:///localhost.database') #connection string to database

table_names = engine.table_names()

print(table_names)

Here we are using the sqlalchemy package to connect to the SQLite database. The above code lists all the tables in our database.
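
Note that engine.table_names() was deprecated and has been removed in newer versions of SQLAlchemy. If the call above raises an error on your version, here is a sketch of the replacement using the inspection API:

from sqlalchemy import create_engine, inspect

engine = create_engine('sqlite:///localhost.database')

# List all table names via the inspector (works on newer SQLAlchemy versions)
print(inspect(engine).get_table_names())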

Let's see how we can import data using SQL queries:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3)) # Fetch the first 3 rows
    df.columns = rs.keys()
    dfAll = pd.DataFrame(rs.fetchall()) # Fetch the remaining rows (the result set can only be read once)

print(df.head())

It is recommended to use a context manager to connect to the database. That way, the connection is closed automatically when the block is exited.

In the approach shown above, we query the database by creating an engine, connecting to it, executing the query, and storing the results in pandas. Pandas provides a more concise way to query the database:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

df = pd.read_sql_query("select * from Album", engine)

print(df.head())

Pandas provides the read_sql_query function, with which we can query the database and store the results in a DataFrame in a single step.

Reading data from the web

Python packages also make it possible to read a file directly from the web.

There are two ways to create our datasets from files hosted on the web:

  1. We can download the file from the web to the local system and then create a dataset from it
  2. We can create a dataset directly from the URL, without storing the file on the local system

Firstly, let's see how we can create a dataset by downloading a file to our local system:

from urllib.request import urlretrieve
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'test.csv')

df = pd.read_csv('test.csv', sep=';')
print(df.head())

We can also avoid saving the file to the local system and read it directly from the URL. Let's see how to create a dataset without downloading the file:

import pandas as pd

url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

df = pd.read_csv(url, sep=';')
print(df.head())

NOTE: We can also read an Excel file from the web using the pd.read_excel() function. Passing sheet_name=None returns a dictionary of DataFrames keyed by sheet name.
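
A minimal sketch of reading an Excel file from a URL; the URL here is a placeholder and should point to an actual .xlsx file:

import pandas as pd

# Placeholder URL; replace with the location of a real .xlsx file
url = 'https://example.com/sampleFile.xlsx'

# sheet_name=None loads every sheet into a dictionary keyed by sheet name
sheets = pd.read_excel(url, sheet_name=None)

print(sheets.keys())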

Requests and urllib packages:

Using the urllib and requests packages, we can read files directly from the web.

Firstly, let's see the way to import files from the web using the urllib package:

from urllib.request import urlopen, Request

url = "https://www.techladder.in"

request = Request(url, headers={'User-Agent' : "Magic Browser"})
response = urlopen(request)
html = response.read()

print(html)

response.close()

We can perform the same operations with the requests package that we performed with the urllib package. requests is one of the most downloaded Python packages. Let's see the requests package in action:

import requests

url = "https://www.techladder.in"

r = requests.get(url)
text = r.text

print(text)

BeautifulSoup

The raw HTML that we got above is not very useful by itself. To make the most of it, we need to extract structured data from the HTML. To do so, we use the BeautifulSoup package:

import requests
from bs4 import BeautifulSoup

url = 'https://www.techladder.in'

r = requests.get(url)

html_doc = r.text

soup = BeautifulSoup(html_doc, "lxml")

# Prettify the HTML DOM
pretty_soup = soup.prettify()

print(pretty_soup)

# Extract page title
page_title = soup.title

print(page_title)

#Extract all the text from the page
page_text = soup.get_text()

print(page_text.replace('\n\n','\n'))

# Find all a tags on page
a_tags = soup.find_all('a')

for link in a_tags:
    print(link.get('href'))

Read API response as JSON

JSON is commonly used to communicate between a server and a browser. A JSON object is a collection of key-value pairs. Let's call a sample API that returns its response as JSON.

import requests

url = 'https://reqres.in/api/users?page=1'

r = requests.get(url)

json_data = r.json()
print(json_data)

# Iterate over the keys and values of the JSON response
for k in json_data.keys():
    print(k + ': ', json_data[k])

NOTE: We use the json.load() function to read a JSON file stored on the local system.
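
A minimal sketch of reading a local JSON file; 'sample.json' is a placeholder file name:

import json

# 'sample.json' is a placeholder; any JSON file on the local system will do
with open('sample.json') as file:
    json_data = json.load(file)

print(json_data)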

Now, we are familiar with the technique of loading any type of file into a dataset. In the next post, I will explain the techniques that data scientists use to clean the data.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Tavish lives in Hyderabad and works as a research-based Data Scientist specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on yielding high-quality results.