Improving the Supervised Learning Model using Python

Tavish Aggarwal

December 13, 2023

Typically in Machine Learning, we generate a model and, once it is generated, use various techniques to measure its accuracy. But the question that might be on your mind is: "The accuracy of the model doesn't match my expectations. What can I do to improve it?"

Valid thought. In this post, we will look at a few best practices for generating models. In other words, we will look at techniques to tune the model.

Let's get started. We will cover the following techniques in detail:

  1. Cleaning the dataset
  2. Encoding categorical data
  3. Normalizing data

Cleaning the dataset

This is the most crucial step in the entire process of generating Data Science models. If you don't have a clean dataset, for example:

  1. The dataset has missing values
  2. The dataset has duplicate records, etc.

then the accuracy of your model will probably suffer a lot.
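
As a quick first pass, pandas makes it easy to spot both problems. Below is a minimal sketch; 'students.csv' is a hypothetical file name standing in for your own dataset:

import pandas as pd

# 'students.csv' is a hypothetical dataset; substitute your own file
df = pd.read_csv('students.csv')

# Count missing values per column, and count duplicate rows
print(df.isna().sum())
print(df.duplicated().sum())

# Drop exact duplicate records
df = df.drop_duplicates()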

We can use the SimpleImputer class provided by scikit-learn to fill in missing values (in older scikit-learn versions this was called Imputer). Let's look at an example shown below:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Pipeline: replace NaNs with the most frequent value, then fit an SVM
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('SVM', SVC())]

pipeline = Pipeline(steps)

# X and y are the feature matrix (which may contain NaNs) and the labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))

In the post Machine Learning: Cleaning data, I have explained in detail the most common techniques Data Scientists use to clean a dataset. I recommend going through it for a better understanding.

Categorical Data

There is a high chance that our dataset has a few columns containing categorical data. Encoding such columns numerically can have a drastic effect on the accuracy of the model. Consider an example where we have a student_status column with two values, 'PASSED' and 'FAILED'.

Since the column values are text, scikit-learn and many other packages cannot work with them directly. So we need to convert the column to integers before fitting the model. Here we can convert 'PASSED' to 1 and 'FAILED' to 0. Consider the example shown below:

import pandas as pd

# One-hot encode every categorical (text) column in the DataFrame
df_students = pd.get_dummies(df)

# drop_first=True drops one redundant column per feature
df_students = pd.get_dummies(df, drop_first=True)

print(df_students.columns)

You can observe in the above example that we are using the get_dummies function provided by pandas to one-hot encode the categorical columns. You might have noticed that the new column names are created in the format columnName_categoryName.

Once the categorical columns are created, you can continue with the process of generating the model on the new dataset, and there is a good chance the model's performance will improve.
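
To make the student_status example concrete, here is a minimal sketch with a hypothetical three-row DataFrame:

import pandas as pd

# Hypothetical data for the student_status example above
df = pd.DataFrame({'student_status': ['PASSED', 'FAILED', 'PASSED']})

# drop_first=True leaves a single 0/1 column: student_status_PASSED
print(pd.get_dummies(df, drop_first=True))

The output has a single column, student_status_PASSED, which is 1 (True) for 'PASSED' rows and 0 (False) for 'FAILED' rows, exactly the encoding described above.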

Normalizing Data

There are algorithms, like kNN (the k-nearest neighbors algorithm), that use distance to make decisions. What if our features vary a lot in scale (i.e., the dataset has high standard deviations)? Can we still expect accurate models?

We need to normalize our data to bring the features onto comparable scales. There are various techniques to normalize the data, shown below (a quick sketch of options 2 and 3 follows the list):

  1. Standardization: subtract the mean and divide by the standard deviation
  2. Min-max scaling: subtract the minimum and divide by the range
  3. Scaling the data to a range such as -1 to +1
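
As a quick sketch of options 2 and 3, scikit-learn's MinMaxScaler implements both; the tiny two-feature array below is hypothetical:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Option 2: subtract the minimum and divide by the range (maps to [0, 1])
print(MinMaxScaler().fit_transform(X))

# Option 3: scale each feature to the range [-1, 1]
print(MinMaxScaler(feature_range=(-1, 1)).fit_transform(X))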

Standardization

We can use the scale function to standardize the values. Consider an example shown below:

from sklearn.preprocessing import scale
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np

irisData = load_iris()

X = irisData.data
y = irisData.target

# Standardize every feature: zero mean, unit standard deviation
X_scaled = scale(X)

# Split the scaled and the unscaled data the same way
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(
    X_scaled, y, test_size=0.5, random_state=24)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=24)

# Fit kNN classifiers on the scaled and the unscaled training data
knn_scaled = KNeighborsClassifier().fit(X_train_scaled, y_train_scaled)
knn = KNeighborsClassifier().fit(X_train, y_train)

print(knn_scaled.score(X_test_scaled, y_test_scaled))
print(knn.score(X_test, y_test))

print("Standard Deviation of features before scaling: ", np.std(X))
print("Standard Deviation of features after scaling: ", np.std(X_scaled))

In the example shown above, to show the difference between scaled and unscaled data, I have:

  • Computed the kNN accuracy score for both scaled and unscaled data.
  • Computed the standard deviation for both scaled and unscaled data.

Run the code to see the difference between scaled and unscaled data for yourself.

We can also create a pipeline and perform operations to standardize the data. Consider an example shown below:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

irisData = load_iris()

X = irisData.data
y = irisData.target

# Pipeline: standardize the features, then fit a kNN classifier
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=24)

# The pipeline scales the data internally; the plain kNN sees raw features
knn_scaled = pipeline.fit(X_train, y_train)
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

print('Accuracy after scaling: ', knn_scaled.score(X_test, y_test))
print('Accuracy without scaling: ', knn_unscaled.score(X_test, y_test))

It is similar to the first example we saw earlier. Here, instead of scaling the data and then splitting it, we build a pipeline that scales the data as part of model fitting. This also avoids data leakage: the scaler is fit only on the training split and then applied to the test split.
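
A nice side benefit is that the same pipeline plugs straight into cross-validation. Below is a minimal sketch, rebuilding the pipeline from the example above with scikit-learn's cross_val_score:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier())])

# cross_val_score refits the whole pipeline on each fold, so the scaler
# never sees the held-out data and no information leaks into scaling
scores = cross_val_score(pipeline, X, y, cv=5)
print('Mean cross-validated accuracy: ', scores.mean())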

Summary

In this post, we learned techniques that should be followed to generate an optimized model. We covered cleaning the dataset, encoding categorical data, and normalizing the data, all of which can result in much better performance.

Consider these techniques must-haves for your dataset (wherever applicable). As a newbie, there is a high chance that your dataset is missing these optimizations while you are scratching your head trying to improve model performance.

Are you cleaning, encoding, and normalizing your dataset before generating models? Please let me know in the comment section below. Happy Learning!

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Living in Hyderabad and working as a research-based Data Scientist specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on yielding high-quality results.