Accuracy of models using Python

Tavish Aggarwal

Suppose we have a model designed and ready to be deployed to production. Before deploying it, it is very important to test the accuracy of the model. Most of the time, data scientists tend to measure model performance using accuracy alone. Some of us might think we have already done that using the score() function.

Consider the example shown below, where we use the k-nearest neighbors algorithm to solve a classification problem.

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train,y_train)

# Calculate the accuracy of the model
print(knn.score(X_test, y_test))

In the example shown above, we use the score() function to check the accuracy of the model.

The question that arises here is: is it always correct to rely on the score() function to check the accuracy of the model?

Consider fraud detection, where most transactions are genuine and only about 1% are fraudulent.

In this scenario, a model that simply labels every transaction as genuine would score 99% accuracy, even though it never detects a single fraud, so the accuracy percentage alone does not reflect how well the model performs.

This is known as class imbalance. In such scenarios, we cannot rely on the score function alone.
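
To see why accuracy alone can mislead on imbalanced data, here is a minimal sketch using a synthetic dataset and scikit-learn's DummyClassifier as a baseline (both are illustrative choices, not part of the original example):

# A synthetic imbalanced dataset: roughly 99% genuine (0) and 1% fraudulent (1)
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(10000, 5)
y = (rng.rand(10000) < 0.01).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A baseline that always predicts the majority class ("genuine")
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

# Accuracy is around 0.99 even though the model never flags a single fraud
print(baseline.score(X_test, y_test))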

Let's look at a few techniques that will be helpful and will become important tools for you when solving machine learning problems.

Confusion Matrix

How do we know which algorithm to use to solve a machine learning problem when there are so many algorithms available? The best way is to follow these steps:

  1. Divide the dataset into training and test sets.
  2. Train every potential algorithm that you think may solve the problem on the training set.
  3. Then compute the confusion matrix on the test set for each of the trained algorithms.

In simple terms, the confusion matrix is a table in which we compare the actual vs. the predicted results. We divide the entries of the confusion matrix as follows:

  • True Positive: The prediction is positive and the actual value is also positive.
  • True Negative: The prediction is negative and the actual value is also negative.
  • False Positive: The prediction is positive but the actual value is negative.
  • False Negative: The prediction is negative but the actual value is positive.

NOTE: In scikit-learn's confusion_matrix output, rows correspond to the actual values and columns correspond to the predicted values.
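
To see this layout in practice, here is a minimal sketch with illustrative labels (not part of the iris example that follows):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 0, 0, 0]
y_predicted = [1, 0, 1, 0, 0, 1]

# Row 0 holds the actual negatives, row 1 the actual positives;
# column 0 holds the predicted negatives, column 1 the predicted positives
print(confusion_matrix(y_actual, y_predicted))
# [[2 1]
#  [1 2]]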

Let's look at an example:

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train,y_train)

y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Based on these results, we can choose an algorithm and also measure the performance of the model. But sometimes it's hard to choose between algorithms because their confusion matrices are almost identical.

In that case, what should we do? We need to calculate more sophisticated metrics like sensitivity, specificity, ROC, and AUC that can help us make more precise decisions.

Sensitivity

Sensitivity tells us the percentage of positive results correctly identified by the model. It is calculated as true positive / (true positive + false negative).

Specificity

Specificity tells us the percentage of negative results correctly identified by the model. It is calculated as true negative / (true negative + false positive).
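
Both quantities can be read straight off the confusion matrix. Here is a minimal sketch with illustrative binary labels (not taken from the examples above):

from sklearn.metrics import confusion_matrix

y_actual    = [0, 0, 1, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 1, 1, 0, 0, 1, 0]

# For a binary problem, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print("Sensitivity: {}".format(sensitivity))
print("Specificity: {}".format(specificity))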

ROC (Receiver Operating Characteristic)

Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models.

The ROC curve plots the true positive rate against the false positive rate at different classification thresholds.

# Import necessary modules
from sklearn.metrics import roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

data = load_breast_cancer()

# Create feature and target arrays
X = data.data
y = data.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train,y_train)

# Compute predicted probabilities: y_pred_prob
y_pred_prob = knn.predict_proba(X_test)[:,1]


# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

NOTE: 

  1. The threshold is a probability that ranges from 0 to 1.
  2. Most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample belonging to a particular class (see the sketch below).
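
As a small illustration of the threshold idea, here is a minimal sketch that continues from the ROC example above (so knn and X_test are assumed to already exist); the 0.3 threshold is just an illustrative choice:

y_pred_prob = knn.predict_proba(X_test)[:, 1]

# .predict() corresponds to thresholding the probability at 0.5;
# lowering the threshold flags more samples as positive
y_pred_default = (y_pred_prob >= 0.5).astype(int)
y_pred_lenient = (y_pred_prob >= 0.3).astype(int)

print("Positives at threshold 0.5: {}".format(y_pred_default.sum()))
print("Positives at threshold 0.3: {}".format(y_pred_lenient.sum()))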

AUC (Area Under the Curve)

AUC is the area under the ROC curve. Consider an example where we have used two algorithms: if the area under the first algorithm's ROC curve is greater than the area under the second algorithm's ROC curve, then we should prefer the first algorithm.

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()

# Create feature and target arrays
X = data.data
y = data.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train,y_train)

logreg = LogisticRegression(max_iter=10000)  # raise max_iter so the solver converges on this dataset
logreg.fit(X_train,y_train)

# Compute predicted probabilities: y_pred_prob
y_pred_prob_knn = knn.predict_proba(X_test)[:,1]
y_pred_prob_logreg = logreg.predict_proba(X_test)[:,1]

print("AUC for knn: {}".format(roc_auc_score(y_test, y_pred_prob_knn)))
print("AUC for logreg: {}".format(roc_auc_score(y_test, y_pred_prob_logreg)))

# Compute cross-validated AUC scores: cv_auc
cv_auc_knn = cross_val_score(knn, X, y, scoring='roc_auc', cv=5)
cv_auc_logreg = cross_val_score(logreg, X, y, scoring='roc_auc', cv=5)
# Print list of AUC scores
print("AUC scores using 5-fold cross-validation knn: {}".format(cv_auc_knn.mean()))
print("AUC scores using 5-fold cross-validation logreg: {}".format(cv_auc_logreg.mean()))

In the example shown above, we calculate the AUC on the same data using the k-nearest neighbors algorithm and the logistic regression algorithm. Based on the results, we can choose the algorithm with the higher AUC value.

There are more metrics that we can calculate from the confusion matrix:

Precision: true positive / (true positive + false positive)

Recall: true positive / (true positive + false negative)

Accuracy: (true positive + true negative) / (true positive + true negative + false positive + false negative)
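
scikit-learn also provides helper functions for these metrics. A minimal sketch using the same illustrative labels as the sensitivity/specificity example above:

from sklearn.metrics import precision_score, recall_score, accuracy_score

y_actual    = [0, 0, 1, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 1, 1, 0, 0, 1, 0]

print("Precision: {}".format(precision_score(y_actual, y_predicted)))
print("Recall: {}".format(recall_score(y_actual, y_predicted)))
print("Accuracy: {}".format(accuracy_score(y_actual, y_predicted)))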

Hyperparameter Tuning

We have discussed models like the k-nearest neighbors algorithm and regression algorithms. These have a few constants, such as n_neighbors for the k-nearest neighbors algorithm, to which we need to assign a value somewhat arbitrarily. How do we decide the value for these constants?

A possible solution is hyperparameter tuning, in which we define a space of possible values that we think may lead to better performance. We then evaluate the model for every value in that space and check the accuracy of the model.

Let's look at an example shown below:

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

n_neighbors_space = list(range(1, 11))
param_grid = {'n_neighbors': n_neighbors_space}

knn = KNeighborsClassifier()

knn_cv = GridSearchCV(knn, param_grid, cv=5)
# Fit it to the data
knn_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned k Nearest Algorithm Parameters: {}".format(knn_cv.best_params_))
print("Best score is {}".format(knn_cv.best_score_))

Note that we have used the GridSearchCV function here, which builds a grid of all the possible values defined and records the accuracy achieved for each of them.

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. 

# Import necessary modules
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

n_neighbors_space = list(range(1, 11))
param_grid = {'n_neighbors': n_neighbors_space}

knn = KNeighborsClassifier()

# Here we are using RandomizedSearchCV instead of GridSearchCV
knn_cv = RandomizedSearchCV(knn, param_grid, cv=5)

# Fit it to the data
knn_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned k Nearest Algorithm Parameters: {}".format(knn_cv.best_params_))
print("Best score is {}".format(knn_cv.best_score_))

In the example shown above, we use the RandomizedSearchCV function instead of GridSearchCV. You will notice that the output is the same here, because the small search space gets fully covered; the real difference shows up in the computational cost when the search space is large.
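
RandomizedSearchCV also accepts probability distributions rather than fixed lists, which is where it really differs from GridSearchCV. Here is a minimal sketch (the range of n_neighbors values is an illustrative choice):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Sample n_neighbors from a discrete uniform distribution instead of a fixed list
param_dist = {'n_neighbors': randint(1, 50)}

knn_cv = RandomizedSearchCV(KNeighborsClassifier(), param_dist, n_iter=10, cv=5, random_state=42)
knn_cv.fit(X, y)

print("Tuned k Nearest Algorithm Parameters: {}".format(knn_cv.best_params_))
print("Best score is {}".format(knn_cv.best_score_))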

Summary

In this post, we have explored options to evaluate the performance of a model.

  • We studied the confusion matrix and the classification report.
  • We then studied the ROC curve and AUC, which can be used to evaluate the performance of models built with different algorithms.
  • We also looked at hyperparameter tuning to choose the best values for tuning the model.

Hope that you will add all the techniques discussed above to your data science toolkit and use them. Happy Learning!

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Tavish Aggarwal is a front-end developer working in Hyderabad. He is very passionate about technology and loves to work in a team.