Introduction to machine learning
What the heck is machine learning?
If I had to define it in a single sentence, I would say: 'Machine Learning is a way to find patterns in data in order to predict the future.'
This is not the only definition of machine learning. There are many more definitions out there, depending on what you are trying to achieve.
The process of Machine Learning
Suppose we have a large set of data that contains some pattern which is not possible for a human brain to identify.
- We pass this data to a machine learning algorithm, which studies the pattern and gives us a model.
- Applications can then feed new data to this model and check whether the same pattern exists in it.
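The two steps above can be sketched with scikit-learn. This is a minimal illustration using made-up toy data, not a real workflow: the algorithm studies the training data, gives us a model, and we then pass new data to that model.

```python
from sklearn.linear_model import LinearRegression

# Toy training data with an obvious pattern: y = 2 * x
X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]

# Step 1: the algorithm studies the pattern and gives us a model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 2: an application passes new data to the model
prediction = model.predict([[5]])
print(round(prediction[0]))  # the learned pattern suggests 10
```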
Now the question is: do we always have data that contains some pattern?
The answer is NO. As data scientists, we will usually be given raw data. It is our responsibility to perform data transformation and manipulation using appropriate tools. This processed data (the training data) can then be used as input to machine learning algorithms.
Types of Machine Learning
- Supervised Learning: The value or result that we want to predict is present in the training data. The value in the data that we want to predict is known as the target value.
- Unsupervised Learning: The value or result that we want to predict is not present in the training data.
Categorizing Machine Learning problems
- Regression: Used with supervised data. Here we try to find a line or curve that fits our training data.
- Classification (categorization): Used with supervised data. We split our data into classes, and when new data comes in we try to figure out which class it belongs to.
- Clustering: Used with unsupervised data. Here we group our data into clusters.
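To make the unsupervised case concrete, here is a small clustering sketch with scikit-learn's KMeans on made-up, unlabeled data. Note there is no target value anywhere; the algorithm groups the data on its own.

```python
from sklearn.cluster import KMeans

# Unlabeled toy data: two obvious groups, one near 0 and one near 10
X = [[0.0], [0.5], [1.0], [9.0], [9.5], [10.0]]

# Clustering: group the data into 2 clusters without any target value
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Points near 0 land in one cluster, points near 10 in the other
print(labels)
```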
Styles of Machine Learning Algorithms
- Decision tree
- Neural Network: Inspired by the way our brain works
We will talk about these styles in upcoming posts. For now, we only need to know that they exist.
Machine Learning Workflow
- Asking the right question
- Preparing data
- Selecting the algorithm
- Training the model
- Testing the model
Let's see the workflow in detail.
Asking the right question
It is very important to know what you want from your data, and whether you can actually get the desired results from the data that you have.
If the question you are asking is not the right one, there is a good chance you won't get the desired results once your model is ready. So asking the right question is essential for making predictions from data.
Preparing data
It is the most crucial step of the entire process. Data scientists spend most of their time preparing data; in most cases, more than 50% of the entire process is spent on it.
The major steps of preparing data include loading, exploring, cleaning, imputing, and molding the data.
Let me explain more about the imputing options. Most of the time, the data we have contains null or missing values. How do we deal with such a situation, given that it may bias our results? There are various options:
- Ignore it
- Delete the rows which are having missing data
- Replace values or Impute
It is easy to ignore or delete rows. But what if 400 out of 1000 rows have missing data? Would it be okay to delete those 400 rows? Certainly the answer is NO.
In such a case we have to replace the values. One common way is to replace a missing value with the mean or median of the column.
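A short sketch of this replace-or-impute option, using scikit-learn's SimpleImputer on a made-up column with missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with two missing values (np.nan)
data = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan]])

# Replace (impute) missing values with the mean of the known values;
# strategy="median" is also available
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(data)

print(filled.ravel())  # missing entries become (1 + 2 + 4) / 3
```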
Selecting the algorithm
Selecting the right algorithm is essential to get the desired results. We pass the training data we prepared in the previous step to the algorithm, the algorithm learns from the data and returns a model, and then we pass the real data we want to predict to that model.
More than 50 machine learning algorithms have been created, so choosing the correct one is challenging. The algorithm is selected based on several factors:
- The type of problem we are trying to solve.
- The data scientist's own criteria for selecting an algorithm.
- Most importantly, experience plays a big role in choosing the factors on which the algorithm is selected.
The general technique that most data scientists follow to choose the correct algorithm is elimination. We can eliminate candidates and get closer to the right algorithm based on:
- Supervised learning or unsupervised learning
- Regression or Classification or Clustering
- Initially, it is safe to eliminate ensemble algorithms. These are algorithms that combine many child algorithms.
- We can also eliminate algorithms based on whether they are basic or enhanced. Enhanced algorithms are improvements over basic ones; as beginners, we can start with the basic algorithms.
In my upcoming posts, I will explain some of the most-used algorithms in detail.
Training the model
Training our model is an important step. Usually, as our data changes over time, we need to retrain our model so that it keeps predicting the right results.
To train the model we usually split our prepared data into:
- Training data
- Test data
Training data is the data used to create the model. Test data is data for which we already know the result we are looking for, so we pass it to the model created from the training data to check the model's accuracy.
Mostly 70% of prepared data is used as training data and the remaining 30% is used as test data.
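This 70/30 split can be sketched with scikit-learn's train_test_split, here on made-up data of 10 rows:

```python
from sklearn.model_selection import train_test_split

# Toy prepared data: 10 rows of features and their known results
X = [[i] for i in range(10)]
y = [i * 2 for i in range(10)]

# Hold back 30% as test data; the remaining 70% is training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 7 rows for training, 3 for testing
```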
The columns that we use to train our model are known as features. One way to improve the performance of the model is to use the minimum number of features needed to train it. If you are following along, you may remember that I have shared some of the Python packages used for machine learning. Please visit the link.
Additionally, the scikit-learn library is used for:
- Splitting data into training and test data
- Model training
- Model tuning - Improving performance of a model
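As a small taste of the model-tuning task in the list above, here is a sketch using scikit-learn's GridSearchCV, which tries several values of a hyperparameter and keeps the best one. The data and the choice of algorithm (k-nearest neighbors) are made up for illustration.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy classification data: class 0 for small values, class 1 for large
X = [[i] for i in range(20)]
y = [0] * 10 + [1] * 10

# Model tuning: try several values of n_neighbors and keep the best
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_)  # the winning hyperparameter value
```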
You will read more about the scikit-learn package in future posts.
Testing the model
Once we have our model ready, it's time to test it. Remember the 30% of test data we set aside from the prepared data? We use that test data to test our model and measure its accuracy.
The accuracy on the test data and the training data should be close; then we can say that our test is successful. But that is not the case every time. There are some challenges we face while testing the model:
Overfitting of data
It means that the model has learned the training data too well, effectively memorizing it, noise and all, instead of learning the general pattern. The result is that accuracy on the test data falls well below accuracy on the training data.
How can we fix the overfitting?
We can control it with the help of the regularization hyperparameter.
NOTE: This parameter has different names depending on the algorithm we are using, so it is highly recommended to read the documentation to control overfitting in a better way.
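As one concrete example of such a parameter (the name really does vary: it is alpha in scikit-learn's Ridge, but the inverted C in LogisticRegression), here is a sketch on made-up data showing that a stronger regularization setting shrinks the model's coefficients, limiting how closely it can chase the training data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data: y = 3 * x plus a little noise
rng = np.random.RandomState(0)
X = np.arange(10).reshape(-1, 1).astype(float)
y = 3 * X.ravel() + rng.normal(scale=0.5, size=10)

# In Ridge the regularization hyperparameter is called alpha:
# a larger alpha means stronger regularization
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

# The strongly regularized model has a smaller (shrunken) coefficient
print(weak.coef_[0], strong.coef_[0])
```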
Another way to control overfitting is using cross-validation.
Cross-validation is a technique where the training data is split into k folds; in each round, one fold is used as test data and the remaining folds are used as training data.
Some algorithms have a cross-validation version as well, generally denoted by <Algorithm name>CV.
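A sketch of both ideas with scikit-learn, on made-up data: cross_val_score runs plain k-fold cross-validation, and RidgeCV is an example of an <Algorithm name>CV variant that runs cross-validation internally to pick its regularization strength.

```python
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Toy data with a simple pattern: y = 3 * x
X = [[float(i)] for i in range(10)]
y = [3.0 * i for i in range(10)]

# k-fold cross-validation with k=5: each fold takes a turn as test data
scores = cross_val_score(Ridge(alpha=0.1), X, y, cv=5)
print(len(scores))  # one score per fold, so 5 scores

# The CV variant runs cross-validation internally while choosing
# its regularization strength from a list of candidate values
model = RidgeCV(alphas=[0.01, 0.1, 1.0]).fit(X, y)
print(model.alpha_)  # the chosen alpha, one of the candidates
```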
NOTE: Both regularization hyperparameter and cross-validation can be used at the same time to control overfitting.
Now we are clear on the theory behind the machine learning workflow. In upcoming posts I will demonstrate and explain the workflow with an example, using Python code to generate a model that we can then test with real data.