Data Preprocessing in ML
We often hear this term while working with Machine Learning models, and it is considered an essential part of the Machine Learning life cycle. So, today in this blog I will explain what exactly Data Preprocessing is and how it works.
What is Data Preprocessing?
- Data Preprocessing, as the name suggests, is the preprocessing of data. It means that before the data is fed to the model for training and testing, we process it to bring it into the correct format.
- You check and analyze the raw data and work on it until it is in the desired format, so that we do not face data issues later during model building.
- Preprocessing mainly deals with problems in the data such as missing values, wrong formats, noisy data (i.e. useless data), and general cleaning.
- Data should be clean, free of missing values, and useful, so that we can supply it to our model for training and testing.
What are the steps of Data Preprocessing?
Steps in Data Preprocessing are as follows:
1. Importing the Libraries
In this first step, we import all the libraries that our model requires, so that we can use them easily throughout the program.
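A minimal sketch of this step, assuming a typical pandas/NumPy workflow (the exact set of libraries depends on the model you plan to build):

```python
# Hypothetical starter imports for a preprocessing workflow.
import numpy as np   # numerical arrays and math
import pandas as pd  # tabular data loading and manipulation

# A quick check that the libraries are available and working.
print(np.array([1, 2, 3]).mean())  # 2.0
```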
2. Importing the Dataset
In this step, we import our dataset using pandas, so that our program can read the CSV file and we can work with the data to predict results.
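As a sketch, here is how the dataset is typically loaded with pandas. A small inline CSV stands in for a real file (the column names are invented for illustration); with an actual file you would call `pd.read_csv("Data.csv")` instead:

```python
import io
import pandas as pd

# Inline stand-in for a real CSV file; columns are illustrative.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Conventionally, X holds the independent features and y the target column.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X.shape)  # (3, 3)
```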
3. Taking care of Missing Data
To get better predictions, our data should be accurate, and no missing values should be present in the dataset, so as to avoid inconsistencies in the model. For this, we will use SimpleImputer, a class from the scikit-learn Python library whose main purpose is to handle missing data.
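A short sketch of SimpleImputer on toy numeric data (the values are made up), filling a missing entry with the column mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with a missing value (np.nan); values are illustrative.
X = np.array([[44.0, 72000.0],
              [27.0, np.nan],
              [30.0, 54000.0]])

# Replace missing entries with the column mean; other strategies
# include "median", "most_frequent", and "constant".
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 63000.0, the mean of 72000 and 54000
```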
4. Encoding Categorical Data
Encoding categorical data means converting categorical data, i.e. distinct categories, into integer format. For example, for a column recording which students have completed their homework, there will be two categories: Yes or No. Such categorical values are converted to a binary (numeric) form so the model can use them to predict the outcome for the next input.
5. Encoding Independent Variable
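For the independent variables, a common approach is one-hot encoding. A minimal sketch using scikit-learn's ColumnTransformer and OneHotEncoder, on a made-up feature matrix whose first column is categorical:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy features: column 0 is categorical, the rest are numeric.
X = np.array([["France", 44, 72000],
              ["Spain", 27, 48000],
              ["Germany", 30, 54000]], dtype=object)

# One-hot encode column 0; pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_enc = ct.fit_transform(X)
print(X_enc.shape)  # (3, 5): three one-hot columns + two numeric columns
```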
6. Encoding dependent variable
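For the dependent variable, a simple label encoding is usually enough, since it is a single column of categories. A sketch using scikit-learn's LabelEncoder on the Yes/No homework example from above:

```python
from sklearn.preprocessing import LabelEncoder

# Toy binary target: did the student complete the homework?
y = ["No", "Yes", "No", "Yes"]

# LabelEncoder assigns integers to classes in sorted order: No=0, Yes=1.
le = LabelEncoder()
y_enc = le.fit_transform(y)
print(list(y_enc))  # [0, 1, 0, 1]
```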
7. Splitting the Data into Training and Testing
To train the model and evaluate it on unseen data, we need to split the dataset into two parts: training data and testing data. We mostly follow the 80/20 rule (or 70/30) for splitting the dataset, which means 80% of the data goes to training and 20% to testing. The more training data the model sees, the more accurately it can predict the outcome, leading to better performance.
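The split is typically done with scikit-learn's train_test_split; a sketch on toy data (10 samples, 2 features each):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80/20 split; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```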
8. Feature Scaling
Feature scaling is a useful technique to standardize the independent features so that they lie within a fixed range. It is important to scale the features so that our model is not dominated by features with large values. Two techniques for feature scaling are Normalization and Standardization.
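A sketch of standardization with scikit-learn's StandardScaler, on made-up features with very different scales (age vs. salary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales; values are illustrative.
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, 54000.0]])

# Standardization subtracts the column mean and divides by the column
# standard deviation, so each feature ends up with mean 0 and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # [0. 0.]
```

Normalization (e.g. scikit-learn's MinMaxScaler) would instead rescale each feature into the [0, 1] range.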
That was all about Data Preprocessing in Machine Learning: the step we perform to process the data and prepare the dataset before model building.