Data Science 101: Data Preprocessing

Darshil Patel
4 min read · Sep 27, 2021

Performing common data preprocessing tasks in Python using the Scikit-Learn library.

Introduction

Data preprocessing is the step of data analysis that takes data in its raw form and converts it into a cleaner, more usable format, giving it the structure and context necessary for analysis and modeling.

In this blog, I perform data preprocessing in Python using the Scikit-Learn library. There are many preprocessing methods, but we will focus mainly on the following:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Data Encoding

Encoding is the process of converting data (a given sequence of characters, symbols, categories, etc.) into a specified format. Here, we assign a unique numeric value to each category of a categorical attribute, for example pass as 1 and fail as 0.

Label Encoder

Label encoding converts labels into numeric form to make them machine-readable. Machine learning algorithms can then better decide how those labels should be handled.
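A minimal sketch of label encoding with Scikit-Learn's LabelEncoder (the pass/fail labels below are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels
labels = ["fail", "pass", "pass", "fail"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are assigned integers in sorted order: 'fail' -> 0, 'pass' -> 1
print(encoded)           # [0 1 1 0]
print(encoder.classes_)  # ['fail' 'pass']
```

Note that the integer assignment follows the sorted order of the class names, not the order in which they appear in the data.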

One-Hot Encoder

Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of order. This ordering issue is addressed by a common alternative approach called one-hot encoding. In this strategy, each category value is converted into a new column, and that column is assigned a 1 or 0 (the notation for true/false).

Normalization

Normalization is the process of converting values to a common scale so that large values in the dataset do not dominate the learning process, and features have a similar impact on the model. In Scikit-Learn, the normalize function rescales each sample (row) to unit norm and provides a quick and easy way to perform this operation on a single array-like dataset.
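A minimal sketch using normalize on a small made-up array; with the default L2 norm, each row is divided by its Euclidean length:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical data: first row has L2 norm 5 (3-4-5 triangle)
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

X_norm = normalize(X, norm="l2")  # scale each row to unit L2 norm
print(X_norm)  # [[0.6 0.8]
               #  [1.  0. ]]
```

After normalization, every row has length 1, so samples are comparable regardless of their original magnitude.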

Standardization

Data standardization is the process of rescaling one or more attributes so that they have a mean of 0 and a standard deviation of 1. Standardization of datasets is a common requirement for many machine learning estimators implemented in Scikit-Learn. The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform standardization.
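A small sketch of StandardScaler on a hypothetical single-feature column; the scaler learns the column mean and standard deviation, then rescales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature
X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(scaler.mean_)              # [2.] -- learned column mean
print(X_std.mean(), X_std.std()) # ~0.0 and 1.0 after scaling
```

In a real pipeline you would fit the scaler on the training set only and reuse it (via transform) on the test set, to avoid leaking test statistics into training.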

Imputation of missing values

For various reasons, many real-world datasets contain missing values. A basic strategy for handling incomplete datasets is to discard entire rows and/or columns containing missing values. Alternatively, missing values can be imputed with a provided constant value, or with a statistic (mean, median, or most frequent) of the column in which the missing values are located.

We remove rows with missing values when the ratio of missing values to the total number of values is low; in pandas, this can be done with dropna(). If the ratio is high, we impute the values instead.
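A minimal sketch of mean imputation with Scikit-Learn's SimpleImputer, on a small made-up array containing a NaN:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in column 0
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# The NaN is replaced by the column mean: (1 + 7) / 2 = 4
print(X_imputed)
```

Other strategies ("median", "most_frequent", "constant") swap in the corresponding statistic or a fixed fill value.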

Checking for null values before imputation shows missing entries in the dataset; after imputation, no null values remain.

Discretization

Data discretization is the process of transforming continuous variables, models, or functions into a discrete form. Basically, it is a method of converting continuous attribute values into a finite set of intervals with minimal data loss.
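A short sketch using KBinsDiscretizer to bin a hypothetical continuous feature into three equal-width intervals:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous values spanning [-2, 10]
X = np.array([[-2.0], [0.5], [1.0], [10.0]])

# 3 equal-width bins; "ordinal" encodes each value as its bin index
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)

# Bin edges are -2, 2, 6, 10, so the values map to bins 0, 0, 0, 2
print(X_binned.ravel())  # [0. 0. 0. 2.]
```

The strategy parameter also supports "quantile" (equal-frequency bins) and "kmeans" (bins from 1-D k-means clustering).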

Conclusion

There is a lot more to data preprocessing; I have discussed some of the common methods here. You can learn more in the Scikit-Learn documentation.

That’s it. Hope you find this blog helpful. Check out the entire code on my GitHub profile.

