Data Science 101:- Data Reduction Techniques Using Python

Darshil Patel
3 min readOct 31, 2021

Data reduction using variance threshold, univariate feature selection, recursive feature elimination, PCA.

Data Reduction:

Since data mining is a technique that is used to handle huge amount of data. While working with huge volume of data, analysis became harder in such cases. In order to get rid of this, we uses data reduction technique. It aims to increase the storage efficiency and reduce data storage and analysis costs.

Dimensionality Reduction:

This reduce the size of data by encoding mechanisms.It can be lossy or lossless. If after reconstruction from compressed data, original data can be retrieved, such reduction are called lossless reduction else it is called lossy reduction. The two effective methods of dimensionality reduction are:Wavelet transforms and PCA (Principal Component Analysis).

Principal Component Analysis:

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. For a lot of machine learning applications it helps to be able to visualize your data. Visualizing 2 or 3 dimensional data is not that challenging. You can use PCA to reduce that 4 dimensional data into 2 or 3 dimensions so that you can plot and hopefully understand the data better.

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero variance feature so our data isn’t affected here.

Code:

LinkedIn

--

--

Darshil Patel

CS Grad @ University of Texas at Dallas | Passionate and Versatile Software Engineer | Expertise in Full-Stack Development, Machine Learning, and Data Science