Data Preprocessing in ML (Machine Learning)
Introduction
Data preprocessing is an essential step in the machine learning pipeline. It involves cleaning raw data and transforming it into a format suitable for modelling. Real-world datasets commonly contain missing values, noise, and inconsistencies, which can reduce the accuracy and efficiency of machine learning models. Applying proper preprocessing techniques helps ensure the data is clean, complete, and ready for analysis.
Why Do We Need Data Preprocessing?
Inconsistencies, noise, duplicate records, and missing values are common in raw data. Feeding such data directly into machine learning models can lead to inaccurate predictions and poor model performance. Data preprocessing helps:
- Improve model accuracy
- Enhance efficiency
- Reduce bias in predictions
- Ensure a structured and usable dataset
Steps in Data Preprocessing
1. Getting the Dataset
The first step is collecting and preparing a dataset. Machine learning models rely on structured data, often stored in CSV files, Excel sheets, or databases. Some common sources for datasets include:
- Kaggle
- UCI Machine Learning Repository
- API-generated data
2. Importing Libraries
To perform data preprocessing in Python, we use essential libraries:
import numpy as np # For numerical operations
import pandas as pd # For data handling
import matplotlib.pyplot as plt # For visualization
3. Importing the Dataset
Once we have our dataset, we import it into Python using Pandas:
dataset = pd.read_csv('data.csv')
print(dataset.head()) # View first few rows
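Before cleaning anything, it usually helps to check how much is actually wrong with the data. The sketch below is one way to do that. The inline CSV text and its column names (Country, Age, Salary, Purchased) are hypothetical stand-ins for data.csv, used only so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv so the example is self-contained.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
France,44,72000,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)    # (number of rows, number of columns)
print(dataset.dtypes)   # data type pandas inferred for each column

# Count missing values in each column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# Count rows that are exact copies of an earlier row.
duplicate_rows = int(dataset.duplicated().sum())
print(duplicate_rows)
```

These counts indicate which of the later steps (imputation, de-duplication, encoding) the dataset actually needs.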
4. Handling Missing Data
There are two approaches to dealing with missing data:
- Deleting rows or columns that have missing values. This discards information, so it is not advised when the dataset is small or when many rows would be lost.
- Substituting the mean, median, or mode for missing data.
Using scikit-learn's SimpleImputer to replace missing values in columns 1 and 2 with the column mean:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset.iloc[:, 1:3] = imputer.fit_transform(dataset.iloc[:, 1:3])
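The two approaches can be compared side by side on a small table. This is a minimal sketch with made-up Age and Salary values (not data from this article). It uses the median strategy rather than the mean to show that the strategy argument is interchangeable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up example values; each column has one missing entry.
df = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, np.nan, 35.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0, 58000.0],
})

# Approach 1: drop every row that contains a missing value.
dropped = df.dropna()

# Approach 2: replace each missing value with its column's median.
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(len(df), len(dropped), len(imputed))  # row counts before and after
print(imputed)
```

Dropping removes two of the five rows here, while imputation keeps every row at the cost of inserting estimated values.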
5. Encoding Categorical Data
Most machine learning models operate on numerical data, so categorical variables must be converted to numbers before training.
Encoding Labels (e.g., Yes/No → 1/0)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
dataset['Purchased'] = label_encoder.fit_transform(dataset['Purchased'])
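To see which number each label received, the fitted encoder can be inspected. The sketch below uses a made-up list of Yes/No labels. LabelEncoder assigns integers in sorted order of the class names, so "No" maps to 0 and "Yes" maps to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target labels for illustration.
labels = ["No", "Yes", "No", "Yes", "Yes"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

print(label_encoder.classes_)  # class names, in the order of their integer codes
print(encoded)                 # integer code for each input label

# inverse_transform maps the integer codes back to the original labels.
decoded = label_encoder.inverse_transform(encoded)
print(decoded)
```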
Encoding Categories into Dummy Variables
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
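# One-hot encode column 0 and pass the other columns through unchanged.
# sparse_threshold=0 forces a dense result, so np.array() yields a regular 2-D array.
ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
dataset = np.array(ct.fit_transform(dataset))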
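A small standalone example may make the dummy-variable layout easier to picture. The country names below are made up for illustration. Each category becomes its own 0/1 column, and each row has a 1 in exactly one of them.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three distinct values.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories learned from the data, in sorted order
print(encoded)                 # one row per sample, one column per category
```

Unlike label encoding, this representation does not imply any ordering between the categories.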
6. Splitting Dataset into Training and Test Sets
To evaluate the model properly, we split the dataset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset[:, :-1], dataset[:, -1], test_size=0.2, random_state=42)
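When the target classes are imbalanced, a purely random split can leave the test set with very few examples of the rarer class. One common option, not used in the snippet above, is the stratify argument of train_test_split. It keeps the class proportions roughly equal in both parts. The arrays below are synthetic and exist only for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, 15 of class 0 and 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the samples for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # preserve the class ratio in both parts
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))  # class counts in each part
```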
7. Feature Scaling
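Feature scaling puts all variables on a comparable scale, so that features with large numeric ranges do not dominate models that depend on distances or gradients. The scaler is fitted on the training set only and then applied to the test set, which keeps information from the test data out of training.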
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
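The effect of standardization can be checked directly. In this sketch, with made-up age and salary values, each training column ends up with mean 0 and standard deviation 1. The test row is scaled with the statistics learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is an age, column 1 is a salary.
X_train = np.array([[20.0, 30000.0],
                    [30.0, 50000.0],
                    [40.0, 70000.0]])
X_test = np.array([[30.0, 50000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(scaler.mean_)                 # per-column means learned from X_train
print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```

Because the test row equals the training means, it maps to zeros after scaling.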
Conclusion
Preparing the data is a crucial stage in the machine learning process. It ensures that raw data is transformed into a structured, clean, and optimized format for better model accuracy and efficiency. By following these steps—handling missing data, encoding categorical variables, splitting data, and scaling features—we prepare datasets for building robust machine learning models.