Data Preprocessing

Why? #

Engineer the data so that faster and simpler models can be built.

Steps,

  • Clean
  • Format

Data Quality issues #

  • Missing values - Values not present, e.g. due to a user's unwillingness to share, non-mandatory fields, not-applicable information, or network and storage errors
  • Duplicate data - Redundant data objects, e.g. a customer added a new address, or customers sharing an address
  • Inconsistent/Invalid data - Impossible values, e.g. negative age or income, invalid zip code; data entry errors
  • Noise - Meaningless or invalid data, e.g. special characters
  • Outliers - Data points that deviate considerably from the general behavior of the data

Data Wrangling #

  • Feature selection - Adding or removing features
  • Feature transformation - Scaling, dimensionality reduction

Handling Missing values #

Impact #

  • Many Python library functions (e.g. most scikit-learn estimators) cannot handle missing values
  • Incorrect imputation may distort the variable's distribution
  • Affects model performance

Missingness Mechanism #

Understanding why the data is missing helps decide the imputation logic.

Types,

  • Missing completely at random (MCAR) - Probability of an instance missing a value does not depend on the known values or on the missing data
  • Missing at Random (MAR) - Probability of a missing value depends on the known values but not on the missing data itself
  • Not missing at Random (NMAR/MNAR) - Probability of a missing value could depend on the value of the missing data itself or the target variable

Missing completely at random (MCAR) #

  • Probability of a value being missing is the same for all observations
  • No relationship b/w the missing values and the other attributes of the dataset
  • Dropping such records will not bias inference

Missing at Random (MAR) #

  • Missing data is dependent on other attributes; values can be imputed based on those attributes

Not missing at Random (NMAR) #

  • Missing values exist as an indication of the target class

Imputation Techniques for numeric values #

  • Numerical imputation - Mean/median imputation, random sampling imputation, arbitrary value imputation, imputation with values at the end of the distribution
  • K-Nearest Neighbours (other variations - Fuzzy kNN, Sequential-kNN)
  • Singular value decomposition (SVD)
  • Bayesian Principal component analysis (bPCA)
  • Multiple imputations by chained equations (MICE)
  • Expectation-Maximization (EM)
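As an illustrative sketch of the kNN approach (assuming scikit-learn is available; the matrix is a made-up toy example), `KNNImputer` fills each gap from the nearest complete rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with one missing entry.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [np.nan, 6.0],
              [4.0, 8.0]])

# Replace each missing value with the mean of that feature over the
# k nearest rows (distances computed on the observed features only).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```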

Imputation Techniques for categorical values #

  • Imputation by Mode

Note: Adding a new attribute to indicate missingness is a common practice.
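A minimal pandas sketch of mode imputation combined with a missingness indicator (toy data, assuming pandas is available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

# Flag missingness first, then fill with the most frequent category.
df["color_missing"] = df["color"].isna()
df["color"] = df["color"].fillna(df["color"].mode()[0])
```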

Handling Duplicate data #

  • Delete duplicate data
  • Merge duplicate data

External sources can be used to identify correct data.
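For exact duplicates, pandas `drop_duplicates` is the usual tool (toy data below; merging near-duplicates such as differently spelled addresses needs domain logic):

```python
import pandas as pd

# Two identical records for the same customer, plus one other customer.
df = pd.DataFrame({
    "customer": ["alice", "alice", "bob"],
    "address":  ["1 Main St", "1 Main St", "2 Oak Ave"],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()
```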

Handling Invalid data #

  • Use external knowledge bases or paid services
  • Domain knowledge and common reasoning to come up with a reasonable estimate

Handling Noise #

  • Filter out noise
  • May result in partial loss of data if not done carefully

Handling Outliers #

  • Linear Regression, K-Nearest Neighbours and Adaboost are sensitive to outliers

  • Significantly skews distribution of data

  • Identified using summary stats and plots of data

  • In a normal distribution, flag points beyond +/- 3 standard deviations

  • In skewed distributions, use the IQR
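A sketch of both rules on a toy array. Note that on a sample this small the 3-standard-deviation rule can miss the outlier, because the outlier itself inflates the mean and standard deviation; the IQR rule is more robust:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is the outlier

# Rule for roughly normal data: flag points beyond +/- 3 standard deviations.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# Rule for skewed data: flag points outside 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```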

If the median/mode lie to the left of the mean, the distribution has a +ve skew.

Mean/Median Imputation

  • Suitable for MCAR and MAR
  • Assumes the feature is normally distributed, e.g. age
  • The mean is sensitive to outliers; use the median in that case

Pros

  • Easy to implement
  • Easier to obtain complete dataset

Cons

  • Reduces variance of the feature
  • Does not preserve relationships b/w features, i.e. correlation and covariance
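A toy pandas sketch showing both the robustness of the median and the variance-shrinking drawback:

```python
import numpy as np
import pandas as pd

age = pd.Series([22, 25, np.nan, 30, 95, np.nan])

# The median is robust to the outlier (95); the mean would be pulled upward.
# Filling every gap with one constant shrinks the feature's variance.
filled = age.fillna(age.median())
```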

Random Sampling Imputation

  • Suitable for MCAR and MAR
  • Preserves the statistical parameters of the feature (mean and variance are not distorted)
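A pandas sketch (toy series, fixed seed) that draws the replacement values at random from the observed distribution:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Draw replacements at random from the observed values,
# so the feature's distribution is not distorted.
observed = s.dropna()
fill = observed.sample(n=int(s.isna().sum()), replace=True, random_state=0)
fill.index = s.index[s.isna()]
s_filled = s.fillna(fill)
```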

Adding new feature

  • Suitable for MCAR and MAR
  • Adds a new feature that flags whether the value was missing

Imputation by tail values

  • Used for NMAR
  • Sort the feature values and use the tail values for the missing values
  • aka end-of-distribution imputation
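A sketch for a roughly normal feature; mean plus three standard deviations is one common choice of tail value (toy data, assuming pandas is available):

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, 25.0, np.nan, 30.0, 35.0])

# Fill with a value at the far end of the distribution so that
# imputed rows stand apart from the normal range (useful under NMAR).
tail_value = s.mean() + 3 * s.std()
s_filled = s.fillna(tail_value)
```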

Imputation by arbitrary values

  • Used for NMAR
  • Choose an arbitrary value for the feature, excluding the mean or median

Imputation for categorical values - Impute by Mode

  • Used for NMAR

Encode missing value as a category

Aggregation #

  • Reduces variability in the dataset and random noise
  • Combining two or more attributes (e.g. height + weight = BMI) or two or more objects (e.g. grouping all males together)

Purpose #

  • Data reduction
  • Change of scale, e.g. aggregating individual people into male and female groups
  • More stable data (reduces random noise)

Methods,

  • Continuous - mean, min, max
  • Discrete - count, % of counts
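A toy pandas aggregation combining objects (grouping by gender) with a continuous and a discrete summary:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "income": [40, 50, 60, 70],
})

# Mean for the continuous feature, count for the discrete group size.
agg = df.groupby("gender").agg(mean_income=("income", "mean"),
                               n=("income", "count"))
```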

Sampling #

  • Processing big datasets is expensive
  • A representative sample is chosen to build model

A representative sample has approximately the same properties as the original population.

Simple random sampling #

  • Sampling w/ replacement
  • Sampling w/o replacement - prefer this when the dataset is small
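Both variants in pandas (toy frame, fixed seed):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Without replacement: each row can appear at most once.
without = df.sample(n=10, replace=False, random_state=0)

# With replacement: the same row may be drawn more than once.
with_rep = df.sample(n=10, replace=True, random_state=0)
```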

Stratified sampling #

  • When class imbalance is present
  • Group by class, then sample from each class so the sample's class distribution matches the population's
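A sketch using pandas `groupby(...).sample` on a toy imbalanced frame; drawing the same fraction from every class keeps the class proportions:

```python
import pandas as pd

df = pd.DataFrame({"label": ["pos"] * 10 + ["neg"] * 90})

# Sample the same fraction from every class so the sample
# preserves the population's class distribution.
strat = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)
```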


  • Label encoding
  • Target encoding
  • Outlier detection - DBSCAN