Data Preprocessing

Why? #

Engineer the data so that faster and simpler models can be built.

Steps,

  • Clean
  • Format

Data Quality issues #

  • Missing values - Values not present, e.g. due to a user's unwillingness to share, non-mandatory fields, not-applicable information, or network and storage errors
  • Duplicate data - Redundant data objects, e.g. a customer added a new address, or customers sharing an address
  • Inconsistent/Invalid data - Impossible values, e.g. negative age or income, invalid zip code; data entry errors
  • Noise - Meaningless or invalid data, e.g. special characters
  • Outliers - Data points that deviate considerably from the general behavior of the data

Data Wrangling #

  • Feature selection - Adding or removing features
  • Feature transformation - Scaling, dimensionality reduction

Handling Missing values #

Impact #

  • Many Python library functions (e.g. most scikit-learn estimators) cannot handle missing values
  • Incorrect imputation may distort the variable's distribution
  • Affects model performance

Missingness Mechanism #

Understanding why the data is missing helps decide the imputation logic.

Types,

  • Missing completely at random (MCAR) - Probability of an instance missing a value does not depend on the known values or on the missing data
  • Missing at Random (MAR) - Probability of a missing value depends on the known values but not on the missing data itself
  • Not missing at Random (NMAR/MNAR) - Probability of a missing value could depend on the value of the missing data itself or the target variable

Missing completely at random (MCAR) #

  • Probability of a value being missing is the same for all observations
  • No relationship b/w the missing values and the other attributes of the dataset
  • Dropping such records will not bias inference

Missing at Random (MAR) #

  • Missing data is dependent on other attributes; values can be imputed based on those attributes

Not missing at Random (NMAR) #

  • Missing values exist as an indication of the target class

Imputation Techniques for numeric values #

  • Numerical imputation - Mean/median imputation, random sampling imputation, arbitrary value imputation, imputation with values at the end of the distribution
  • K-Nearest Neighbours (other variations - Fuzzy kNN, Sequential-kNN)
  • Singular value decomposition (SVD)
  • Bayesian Principal component analysis (bPCA)
  • Multiple imputations by chained equations (MICE)
  • Expectation-Maximization (EM)
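As an illustrative sketch of the kNN approach (assuming scikit-learn is available; the matrix is a made-up toy example), `KNNImputer` fills each gap from the nearest complete rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with one missing entry.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [np.nan, 6.0],
              [4.0, 8.0]])

# Replace each missing value with the mean of that feature over the
# k nearest rows (distances computed on the observed features only).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```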

Imputation Techniques for categorical values #

  • Imputation by Mode

Note: Adding a new attribute to indicate missingness is a common practice.
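A minimal pandas sketch of mode imputation combined with a missingness indicator (toy data, assuming pandas is available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

# Flag missingness first, then fill with the most frequent category.
df["color_missing"] = df["color"].isna()
df["color"] = df["color"].fillna(df["color"].mode()[0])
```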

Handling Duplicate data #

  • Delete duplicate data
  • Merge duplicate data

External sources can be used to identify correct data.
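For exact duplicates, pandas `drop_duplicates` is the usual tool (toy data below; merging near-duplicates such as differently spelled addresses needs domain logic):

```python
import pandas as pd

# Two identical records for the same customer, plus one other customer.
df = pd.DataFrame({
    "customer": ["alice", "alice", "bob"],
    "address":  ["1 Main St", "1 Main St", "2 Oak Ave"],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()
```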

Handling Invalid data #

  • Use external knowledge bases or paid services
  • Domain knowledge and common reasoning to come up with a reasonable estimate

Handling Noise #

  • Filter out noise
  • May result in partial loss of data if not done carefully

Handling Outliers #

  • Linear Regression, K-Nearest Neighbours and Adaboost are sensitive to outliers

  • Significantly skews distribution of data

  • Identified using summary stats and plots of data

  • In a normal distribution, flag points beyond +/- 3 standard deviations

  • In skewed distributions, use the IQR
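A sketch of both rules on a toy array. Note that on a sample this small the 3-standard-deviation rule can miss the outlier, because the outlier itself inflates the mean and standard deviation; the IQR rule is more robust:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is the outlier

# Rule for roughly normal data: flag points beyond +/- 3 standard deviations.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# Rule for skewed data: flag points outside 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```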

If the median/mode lie to the left of the mean, the distribution has a +ve skew.

Mean/Median Imputation

  • Suitable for MCAR and MAR
  • Assumes the feature is normally distributed, e.g. age
  • The mean is sensitive to outliers; use the median in that case

Pros

  • Easy to implement
  • Easier to obtain complete dataset

Cons

  • Reduces variance of the feature
  • Does not preserve relationships b/w features, i.e. correlation and covariance
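A toy pandas sketch showing both the robustness of the median and the variance-shrinking drawback:

```python
import numpy as np
import pandas as pd

age = pd.Series([22, 25, np.nan, 30, 95, np.nan])

# The median is robust to the outlier (95); the mean would be pulled upward.
# Filling every gap with one constant shrinks the feature's variance.
filled = age.fillna(age.median())
```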

Random Sampling Imputation

  • Suitable for MCAR and MAR
  • Preserves the statistical parameters of the feature (mean and variance are not distorted)
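A pandas sketch (toy series, fixed seed) that draws the replacement values at random from the observed distribution:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Draw replacements at random from the observed values,
# so the feature's distribution is not distorted.
observed = s.dropna()
fill = observed.sample(n=int(s.isna().sum()), replace=True, random_state=0)
fill.index = s.index[s.isna()]
s_filled = s.fillna(fill)
```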

Adding new feature

  • Suitable for MCAR and MAR
  • Adds a new feature that flags whether the value was missing

Imputation by tail values

  • Used for NMAR
  • Sort the feature values and use the tail values for the missing values
  • aka end-of-distribution imputation
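A sketch for a roughly normal feature; mean plus three standard deviations is one common choice of tail value (toy data, assuming pandas is available):

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, 25.0, np.nan, 30.0, 35.0])

# Fill with a value at the far end of the distribution so that
# imputed rows stand apart from the normal range (useful under NMAR).
tail_value = s.mean() + 3 * s.std()
s_filled = s.fillna(tail_value)
```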

Imputation by arbitrary values

  • Used for NMAR
  • Choose an arbitrary value for the feature, excluding the mean or median

Imputation for categorical values - Impute by Mode

  • Used for NMAR

Encode missing value as a category

Aggregation #

  • Reduces variability in the dataset and random noise
  • Combining two or more attributes (e.g. height + weight = BMI) or two or more objects (e.g. grouping all males together)

Purpose #

  • Data reduction
  • Change of scale, e.g. aggregating individual people into male and female groups
  • More stable data (reduces random noise)

Methods,

  • Continuous - mean, min, max
  • Discrete - count, % of counts
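A toy pandas aggregation combining objects (grouping by gender) with a continuous and a discrete summary:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "income": [40, 50, 60, 70],
})

# Mean for the continuous feature, count for the discrete group size.
agg = df.groupby("gender").agg(mean_income=("income", "mean"),
                               n=("income", "count"))
```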

Sampling #

  • Processing big datasets is expensive
  • A representative sample is chosen to build model

A representative sample has approximately the same properties as the original population.

Simple random sampling #

  • Sampling w/ replacement
  • Sampling w/o replacement - prefer this when the dataset is small
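Both variants in pandas (toy frame, fixed seed):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Without replacement: each row can appear at most once.
without = df.sample(n=10, replace=False, random_state=0)

# With replacement: the same row may be drawn more than once.
with_rep = df.sample(n=10, replace=True, random_state=0)
```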

Stratified sampling #

  • When class imbalance is present
  • Group by class, then sample from each class so the sample's class distribution matches the population's
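A sketch using pandas `groupby(...).sample` on a toy imbalanced frame; drawing the same fraction from every class keeps the class proportions:

```python
import pandas as pd

df = pd.DataFrame({"label": ["pos"] * 10 + ["neg"] * 90})

# Sample the same fraction from every class so the sample
# preserves the population's class distribution.
strat = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)
```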


  • Label encoding
  • Target encoding
  • Outlier detection - DBSCAN