Why? #
Engineer the data so that it is suitable for building faster and simpler models.
Steps,
- Clean
- Format
Data Quality issues #
- Missing values - Values not present, e.g. user's unwillingness to share, non-mandatory fields, not-applicable info, network & storage errors
- Duplicate data - Redundant data objects, e.g. a customer added a new address, customers sharing an address
- Inconsistent/Invalid data - Impossible values, e.g. negative age or income, invalid zip code; often caused by data entry errors
- Noise - Meaningless or invalid data, e.g. special characters
- Outliers - Data points that deviate considerably from the general behavior of the data
Data Wrangling #
- Feature selection - Adding or removing features
- Feature transformation - Scaling, dimensionality reduction
Handling Missing values #
Impact #
- Python libraries (e.g. scikit-learn) are incompatible with missing values (see the sketch below)
- Incorrect imputation may distort a variable's distribution
- Affects the performance of the model
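A minimal sketch of the first point, assuming scikit-learn is available: most estimators raise an error on NaN, so values must be imputed first (the toy column is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [40.0], [35.0]])

# Most scikit-learn estimators raise a ValueError on NaN input,
# so impute before fitting; here, simple mean imputation
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X).ravel())  # [25.  33.33 40.  35. ] (approx)
```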
Missingness Mechanism #
Finding out why the data is missing helps decide the imputation logic.
Types,
- Missing completely at random (MCAR) - The probability of an instance missing a value does not depend on the known values or on the missing data
- Missing at Random (MAR) - The probability of a missing value depends on the known values but not on the missing data itself
- Not missing at Random (NMAR/MNAR) - The probability of a missing value could depend on the value of the variable itself
Missing completely at random (MCAR) #
- Probability of data missing is same for all observations.
- No relationship between missing values and other attributes of the dataset
- Dropping such records will not affect inference
Missing at Random (MAR) #
- Missing data is dependent on other attributes; values can be imputed based on those other attributes
Not missing at Random (NMAR) #
- Missing values exist as an indication of the target class
Imputation Techniques for numeric values #
- Numerical imputation - Mean/Median imputation, Random sampling imputation, Arbitrary value imputation, Using values at the end of the distribution
- K-Nearest Neighbours (other variations - Fuzzy kNN, Sequential kNN; sketched below)
- Singular Value Decomposition (SVD)
- Bayesian Principal Component Analysis (bPCA)
- Multiple Imputation by Chained Equations (MICE; sketched below)
- Expectation-Maximization (EM)
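A minimal sketch of two of these, assuming scikit-learn: KNNImputer for the kNN approach, and IterativeImputer, scikit-learn's MICE-inspired imputer (the toy matrix is made up):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
    [np.nan, 8.0],
])

# kNN imputation: fill a gap with the average of the k most
# similar rows (similarity measured on the observed features)
print(KNNImputer(n_neighbors=2).fit_transform(X))

# MICE-style: model each feature from the others and iterate
print(IterativeImputer(random_state=0).fit_transform(X))
```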
Imputation Techniques for categorical values #
- Imputation by Mode
Note: Adding a new attribute to indicate missingness is a common practice
Handling Duplicate data #
- Delete duplicate data
- Merge duplicate data
External sources can be used to identify correct data.
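A small pandas sketch of both options on made-up customer records (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["alice", "alice", "bob"],
    "address": ["1 Main St", "2 Oak Ave", "1 Main St"],
})

# Exact duplicates: delete, keeping the first occurrence
df_unique = df.drop_duplicates()

# Merge-style handling: keep only the latest record per customer
latest = df.drop_duplicates(subset="customer", keep="last")
print(latest)
```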
Handling Invalid data #
- Use external knowledge bases or paid services
- Use domain knowledge and common reasoning to come up with a reasonable estimate
Handling Noise #
- Filter out the noise (a regex-based sketch follows)
- May result in partial loss of data if not done carefully
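For instance, stripping special characters with a regular expression; this is only one possible filter, and the example deliberately shows how data can be lost:

```python
import pandas as pd

names = pd.Series(["jo#hn!", "ma ry", "s@m"])

# Strip special characters; note "s@m" becomes "sm", i.e. careless
# filtering can silently lose part of the data
cleaned = names.str.replace(r"[^A-Za-z ]", "", regex=True)
print(cleaned.tolist())  # ['john', 'ma ry', 'sm']
```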
Handling Outliers #
- Linear Regression, K-Nearest Neighbours and AdaBoost are sensitive to outliers
- Outliers significantly skew the distribution of the data
- Identified using summary statistics and plots of the data
- In a normal distribution, ignore values beyond +/- 3 standard deviations
- In a skewed distribution, use the IQR (both checks are sketched below)
Median/Mode to the left of the mean indicates a +ve skew.
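A minimal sketch of both detection rules with NumPy on made-up data (the values and the 1.5 * IQR fence multiplier are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 200), [120.0]])  # 120 is an outlier

# Roughly normal data: flag points beyond +/- 3 standard deviations
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])  # [120.]

# Skewed data: flag points outside the 1.5 * IQR fences
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```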
Mean/Median Imputation #
- Used for MCAR and MAR
- Assumes the feature is normally distributed, e.g. age
- Mean is sensitive to outliers; use the median in that case (sketched after the cons)
Pros
- Easy to implement
- Easier to obtain a complete dataset
Cons
- Reduces the variance of the feature
- Does not preserve relationships between features, i.e. correlation and covariance
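A pandas sketch on a made-up age column; the outlier (120) pulls the mean to 55.0 while the median stays at 37.5:

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 40, np.nan, 35, np.nan, 120])

# Mean (55.0) is pulled up by the outlier 120; median (37.5) is robust
print(age.fillna(age.mean()).tolist())
print(age.fillna(age.median()).tolist())
```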
Random Sampling Imputation #
- Used for MCAR and MAR
- Preserves the statistical parameters of the feature (mean and variance are not distorted)
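One way to sketch this in pandas, on made-up values; gaps are filled with random draws from the observed values:

```python
import numpy as np
import pandas as pd

s = pd.Series([25.0, 40.0, np.nan, 35.0, np.nan, 30.0])

# Draw random replacements from the observed values so the mean
# and variance of the feature stay approximately undistorted
fill = s.dropna().sample(n=s.isna().sum(), replace=True, random_state=0)
fill.index = s[s.isna()].index  # align the draws with the gaps
print(s.fillna(fill).tolist())
```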
Adding a New Feature #
- Used for MCAR and MAR
- Adds a new binary feature indicating whether the value was missing
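A pandas sketch with a made-up income column; scikit-learn's SimpleImputer(add_indicator=True) offers the same idea built in:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000.0, np.nan, 62_000.0, np.nan]})

# Record where the value was missing, then impute the original column
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```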
Imputation by Tail Values #
- Used for NMAR
- Sort the feature values and use the tail values for the missing values
- aka end-of-distribution imputation
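A sketch using the common mean + 3 standard deviations rule for the tail value (the rule choice and the data are assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series([25.0, 30.0, np.nan, 35.0, 40.0, np.nan])

# Fill gaps with a value from the far tail of the distribution,
# e.g. mean + 3 standard deviations
print(s.fillna(s.mean() + 3 * s.std()).tolist())
```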
Imputation by Arbitrary Values #
- Used for NMAR
- Choose an arbitrary value for the feature, other than the mean or median
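A one-line sketch; -999 is an arbitrary, assumed sentinel that does not occur naturally in the feature:

```python
import numpy as np
import pandas as pd

s = pd.Series([25.0, 30.0, np.nan, 35.0, np.nan])

# A constant that cannot occur naturally lets the model treat
# "was missing" as its own signal
print(s.fillna(-999).tolist())
```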
Imputation for Categorical Values - Impute by Mode #
- Used for NMAR
- Encode missing value as a category
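A pandas sketch showing both options on a made-up color feature:

```python
import numpy as np
import pandas as pd

color = pd.Series(["red", "blue", np.nan, "red", np.nan])

# Impute with the most frequent category (the mode) ...
print(color.fillna(color.mode()[0]).tolist())

# ... or keep missingness visible as its own category
print(color.fillna("Missing").tolist())
```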
Aggregation #
- Reduces variability in the dataset and random noise
- Combining two or more attributes (e.g. height + weight = BMI) or two or more objects (e.g. grouping all males together)
Purpose #
- Data reduction
- Change of scale (e.g. individual people grouped into males and females)
- More stable data (reduces random noise)
Methods,
- Continuous - mean, min, max
- Discrete - count, % of counts
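A pandas sketch of both kinds of aggregation (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "height_cm": [180, 165, 175, 160],
    "weight_kg": [80, 60, 75, 55],
})

# Combine attributes: BMI = weight_kg / (height in metres)^2
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Combine objects: aggregate per group to reduce variability
print(df.groupby("gender")["bmi"].agg(["mean", "count"]))
```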
Sampling #
- Processing big datasets is expensive
- A representative sample is chosen to build the model
- A representative sample has approximately the same properties as the original population
Simple random sampling #
- Sampling w/ replacement - the same object can be picked more than once
- Sampling w/o replacement - each object can be picked at most once
- When the dataset is small, go for sampling w/o replacement
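Both variants via pandas' DataFrame.sample (toy data):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Without replacement: each row appears at most once
print(df.sample(n=5, replace=False, random_state=0))

# With replacement: rows may repeat (basis of bootstrapping)
print(df.sample(n=5, replace=True, random_state=0))
```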
Stratified sampling #
- Used when class imbalance is present
- Group by class, then pick samples so that every class is properly represented
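One common way to stratify, via scikit-learn's train_test_split with the stratify argument (labels are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "x": range(10),
    "label": [0] * 8 + [1] * 2,  # imbalanced: 80% / 20%
})

# stratify keeps the class proportions in both splits
train, test = train_test_split(
    df, test_size=0.5, stratify=df["label"], random_state=0
)
print(train["label"].value_counts())
print(test["label"].value_counts())
```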