Feature Selection #

Data sets that data scientists work with arrive from a multitude of sources. Hence, transforming data from these multiple sources into a common format is a mandatory step for deriving meaningful inferences. Data may also contain errors or deviations introduced at the source. We have to carefully consider how to handle such data points, since they affect the learning model being developed.

Extreme observations that influence model performance are called influential points. These extremities are referred to as outliers or novelties, depending on the context and their significance.

  1. Outlier Detection - Removing the extreme points so that they are ignored during model learning.
  2. Novelty Detection - A set of data points already exists, and we need to determine whether a new observation fits into their distribution. Anomaly detection and financial fraud detection come under novelty detection.
  • Select uncorrelated features, to avoid feeding the model redundant information

Feature search strategy #

  • Optimum - exhaustive search, evaluated with supervised or unsupervised criteria
  • Heuristic - forward selection, backward elimination
  • Randomized

Optimum Subset evaluation strategy #

  • Unsupervised
  • Supervised

Algorithms #

  • Filter methods (Unsupervised)
  • Wrapper Methods (Supervised)
  • Embedded Methods

Filter Methods #

  • Apply a statistical measure and score each attribute accordingly.
  • Based on the score, drop or keep the feature.
  • Either univariate (no dependency amongst features considered) or multivariate (taking dependencies between features into account)

e.g. chi-squared test, information gain, and correlation coefficient scores
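
As a concrete illustration, here is a minimal filter-method sketch, assuming scikit-learn's SelectKBest with the chi-squared test; the iris dataset and k=2 are arbitrary demonstration choices.

```python
# Filter method: score each feature against the target with a
# statistical test, then keep only the top-scoring features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("Chi-squared scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
```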

Wrapper methods #

  • Different subsets of features are chosen and compared after evaluation
  • A score is assigned based on model accuracy for each subset of features
  • Common search procedures: best-first search, hill climbing, backward and forward feature selection

E.g. Recursive feature elimination algorithm
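
A minimal wrapper-method sketch, assuming scikit-learn's RFE with a logistic-regression estimator; the dataset and the choice to keep 2 features are illustrative.

```python
# Wrapper method: repeatedly fit the model and drop the weakest
# feature until the requested number of features remains.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=2)
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
```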

Embedded methods #

  • Learn which features contribute most to model accuracy while the model is being built
  • Common methods are regularization methods
  • Regularization methods (also called penalization methods) introduce additional constraints into the model's optimization

E.g. Lasso, Ridge and Elastic Net
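
A minimal embedded-method sketch, assuming scikit-learn's Lasso: the L1 penalty drives the coefficients of uninformative features to exactly zero, so selection happens as part of model fitting.

```python
# Embedded method: L1 regularization (Lasso) zeroes out the
# coefficients of features that do not help the model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only 3 of 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

print("Coefficients:", np.round(lasso.coef_, 2))
print("Kept features:", np.flatnonzero(lasso.coef_))
```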

Outlier Detection #

The following methods can be used to detect outliers.

Univariate analysis #

  • Pearson correlation coefficient
  • F-score
  • Chi-squared test
  • Signal-to-noise ratio
  • Mutual information

Box-Plot #
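
A minimal sketch, assuming matplotlib: the box spans the quartiles, the whiskers extend 1.5×IQR beyond them, and points outside the whiskers are drawn individually as outliers.

```python
# Box plot: points beyond the whiskers (1.5 * IQR) show up as
# individual markers, making outliers easy to spot visually.
import matplotlib.pyplot as plt

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13]

plt.boxplot(data)
plt.title("Box plot: the point at 102 is drawn as an outlier")
plt.show()
```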

Scatter Plot #

IQR Score #
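
A minimal sketch of the IQR rule with NumPy; the sample data and the conventional 1.5×IQR fences are illustrative.

```python
# IQR score: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Outliers:", data[(data < lower) | (data > upper)])  # -> [102]
```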

Cook’s Distance #
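
A minimal sketch, assuming statsmodels: Cook's distance measures how much deleting an observation would change the fitted OLS coefficients; the 4/n cutoff used here is a common rule of thumb, not a hard threshold.

```python
# Cook's distance: influence of each observation on an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(0, 1, size=50)
y[0] += 30                      # make one observation highly influential

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = model.get_influence().cooks_distance

print("Influential indices:", np.flatnonzero(cooks_d > 4 / len(x)))
```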

Z-Score #
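
A minimal z-score sketch with NumPy; the 3-standard-deviation threshold is conventional but data-dependent.

```python
# Z-score: flag points more than 3 standard deviations from the mean.
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

z = (data - data.mean()) / data.std()
print("Outliers:", data[np.abs(z) > 3])  # -> [102]
```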

Multivariate analysis #

Principal Component Analysis (PCA) #
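
A minimal sketch of PCA-based multivariate detection: project onto the leading component, reconstruct, and flag points with a large reconstruction error. The synthetic, strongly correlated two-feature data is illustrative; the flagged point is unremarkable in either coordinate alone but violates the joint structure.

```python
# PCA reconstruction error as a multivariate outlier score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=200)
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, size=200)])
X[0] = [1.0, -2.0]   # fine per coordinate, but breaks the correlation

pca = PCA(n_components=1).fit(X)
error = np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)

print("Most anomalous index:", np.argmax(error))  # -> 0
```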

Local Outlier Factor (LOF) #
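
A minimal sketch, assuming scikit-learn's LocalOutlierFactor: points whose local density is much lower than that of their neighbours get the label -1.

```python
# Local Outlier Factor: density-based multivariate outlier detection.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),  # dense inlier cluster
               [[8.0, 8.0], [-7.0, 9.0]]])       # two isolated points

labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)  # -1 = outlier
print("Outlier indices:", np.flatnonzero(labels == -1))
```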

High Contrast Subspaces for Density-Based Outlier Ranking (HiCS) #

Handling Outliers #

Drop Outlier #

Impute with Mean/Median #

Winsorizing #
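
A minimal sketch, assuming SciPy's winsorize: the most extreme values in each tail are clipped to the nearest retained value rather than dropped, so the sample size is preserved.

```python
# Winsorizing: clip the lowest and highest 10% of values to the
# nearest retained value instead of removing them.
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])
print(winsorize(data, limits=[0.1, 0.1]))  # 102 clipped down to 15,
                                           # 10 clipped up to 11
```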

Log-Scale Transformation #

Binning #

Using different models #

Tree-based models, such as random forests, are less impacted by outliers. These models split the data set into distinct, non-overlapping regions and compute the residual errors for each region, so an extreme value mainly affects its own region.
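
A small sketch of that behaviour with a single decision tree: an extreme target value mainly shifts the prediction in the region that contains it.

```python
# Tree-based models localize the effect of an outlier to its region.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)
y = X.ravel().copy()
y[-1] = 1000.0                    # one extreme target value

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[5.0]]))      # typical region: barely affected
print(tree.predict([[19.0]]))     # the outlier's own region absorbs it
```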

Using different loss functions to measure model performance #

A truncated loss function caps the loss contribution of any single observation, so extreme residuals produced by outliers cannot dominate training.
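
A minimal sketch of a truncated squared loss in NumPy; the cap value is arbitrary here and would normally be tuned.

```python
# Truncated squared loss: per-sample loss is capped so one outlier
# cannot dominate the total training loss.
import numpy as np

def truncated_squared_loss(y_true, y_pred, cap=9.0):
    return np.minimum((y_true - y_pred) ** 2, cap)

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.0])
print(truncated_squared_loss(y_true, y_pred))  # outlier capped at 9.0
```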

Novelty Detection #
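
A minimal novelty-detection sketch, assuming scikit-learn's LocalOutlierFactor with novelty=True: the model is fit on training data assumed to be outlier-free, then judges whether new observations fit the learned distribution.

```python
# Novelty detection: fit on clean data, then score unseen points.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))   # assumed outlier-free

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

X_new = np.array([[0.1, -0.2],   # resembles the training data
                  [6.0, 6.0]])   # far outside the distribution
print(lof.predict(X_new))        # 1 = fits, -1 = novelty
```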
