Data Transformation #
- Normalization
- Attribute Construction
- Aggregation
- Attribute subset selection
- Discretization
- Generalization
Normalization #
- Regression coefficients depend on the magnitude of the feature variables
- Features with larger magnitudes overshadow the smaller ones
- The Euclidean distance measure is sensitive to larger magnitudes
Algorithms Sensitive to Magnitude #
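
A minimal sketch of the distance point above, using made-up (age, income) records. The values and the helper name `euclidean` are assumptions chosen for illustration only.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (age in years, income in dollars)
p = (25, 50_000)
q = (60, 50_500)   # very different age, similar income
r = (26, 52_000)   # almost the same age, slightly different income

d_pq = euclidean(p, q)
d_pr = euclidean(p, r)

# Income is on a much larger scale, so it dominates the distance:
# q looks "closer" to p than r does, despite the 35-year age gap.
print(d_pq < d_pr)
```

Without normalization, the age feature contributes almost nothing to the result.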
- Linear and logistic regression
- Neural networks
- Support vector machines
- kNN
- K-Means clustering
- Linear discriminant analysis (LDA)
- Principal component analysis (PCA)
Types #
- Min-Max normalization
- Z-Score normalization
- Decimal scaling normalization
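
The three types above can be sketched in plain Python as follows. The function names are illustrative, the z-score uses the population standard deviation (an assumption; some texts use the sample version), and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:                      # constant feature
        return [0.0] * n
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]
```

For example, `decimal_scaling([-986, 917])` divides by 1000, since 986 has three digits.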
Discretization #
- Converting continuous values into discrete values
- How do you choose the number of partitions/bins?
- Where do you put the cut points?
Both depend on the problem being studied.
Unsupervised Discretization #
- Class label info not used
- Bins decided by experiment
Strategies,
- Equal interval binning - Outliers can skew the bin boundaries
- Equal frequency binning - Identical values may be split across different bins
- Binning by clustering
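
A rough sketch of the first two strategies, returning a bin index (0 to k-1) for each value. The function names and the rank-based split used for equal frequency are assumptions, not a reference implementation.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:                    # all values identical
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins by rank so each bin holds roughly len(values) / k items."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

data = [1, 2, 3, 4, 5, 100]           # hypothetical data with one outlier
width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)
tied_bins = equal_frequency_bins([7, 7, 7, 7], 2)
```

With this data the outlier stretches the equal-width range so that every other value falls into the first bin, while the rank-based split on tied values puts identical values into different bins. These are the two caveats listed above.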
Heuristics,
- The number of intervals should not be smaller than the number of classes
- Number of intervals = number of samples / (3 * number of classes)
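
One possible way to combine the two heuristics, reading the second one as a rule for the number of intervals (an interpretation) and using the number of classes as a lower bound. The function name and the integer rounding are assumptions.

```python
def suggested_intervals(n_samples, n_classes):
    """Rule-of-thumb interval count: n_samples / (3 * n_classes),
    but never fewer intervals than there are classes."""
    by_rule = n_samples // (3 * n_classes)   # rounded down (an assumption)
    return max(by_rule, n_classes)
```

For 150 samples and 3 classes the rule gives 150 / 9, rounded down to 16 intervals. For only 20 samples the rule alone would give 2, so the lower bound of 3 applies instead.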
Supervised discretization #
- Class label info is used
- Entropy based discretization
- Maximize the purity of each bin, i.e. each bin should contain as little class mixture as possible
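
A minimal sketch of a single entropy-based split: try each boundary between distinct sorted values and keep the one with the lowest weighted class entropy across the two resulting bins. The function names are illustrative. A full method would apply this recursively with a stopping criterion, which is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the purest binary split."""
    # Sort by value; ties fall back to comparing labels, so labels must be orderable.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                  # cannot cut between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint as cut
            best_score = score
    return best_cut, best_score

# Hypothetical data: low values are class "a", high values are class "b".
cut, score = best_cut_point([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

Here the midpoint between 3 and 10 separates the two classes completely, so both bins are pure and the weighted entropy is zero.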