Data Transformation

  • Normalization
  • Attribute Construction
  • Aggregation
  • Attribute subset selection
  • Discretization
  • Generalization

Normalization #

  • Regression coefficients depend on the magnitude of the feature variables
  • Features with larger magnitudes overshadow those with smaller magnitudes
  • Euclidean distance is sensitive to features with larger magnitudes
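
A small sketch of the distance problem, using made-up records (age in years, income in dollars) and assumed value ranges for rescaling. None of these numbers come from the notes; they only illustrate how the larger-magnitude feature can dominate the distance.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (age, income)
p = (25, 50_000)
q = (60, 50_500)  # very different age, similar income
r = (26, 52_000)  # similar age, somewhat different income

# On raw values, income dominates: q looks closer to p than r does.
raw_pq = euclidean(p, q)
raw_pr = euclidean(p, r)

# Assumed ranges for rescaling each feature to [0, 1].
AGE_RANGE = (0, 100)
INCOME_RANGE = (0, 100_000)

def rescale(record):
    """Min-max rescale a (age, income) record using the assumed ranges."""
    age, income = record
    return (
        (age - AGE_RANGE[0]) / (AGE_RANGE[1] - AGE_RANGE[0]),
        (income - INCOME_RANGE[0]) / (INCOME_RANGE[1] - INCOME_RANGE[0]),
    )

# After rescaling, both features contribute and r becomes the nearer record.
scaled_pq = euclidean(rescale(p), rescale(q))
scaled_pr = euclidean(rescale(p), rescale(r))
```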

Algorithms Sensitive to Magnitude #

  • Linear and logistic regression
  • Neural networks
  • Support vector machines
  • kNN
  • K-Means clustering
  • Linear discriminant analysis (LDA)
  • Principal component analysis (PCA)

Types #

  • Min-Max normalization
  • Z-Score normalization
  • Decimal scaling normalization

Discretization #

  • Converting continuous values into discrete values
  • How do you choose the number of partitions/bins?
  • Where do you put the cut points?

Both choices depend on the problem being studied.

Unsupervised Discretization #

  • Class label info not used
  • Number of bins is decided by experiment

Strategies:

  • Equal interval binning - outliers might skew the bins
  • Equal frequency binning - the same value might fall into different bins
  • Binning by clustering
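
A minimal sketch of the first two strategies, with made-up data chosen to show the weakness noted for each. The function names and the rank-based way of splitting equal-frequency bins are assumptions for illustration; binning by clustering is not covered here.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices by rank so each bin holds about len(values) / k items."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // n, k - 1)
    return bins

# One outlier (100) stretches the range: eight values share bin 0, bin 1 is empty.
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(with_outlier, 3)

# Equal frequency spreads the same data evenly, three values per bin.
freq_bins = equal_frequency_bins(with_outlier, 3)

# Repeated values can be split: the value 5 ends up in both bin 0 and bin 1.
repeated = [5, 5, 5, 5, 9, 9]
split_bins = equal_frequency_bins(repeated, 3)
```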

Heuristics:

  • Number of intervals should not be smaller than the number of classes
  • Number of intervals = number of samples / (3 * number of classes)
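
As a rough sketch, the two heuristics can be combined into one helper. This reads the second heuristic as giving the number of intervals, and using the first as a lower bound on the second is an assumption made here, not something the notes state. The function name is hypothetical.

```python
def suggested_num_intervals(n_samples, n_classes):
    """Samples / (3 * classes), rounded down, but never below the class count."""
    by_ratio = n_samples // (3 * n_classes)
    return max(n_classes, by_ratio)

# 150 samples, 3 classes: 150 // 9 = 16 intervals.
many_samples = suggested_num_intervals(150, 3)

# 20 samples, 4 classes: the ratio gives 1, so the class count (4) is used instead.
few_samples = suggested_num_intervals(20, 4)
```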

Supervised Discretization #

  • Class label info is used
  • Entropy based discretization
  • Maximize the purity of each bin, i.e. each bin should contain as little class mixture as possible
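
The idea behind entropy based discretization can be sketched as a search for the single cut point that leaves the two resulting bins as pure as possible. This is a simplified illustration with assumed function names and made-up data; practical methods typically apply the split recursively and add a stopping criterion, which is omitted here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the binary split with the lowest entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weight each side's entropy by its share of the samples.
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best[1]:
            best = (cut, weighted)
    return best

# Two cleanly separated classes: the best cut falls midway between 3 and 10.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cut, weighted = best_cut_point(values, labels)
```

A weighted entropy of zero means both bins are pure; higher values mean more class mixture remains after the split.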