Table of Contents
- Handling Class Imbalance: Augmentation and Penalization
- Feature Selection: Dealing with High Dimensionality Problem
- Underfitting and Overfitting: The Bias-Variance Trade-Off
- Model Evaluation: Metrics
Short Summary: An overview of techniques for building a machine learning model.
Handling Class Imbalance: Augmentation and Penalization
- Undersample the majority class.
- Problematic since removing samples discards potentially useful information.
- Oversample the minority class (adding duplicate samples).
- Problematic since duplicated samples effectively carry extra weight, and the model can overfit to them.
- Augment the minority class with synthetic samples, e.g., SMOTE interpolates between a minority sample and its nearest minority-class neighbors.
- Problematic since adding synthetic data can introduce bias or complexity.
- Modify the loss function so that errors on the minority class are penalized more heavily (see the sketch after this list).
- A common scheme sets each class's weight to the size of the largest class divided by the size of that class.
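A minimal sketch of that weighting scheme, assuming scikit-learn and a hypothetical imbalanced label array: each class weight is the largest class size divided by that class's size, passed to the model's class_weight parameter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced binary labels: 900 negatives, 100 positives.
y = np.array([0] * 900 + [1] * 100)
X = np.random.randn(len(y), 5)  # placeholder features

# Weight of each class = size of the largest class / size of that class.
counts = np.bincount(y)
class_weight = {cls: counts.max() / count for cls, count in enumerate(counts)}
# -> {0: 1.0, 1: 9.0}: errors on the minority class are penalized 9x as much.

model = LogisticRegression(class_weight=class_weight)
model.fit(X, y)
```

Note that scikit-learn's built-in class_weight="balanced" option uses a slightly different formula, n_samples / (n_classes * class_count), which yields the same relative ordering of weights.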
Feature Selection: Dealing with High Dimensionality Problem
- Use Pearson correlation to drop features with low target-feature correlation.
- Use Pearson correlation to detect groups of highly inter-correlated features and keep only one representative from each group (see the sketch after this list).
- Use model-based feature selection:
- L1 regularization.
- Elastic net regression (L1 regularization + L2 regularization).
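A minimal sketch of both steps, assuming scikit-learn/pandas and synthetic data made up for illustration: a Pearson-correlation filter against the target followed by L1 (Lasso) model-based selection.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y = 3 * X["f0"] - 2 * X["f1"] + rng.normal(scale=0.1, size=200)

# Filter step: drop features whose |Pearson correlation| with the target is low.
target_corr = X.corrwith(y).abs()
X_filtered = X[target_corr[target_corr > 0.1].index]  # threshold is an arbitrary choice

# Model-based step: L1 regularization shrinks uninformative coefficients to zero.
lasso = Lasso(alpha=0.1).fit(X_filtered, y)
selected = X_filtered.columns[lasso.coef_ != 0]
print(list(selected))  # typically ['f0', 'f1'] for this synthetic data
```

A similar check with X.corr() can flag groups of mutually correlated features, from which only one representative is kept.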
Underfitting and Overfitting: The Bias-Variance Trade-Off
- Detection:
- Look at loss plots.
- Overfitting if loss is low in training, but high in testing.
- Underfitting if loss is high in training (and likely in testing).
- Perform K-fold cross-validation or leave-one-out cross-validation.
- Compare the training and validation losses across folds (see the cross-validation sketch after this list).
- Approaches for dealing with underfitting (high bias, low variance):
- Increase model complexity.
- Try a larger set of features.
- Approaches for dealing with overfitting (high variance, low bias):
- Use pruning in decision trees, random forests, or gradient-boosted decision trees.
- Use L1 regularization, L2 regularization, or elastic net regression in regression.
- Use weight decay and batch normalization (or dropout layers) in neural networks.
- Get more training data or try a smaller set of features (early stopping may also help).
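A minimal detection sketch using K-fold cross-validation, assuming scikit-learn and synthetic regression data: comparing an unconstrained tree with a depth-limited one shows the train/validation gap that signals overfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "unconstrained tree": DecisionTreeRegressor(random_state=0),
    "depth-limited tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring="neg_mean_squared_error",
                            return_train_score=True)
    train_mse = -scores["train_score"].mean()
    val_mse = -scores["test_score"].mean()
    # Low training error with much higher validation error -> overfitting;
    # high error on both -> underfitting.
    print(f"{name}: train MSE {train_mse:.1f}, validation MSE {val_mse:.1f}")
```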
Model Evaluation: Metrics
- Offline metrics:
- In balanced datasets, can use accuracy.
- In imbalanced datasets, use precision, recall, F-score, and AUC.
- Online metrics (empirical results observed from user interactions with the live system).
- A/B testing.
- Comparing two models and determining whether the difference between them is statistically significant.
- Generate predictions for both models on the same test data and run a paired t-test on the per-sample scores (see the sketch after this list).
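A minimal sketch of both ideas, assuming scikit-learn/SciPy and made-up test labels and predicted probabilities: the offline metrics listed above, then a paired t-test on the per-sample errors of two models over the same test set.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(500) < 0.2).astype(int)                          # imbalanced labels (~20% positive)
proba_a = np.clip(0.7 * y_true + rng.normal(0.15, 0.20, 500), 0, 1)   # model A scores
proba_b = np.clip(0.6 * y_true + rng.normal(0.20, 0.25, 500), 0, 1)   # model B scores
pred_a = (proba_a >= 0.5).astype(int)

# Offline metrics suited to an imbalanced test set.
print("precision:", precision_score(y_true, pred_a))
print("recall:   ", recall_score(y_true, pred_a))
print("F1:       ", f1_score(y_true, pred_a))
print("AUC:      ", roc_auc_score(y_true, proba_a))

# Paired t-test on per-sample errors of the two models over the same test set.
t_stat, p_value = ttest_rel(np.abs(y_true - proba_a), np.abs(y_true - proba_b))
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.3f}")
```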