Table of Contents
- Handling Class Imbalance: Augmentation, Penalization, and Ensembling
- Feature Selection: Curse of Dimensionality
- Underfitting and Overfitting: Bias-Variance Trade-Off
- Model Evaluation: Metrics
Short Summary: An overview of techniques for building a machine learning model.
Handling Class Imbalance: Augmentation, Penalization, and Ensembling
- Undersample the majority class.
- Problematic since removing samples discards potentially useful information.
- Oversample the minority class (adding duplicate samples).
- Problematic since duplicating data makes some samples more important than others.
- Augment the minority class with synthetic samples via SMOTE or similar approaches (SMOTE interpolates between a minority sample and its nearest minority-class neighbors).
- Problematic since adding synthetic data can introduce bias or complexity.
- Modify the loss function so that errors on the minority class are penalized more heavily (see the sketch after this list).
- A common weighting is \(w_c = \frac{\max_j n_j}{n_c}\), i.e., the size of the largest class divided by the size of class \(c\).
- Take the majority class, divide it into clusters, and replace its samples with the cluster centroids (cluster-based undersampling).
- Learn \(k\) separate models, each trained on \(\frac{n}{k}\) samples of the majority class plus all samples of the minority class (an ensemble of \(k\) models).
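A minimal sketch of the class-weighted loss idea above, assuming scikit-learn and a toy two-class dataset; the helper `largest_class_weights` is illustrative, not a library function:

```python
# Penalize errors on the minority class more heavily by weighting each class
# inversely to its frequency (largest class size / class size).
import numpy as np
from sklearn.linear_model import LogisticRegression

def largest_class_weights(y):
    """Weight each class by (size of the largest class) / (size of that class)."""
    classes, counts = np.unique(y, return_counts=True)
    return {c: counts.max() / n for c, n in zip(classes, counts)}

# Toy imbalanced data: 90 negatives, 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

weights = largest_class_weights(y)                    # {0: 1.0, 1: 9.0}
model = LogisticRegression(class_weight=weights).fit(X, y)
```

Many scikit-learn estimators accept a `class_weight` dict like this; `class_weight="balanced"` applies a similar inverse-frequency scheme automatically.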
Feature Selection: Curse of Dimensionality
- Use Principal Component Analysis (PCA); see the sketch after this list.
- Use Linear Discriminant Analysis (LDA)
- Use Pearson correlation to remove features with low target-feature correlation.
- Use Pearson correlation to find groups of highly correlated features and keep only one representative from each group.
- Use model-based feature selection:
- L1 regularization (lasso), which drives uninformative feature weights to zero.
- Elastic net regression (L1 regularization + L2 regularization).
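A minimal sketch of two of the routes above (PCA and L1-based model selection), assuming scikit-learn and a purely illustrative synthetic dataset:

```python
# Reduce dimensionality either by projection (PCA) or by model-based selection
# (L1 regularization zeroes out the weights of uninformative features).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                        # 200 samples, 50 features
y = X[:, 0] * 3.0 + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# 1) PCA: project onto the top principal components.
X_pca = PCA(n_components=10).fit_transform(X)

# 2) Model-based selection: keep only features whose lasso weights are nonzero.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
X_selected = selector.transform(X)
```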
Underfitting and Overfitting: Bias-Variance Trade-Off
- Detection:
- Look at loss plots.
- Overfitting if loss is low in training, but high in testing.
- Underfitting if loss is high in training (and likely in testing).
- Perform K-fold cross-validation or leave-one-out cross-validation.
- Compare training loss to validation loss across folds (see the sketch after this list).
- Approaches for dealing with underfitting (high bias, low variance):
- Increase model complexity.
- Try a larger set of features.
- Approaches for dealing with overfitting (high variance, low bias):
- Decrease model complexity.
- Use pruning in decision trees, random forests, or gradient boosting over decision trees.
- Use L1 regularization, L2 regularization, or elastic net regression.
- Use weight decay, dropout layers, or batch normalization in neural networks.
- Get more training data.
- Try a smaller set of features, since more features can lead to:
- More dimensions and therefore sparser data (curse of dimensionality).
- Increased complexity and redundant features that are unrelated to the prediction target.
- Early stopping (halt training once the validation loss stops improving).
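A minimal sketch of the detection-plus-regularization workflow above, assuming scikit-learn: K-fold cross-validation exposes the train/validation gap, and increasing the L2 penalty (`alpha` in Ridge) shrinks it. The dataset is synthetic and purely illustrative:

```python
# Detect overfitting by comparing training loss to validation loss across folds,
# then mitigate it by strengthening L2 regularization.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))                         # few samples, many features
y = X[:, 0] + rng.normal(scale=0.5, size=60)

for alpha in (1e-6, 10.0):                            # weak vs. strong regularization
    scores = cross_validate(
        Ridge(alpha=alpha), X, y,
        cv=5,
        scoring="neg_mean_squared_error",
        return_train_score=True,
    )
    train_mse = -scores["train_score"].mean()
    val_mse = -scores["test_score"].mean()
    # A large gap (low training MSE, high validation MSE) signals overfitting.
    print(f"alpha={alpha}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```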
Model Evaluation: Metrics
- Offline metrics:
- In balanced datasets, can use accuracy.
- In unbalanced datasets, use precision, recall, F-score, and AUC.
- Precision@\(k\): the fraction of the top \(k\) recommendations that are relevant (see the sketch after this list).
- Online metrics (empirical results observed from user interactions with the live system):
- A/B testing.
- Click-Through Rate (CTR): \(\text{CTR} = \frac{\text{clicks}}{\text{impressions}}\), where clicks is the number of users who clicked the advertisement and impressions is the number of people who saw it.
- Comparing two models and determining the statistical significance of the difference:
- Use held-out test data to generate predictions from both models and run a paired t-test on the per-sample results.
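Minimal sketches of Precision@\(k\) and the paired t-test comparison above, assuming NumPy and SciPy; the relevance labels and per-sample errors below are illustrative, not real results:

```python
# Precision@k for a ranked recommendation list, and a paired t-test comparing
# two models evaluated on the same test samples.
import numpy as np
from scipy import stats

def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant (1 = relevant)."""
    return np.asarray(relevance)[:k].mean()

# Relevance of recommendations, ordered by the model's ranking score.
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0]
print(precision_at_k(ranked_relevance, k=5))          # 3 of the top 5 -> 0.6

# Paired t-test on per-sample errors of two models over the same test set.
rng = np.random.default_rng(0)
errors_a = rng.normal(loc=0.30, scale=0.05, size=50)
errors_b = rng.normal(loc=0.25, scale=0.05, size=50)
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")         # small p -> significant difference
```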