Table of Contents
- Handling Class Imbalance: Augmentation, Penalization, and Ensembling
- Feature Selection: Curse of Dimensionality
- Underfitting and Overfitting: Bias-Variance Trade-Off
- Model Evaluation: Metrics
Short Summary: An overview of techniques for building a machine learning model.
Handling Class Imbalance: Augmentation, Penalization, and Ensembling
- Undersample the majority class.
- Problematic since removing samples discards potentially useful information.
- Oversample the minority class (adding duplicate samples).
- Problematic since duplicating data makes some samples more important than others.
- Augment the minority class with synthetic samples via SMOTE or similar approaches (SMOTE interpolates between a minority sample and its nearest minority-class neighbors).
- Problematic since adding synthetic data can introduce bias or complexity.
- Modify the loss function so that errors on the minority class are penalized more heavily (see the sketch after this list).
- A common weighting is \(w_c = \frac{\max_j n_j}{n_c}\), i.e., the size of the largest class divided by the size of class \(c\).
- Take the majority class, divide it into clusters, and replace its samples with the cluster centroids (cluster-based undersampling).
- Learn \(k\) separate models, each trained on \(\frac{n}{k}\) samples of the majority class plus all samples of the minority class (an ensemble of \(k\) models).
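A minimal sketch of the class-weighted loss idea above, assuming scikit-learn and a toy two-class dataset; the helper `largest_class_weights` is illustrative, not a library function:

```python
# Penalize errors on the minority class more heavily by weighting each class
# inversely to its frequency (largest class size / class size).
import numpy as np
from sklearn.linear_model import LogisticRegression

def largest_class_weights(y):
    """Weight each class by (size of the largest class) / (size of that class)."""
    classes, counts = np.unique(y, return_counts=True)
    return {c: counts.max() / n for c, n in zip(classes, counts)}

# Toy imbalanced data: 90 negatives, 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

weights = largest_class_weights(y)                    # {0: 1.0, 1: 9.0}
model = LogisticRegression(class_weight=weights).fit(X, y)
```

Many scikit-learn estimators accept a `class_weight` dict like this; `class_weight="balanced"` applies a similar inverse-frequency scheme automatically.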
Feature Selection: Curse of Dimensionality
- Use Principal Component Analysis (PCA); see the sketch after this list.
- Use Linear Discriminant Analysis (LDA)
- Use Pearson correlation to remove features with low target-feature correlation.
- Use Pearson correlation to find groups of highly correlated features and keep only one representative from each group.
- Use model-based feature selection:
- L1 regularization (lasso), which drives uninformative feature weights to zero.
- Elastic net regression (L1 regularization + L2 regularization).
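A minimal sketch of two of the routes above (PCA and L1-based model selection), assuming scikit-learn and a purely illustrative synthetic dataset:

```python
# Reduce dimensionality either by projection (PCA) or by model-based selection
# (L1 regularization zeroes out the weights of uninformative features).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                        # 200 samples, 50 features
y = X[:, 0] * 3.0 + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# 1) PCA: project onto the top principal components.
X_pca = PCA(n_components=10).fit_transform(X)

# 2) Model-based selection: keep only features whose lasso weights are nonzero.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
X_selected = selector.transform(X)
```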
Underfitting and Overfitting: Bias-Variance Trade-Off
- Detection:
- Look at loss plots.
- Overfitting if loss is low in training, but high in testing.
- Underfitting if loss is high in training (and likely in testing).
- Perform K-fold cross-validation or leave-one-out cross-validation.
- Compare training loss to validation loss across folds (see the sketch after this list).
- Approaches for dealing with underfitting (high bias, low variance):
- Increase model complexity.
- Try a larger set of features.
- Approaches for dealing with overfitting (high variance, low bias):
- Decrease model complexity.
- Use pruning in decision trees, random forests, or gradient boosting over decision trees.
- Use L1 regularization, L2 regularization, or elastic net regression.
- Use weight decay, dropout layers, or batch normalization in neural networks.
- Get more training data.
- Try a smaller set of features, since more features can lead to:
- More dimensions and therefore sparser data (curse of dimensionality).
- Increased complexity and redundant features that are unrelated to the prediction target.
- Early stopping (halt training once the validation loss stops improving).
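A minimal sketch of the detection-plus-regularization workflow above, assuming scikit-learn: K-fold cross-validation exposes the train/validation gap, and increasing the L2 penalty (`alpha` in Ridge) shrinks it. The dataset is synthetic and purely illustrative:

```python
# Detect overfitting by comparing training loss to validation loss across folds,
# then mitigate it by strengthening L2 regularization.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))                         # few samples, many features
y = X[:, 0] + rng.normal(scale=0.5, size=60)

for alpha in (1e-6, 10.0):                            # weak vs. strong regularization
    scores = cross_validate(
        Ridge(alpha=alpha), X, y,
        cv=5,
        scoring="neg_mean_squared_error",
        return_train_score=True,
    )
    train_mse = -scores["train_score"].mean()
    val_mse = -scores["test_score"].mean()
    # A large gap (low training MSE, high validation MSE) signals overfitting.
    print(f"alpha={alpha}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```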
Model Evaluation: Metrics
- Offline metrics:
- In balanced datasets, can use accuracy.
- In unbalanced datasets, use precision, recall, F-score, and AUC.
- Precision@\(k\): the fraction of the top \(k\) recommendations that are relevant (see the sketch after this list).
- Online metrics (empirical results observed from user interactions with the live system):
- A/B testing.
- Click-Through Rate (CTR): \(\text{CTR} = \frac{\text{clicks}}{\text{impressions}}\), where clicks is the number of users who clicked the advertisement and impressions is the number of people who saw it.
- Comparing two models and determining the statistical significance of the difference:
- Use held-out test data to generate predictions from both models and run a paired t-test on the per-sample results.
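Minimal sketches of Precision@\(k\) and the paired t-test comparison above, assuming NumPy and SciPy; the relevance labels and per-sample errors below are illustrative, not real results:

```python
# Precision@k for a ranked recommendation list, and a paired t-test comparing
# two models evaluated on the same test samples.
import numpy as np
from scipy import stats

def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant (1 = relevant)."""
    return np.asarray(relevance)[:k].mean()

# Relevance of recommendations, ordered by the model's ranking score.
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0]
print(precision_at_k(ranked_relevance, k=5))          # 3 of the top 5 -> 0.6

# Paired t-test on per-sample errors of two models over the same test set.
rng = np.random.default_rng(0)
errors_a = rng.normal(loc=0.30, scale=0.05, size=50)
errors_b = rng.normal(loc=0.25, scale=0.05, size=50)
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")         # small p -> significant difference
```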