19 Sep 2019 / Lavanya Shukla, ML engineer at Weights & Biases

Part II: A Whirlwind Tour of Machine Learning Models


In Part I, Best Practices for Building a Machine Learning Model, we talked about the part art, part science of picking the perfect machine learning model.

In Part II, we dive deeper into the different machine learning models you can train and when you should use them!

In general, tree-based models perform best in Kaggle competitions. The other models make great candidates for ensembling. For computer vision challenges, CNNs outperform everything else; for natural language processing, LSTMs or GRUs are your best bet!

With that said, below is a non-exhaustive laundry list of models to try, along with some context for each model.

What We'll Cover

Regression - Predicting continuous values

A. Linear Regression

B. Regression Trees

C. Deep Learning

D. K Nearest Neighbors - Distance Based

Classification - Predict a class or class probabilities

A. Logistic Regression

B. Support Vector Machines - Distance based

C. Naive Bayes - Probability based

D. K Nearest Neighbors - Distance Based

E. Classification Tree

F. Deep Learning

Clustering - Organize the data into groups to maximize similarity

A. DBSCAN

B. KMeans

Ensembling Your Models

1. Regression

Regression → Linear Regression → Vanilla Linear Regression

Advantages

Disadvantages
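
To make this concrete, here's a minimal scikit-learn sketch of vanilla linear regression; the synthetic make_regression data and the cross_val_score setup are stand-ins so the snippet runs on its own, not code from the original kernel.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data so the snippet is self-contained
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
print(cross_val_score(LinearRegression(), X, y, cv=5).mean())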

Regression → Linear Regression → Lasso, Ridge, Elastic-Net Regression

Advantages

Disadvantages
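
Here's a minimal sketch of the three regularized variants in scikit-learn; the alpha and l1_ratio values are illustrative, not tuned.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
# Regularization strengths here are placeholders; tune them for your data
for name, model in [('lasso', Lasso(alpha=0.1)),
                    ('ridge', Ridge(alpha=1.0)),
                    ('elastic-net', ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    print(name, cross_val_score(model, X, y, cv=5).mean())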

Regression → Regression Trees → Decision Tree

Advantages

Disadvantages
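
A minimal scikit-learn sketch of a single regression tree, again on synthetic stand-in data:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
# Limiting depth is one simple way to keep a single tree from overfitting
print(cross_val_score(DecisionTreeRegressor(max_depth=5), X, y, cv=5).mean())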

Regression → Regression Trees → Ensembles

Advantages

Disadvantages
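
For tree ensembles, here's a sketch comparing a bagged ensemble (random forest) with a boosted one (gradient boosting); the hyperparameters are illustrative defaults.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
for name, model in [('random forest', RandomForestRegressor(n_estimators=200, random_state=42)),
                    ('gradient boosting', GradientBoostingRegressor(n_estimators=200, random_state=42))]:
    print(name, cross_val_score(model, X, y, cv=5).mean())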

Regression → Deep Learning

Advantages

Disadvantages
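
A minimal tensorflow.keras sketch of a regression network; the architecture and synthetic data are purely illustrative.

from sklearn.datasets import make_regression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)

# Small fully-connected network with a single linear output unit for regression
model = Sequential([
    Dense(64, activation='relu', input_shape=(X.shape[1],)),
    Dense(64, activation='relu'),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)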

Regression → K Nearest Neighbors (Distance Based)

Advantages

Disadvantages
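
A minimal KNN regression sketch; note the scaling step, since distance-based models are sensitive to feature scales.

from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
# Standardize features before computing neighbor distances
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
print(cross_val_score(knn, X, y, cv=5).mean())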

2. Classification

Classification → Logistic Regression

Advantages

Disadvantages
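
A minimal logistic regression sketch on synthetic binary-classification data (the data is a stand-in, not from the original kernel):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())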

Classification → Support Vector Machines (Distance based)

Advantages

Disadvantages
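
A minimal SVM sketch; C, the kernel choice, and gamma are the main knobs to tune, and the values here are illustrative.

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
# Scale features first, since SVMs are distance based
svc = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf'))
print(cross_val_score(svc, X, y, cv=5).mean())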

Classification → Naive Bayes (Probability based)

Advantages

Disadvantages
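
A minimal Naive Bayes sketch; GaussianNB suits continuous features, while MultinomialNB is the usual choice for text counts.

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
print(cross_val_score(GaussianNB(), X, y, cv=5).mean())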

Classification → K Nearest Neighbors (Distance Based)

Advantages

Disadvantages
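
The classification version of KNN looks almost identical to the regression one, scaling included:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(knn, X, y, cv=5).mean())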

Classification → Classification Tree → Decision Tree

Advantages

Disadvantages
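
A minimal single-tree classifier sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
# max_depth is an illustrative cap to limit overfitting
print(cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=5).mean())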

Classification → Classification Tree → Ensembles

Advantages

Disadvantages
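
And the ensembled classifiers, again comparing a bagged forest with a boosted model on stand-in data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
for name, model in [('random forest', RandomForestClassifier(n_estimators=200, random_state=42)),
                    ('gradient boosting', GradientBoostingClassifier(n_estimators=200, random_state=42))]:
    print(name, cross_val_score(model, X, y, cv=5).mean())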


Classification → Deep Learning

Advantages

Disadvantages
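
A minimal tensorflow.keras sketch for binary classification; the architecture and synthetic data are illustrative only.

from sklearn.datasets import make_classification
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Sigmoid output and binary cross-entropy for a two-class problem
model = Sequential([
    Dense(64, activation='relu', input_shape=(X.shape[1],)),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)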

3. Clustering

Clustering → DBSCAN

Advantages

Disadvantages
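
A minimal DBSCAN sketch on synthetic blobs; eps and min_samples are dataset-dependent and the values here are only a starting point.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
# eps and min_samples define what counts as a dense region; label -1 marks noise points
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(StandardScaler().fit_transform(X))
print(set(labels))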

Clustering → KMeans

Advantages

Disadvantages
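
A minimal KMeans sketch on the same kind of synthetic blobs; unlike DBSCAN, you have to choose the number of clusters up front.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
# n_clusters must be chosen in advance (here it matches the synthetic data)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)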

4. Misc - Models not included in this post

Ensembling Your Models

Ensembling models is a really powerful technique that helps reduce overfitting and makes predictions more robust by combining the outputs of different models. It's an especially essential tool for winning Kaggle competitions.

When picking models to ensemble, we want to pick them from different model classes so that they have different strengths and weaknesses and capture different patterns in the dataset. This diversity means their errors are less correlated, so combining them lowers the overall error. We also want their individual performance to be comparable, to keep the blended predictions stable.

We can see here that blending these models resulted in much lower loss than any single model could produce alone. Part of the reason is that while all of these models make pretty good predictions, they each get different predictions right; by blending them, we combine their different strengths into one much stronger model.

# Blend models in order to make the final predictions more robust to overfitting

def blended_predictions(X):
    # Weighted average of the predictions from each previously fitted model
    return ((0.1 * ridge_model_full_data.predict(X)) +
            (0.2 * svr_model_full_data.predict(X)) +
            (0.1 * gbr_model_full_data.predict(X)) +
            (0.1 * xgb_model_full_data.predict(X)) +
            (0.1 * lgb_model_full_data.predict(X)) +
            (0.05 * rf_model_full_data.predict(X)) +
            (0.35 * stack_gen_model.predict(np.array(X))))

There are 4 types of ensembling (including blending): bagging (training the same model on different random subsets of the data and averaging the results), boosting (training models sequentially, with each one focusing on the previous one's mistakes), stacking (training a meta-model on the out-of-fold predictions of several base models), and blending (combining base-model predictions with fixed weights, as in the snippet above).
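
For stacking specifically, scikit-learn's StackingRegressor wraps the whole pattern; the sketch below uses synthetic make_regression data and arbitrary base models purely for illustration.

from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
stack = StackingRegressor(
    estimators=[('ridge', Ridge(alpha=1.0)),
                ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
                ('gbr', GradientBoostingRegressor(n_estimators=200, random_state=42))],
    final_estimator=Ridge(alpha=1.0))  # the meta-model learns how to weight the base models
print(cross_val_score(stack, X, y, cv=5).mean())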

Comparing Models

Weights & Biases lets you track and compare the performance of your models with one line of code.

Once you have selected the models you’d like to try, train them and simply add wandb.log({'score': cv_score}) to log your cross-validation scores. When you’re done training, you can compare your models' performance in one easy dashboard!

I encourage you to fork this kernel and play with the code!


# WandB
import wandb
import tensorflow.keras
from wandb.keras import WandbCallback
from sklearn.model_selection import cross_val_score
# Import models (Step 1: add your models here)
from sklearn import svm
from sklearn.linear_model import Ridge, RidgeCV
from xgboost import XGBRegressor

# Model 1
# Initialize wandb run
# You can change your project name here. For more config options, see https://docs.wandb.com/docs/init.html
wandb.init(anonymous='allow', project="pick-a-model")

# Initialize model (Step 2: add your classifier here)
clf = svm.SVR(C=20, epsilon=0.008, gamma=0.0003)

# Get CV scores
cv_scores = cross_val_score(clf, X_train, train_labels, cv=5)

# Log scores
for cv_score in cv_scores:
   wandb.log({'score': cv_score})

# Model 2
# Initialize wandb run
# You can change your project name here. For more config options, see https://docs.wandb.com/docs/init.html
wandb.init(anonymous='allow', project="pick-a-model")

# Initialize model (Step 2: add your classifier here)
clf = XGBRegressor(learning_rate=0.01,
                      n_estimators=6000,
                      max_depth=4,
                      min_child_weight=0,
                      gamma=0.6,
                      subsample=0.7,
                      colsample_bytree=0.7,
                      objective='reg:linear',
                      nthread=-1,
                      scale_pos_weight=1,
                      seed=27,
                      reg_alpha=0.00006,
                      random_state=42)

# Get CV scores
cv_scores = cross_val_score(clf, X_train, train_labels, cv=5)

# Log scores
for cv_score in cv_scores:
   wandb.log({'score': cv_score})

# Model 3
# Initialize wandb run
# You can change your project name here. For more config options, see https://docs.wandb.com/docs/init.html
wandb.init(anonymous='allow', project="pick-a-model")

# Initialize model (Step 2: add your classifier here)
ridge_alphas = [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5, 10, 15, 18, 20, 30, 50, 75, 100]
clf = RidgeCV(alphas=ridge_alphas)  # RidgeCV searches over the supplied alphas with built-in cross-validation

# Get CV scores
cv_scores = cross_val_score(clf, X_train, train_labels, cv=5)

# Log scores
for cv_score in cv_scores:
   wandb.log({'score': cv_score})

That’s it! Now you have all the tools you need to pick the right models for your problem!

Model selection can be very complicated, but I hope this guide sheds some light and gives you a good framework for picking models.

Weights & Biases

We're building lightweight, flexible experiment tracking tools for deep learning. Add a couple of lines to your python script, and we'll keep track of your hyperparameters and output metrics, making it easy to compare runs and see the whole history of your progress. Think of us like GitHub for deep learning.
