7 Jan 2020 / James Le, Hybrid Data Scientist/Data Journalist

Classifying Tweets with Weights & Biases

Introduction

A common NLP task is to classify text. The most familiar example is sentiment analysis, where texts are classified as positive or negative. In this project, we will consider a slightly harder problem: classifying whether or not a tweet is about an actual disaster.

Not all tweets that contain words associated with disasters are actually about disasters. A tweet such as, "California forests on fire near San Francisco" is a tweet that should be taken into consideration, whereas "California this weekend was on fire, good times in San Francisco" can safely be ignored.

The goal of the task here is to build a classifier that separates the tweets that relate to real disasters from irrelevant tweets. The dataset that we are using consists of hand-labeled tweets that were obtained by searching Twitter for words common to disaster tweets.

Note: You can find the accompanying code in this Colab Notebook. We highly encourage you to fork it, tweak the parameters, or try the model with your own dataset!

Setup

Start out by installing the experiment tracking library and setting up your free W&B account:
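# Install the wandb client, then authenticate
# (run the shell line in a notebook cell; you'll be prompted for an API key)
!pip install wandb -q

import wandb
wandb.login()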

The Data

The dataset, called “Disasters on Social Media”, comes from Figure Eight. Contributors looked at over 10,000 tweets culled with a variety of searches like “ablaze”, “quarantine”, and “pandemonium”, then noted whether each tweet referred to a disaster event (as opposed to a joke using the word, a movie review, or something else non-disastrous).
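For reference, loading the data might look like the snippet below. The filename and encoding are assumptions about your local copy of the Figure Eight CSV, so adjust the path to wherever you saved it.

import pandas as pd

# Load the labeled tweets (filename and encoding assumed; adjust as needed)
df = pd.read_csv('socialmedia-disaster-tweets-DFE.csv', encoding='latin-1')
print(df.shape)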

Prepare The Target

There are several possible prediction targets in this dataset. In our case, human annotators rated each tweet with one of three options: Relevant, Not Relevant, and Can't Decide. The code below drops the ambiguous category and maps the rest to a binary target:

# Remove the category "Can't Decide"
df = df[df.choose_one != "Can't Decide"]


# Keep only the 2 columns below as we only want to map text to relevance
df = df[['text', 'choose_one']]

# Convert the target into binary numbers
df['relevant'] = df.choose_one.map({'Relevant':1,'Not Relevant':0})
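Before training, it's worth a quick sanity check on the resulting label distribution:

# How many relevant vs. irrelevant tweets do we have?
print(df['relevant'].value_counts())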

Lemmatization

A lemma (in the field of linguistics) is the word under which the set of related words or forms appears in a dictionary. For example, "was" and "is" appear under "be," "mice" appears under "mouse," and so on. Quite often, the specific form of a word does not matter very much, so it can be a good idea to convert all your text into its lemma form.

import spacy
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])

# Loop over the words in the 'text' column
# Save the lemma of the word in a new 'lemmas' column
df['lemmas'] = df['text'].apply(lambda row: [w.lemma_ for w in nlp(row)])

# Turn the lists in 'lemmas' back to text
df['joint_lemmas'] = df['lemmas'].apply(lambda row: ' '.join(row))

Here is the new data frame:
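If you're following along in code rather than screenshots, you can inspect the result directly:

# Compare the raw text to its lemmatized version
print(df[['text', 'joint_lemmas']].head())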

Word Embeddings

The order of words in a text matters. Therefore, we can expect higher performance if we do not just look at texts in aggregate but see them as a sequence.

Embeddings work like a lookup table. For each token, they store a vector. When a token is given to the embedding layer, it returns the vector for that token, which is then passed through the rest of the network. As the network trains, the embeddings are optimized as well.

Remember that neural networks work by calculating the derivative of the loss function with respect to the parameters (weights) of the model. Through backpropagation, we can also calculate the derivative of the loss function with respect to the input of the model. Thus we can optimize the embeddings to deliver ideal inputs that help our model.
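To make the lookup concrete, here is a minimal sketch with toy sizes (separate from the model we train below); the embedding layer is just a trainable matrix indexed by token id:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Toy example: a vocabulary of 5 tokens, each mapped to a 3-dimensional vector
demo = Sequential()
demo.add(Embedding(input_dim=5, output_dim=3, input_length=2))
demo.compile('rmsprop', 'mse')

# Feed in two token ids and get back one 3-dimensional vector per token
print(demo.predict(np.array([[1, 4]])).shape)  # -> (1, 2, 3)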

Before we start with training word embeddings, we need to do some pre-processing steps. In particular, we need to assign each word token a number and create a NumPy array full of sequences.


# The Tokenizer class allows us to specify how many words to consider
from keras.preprocessing.text import Tokenizer
max_words = 7000 # We will only consider the 7K most used words in this dataset

# Create a new Tokenizer object
tokenizer = Tokenizer(num_words=max_words)
# Generate tokens by counting frequency
tokenizer.fit_on_texts(df['joint_lemmas'])
# Transform the text into tokenized sequences
sequences = tokenizer.texts_to_sequences(df['joint_lemmas'])

# Look up the mappings of words to numbers from the tokenizer word index
word_index = tokenizer.word_index
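A quick check that the mapping worked (the exact ids will vary with your data):

# The first tweet, now as a list of integer token ids
print(sequences[0][:10])
# Number of distinct tokens the tokenizer saw
print(len(word_index))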


Next, we need to turn our sequences into sequences of equal length. This is not always necessary, as some model types can deal with sequences of different lengths, but it usually makes sense and is often required.


# Use Keras' pad_sequences to bring all of the sequences to the same length
from keras.preprocessing.sequence import pad_sequences

# Pad/truncate all sequences to 140 tokens (tweets were capped at 140 characters, so this is a safe upper bound)
maxlen = 140
data = pad_sequences(sequences, maxlen=maxlen)

# Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, df['relevant'],
test_size = 0.3, shuffle=True, random_state = 1024)
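Before moving on, it's worth confirming the shapes (the row counts depend on your split):

# Each row is a tweet padded to 140 token ids
print(X_train.shape, X_test.shape)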

The Models

Below, I add Weights & Biases to track model performance:


import wandb
from wandb.keras import WandbCallback

# Initialize a new wandb run
wandb.init(entity="khanhnamle1994", project="tweet-classification")

# Default values for hyper-parameters
config = wandb.config # Config holds and saves hyperparameters and inputs
config.epochs = 10 # Number of epochs
config.batch_size = 32 # Batch size
config.embedding_dim = 70 # Dimension of the embedding layer
config.activation = 'sigmoid' # Activation function
config.optimizer = 'adam' # Optimization technique

Feedforward Neural Network

Let's train our word vectors. To use embeddings, we have to specify how large we want the word vectors to be. The 70-dimensional vectors chosen here can capture good representations even for fairly large vocabularies. We also have to specify how many words we want embeddings for and how long our sequences are.

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

embedding_dim = config.embedding_dim

# Create the Model
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation=config.activation))

# Display model architecture
model.summary()

The embedding layer has 70 parameters for each of the 7,000 words in the vocabulary, or 490,000 parameters in total. This large parameter count can lead to overfitting. The next step is to compile and train our model.

# Compile the model
model.compile(optimizer=config.optimizer,
             loss='binary_crossentropy',
             metrics=['acc'])

# Fit and train the model
history = model.fit(X_train, y_train,
                   epochs=config.epochs,
                   batch_size=config.batch_size,
                   validation_data=(X_test, y_test),
                   callbacks=[WandbCallback()])


The model achieves about 78% accuracy on the test set, but over 98% accuracy on the training set. The large number of parameters in the custom embeddings has led to overfitting.
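To reproduce the test-set number directly, you can evaluate the trained model yourself:

# Evaluate on the held-out set; acc is the ~78% figure quoted above
loss, acc = model.evaluate(X_test, y_test)
print(acc)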

Long Short-Term Memory Network

Text is a time series: different words follow each other, and the order in which they do matters. Therefore, every neural network-based technique for time series problems can also be used for NLP. Below I use a Long Short-Term Memory (LSTM) model, which can process not only single data points but entire sequences of data. LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional recurrent neural networks.

from keras.layers import LSTM

embedding_dim = config.embedding_dim

# Create another model and replace Flatten 'layer' with 'LSTM' layer
model_lstm = Sequential()
model_lstm.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model_lstm.add(LSTM(32))
model_lstm.add(Dense(1, activation=config.activation))
model_lstm.summary()
model_lstm.compile(optimizer=config.optimizer,
                 loss='binary_crossentropy',
                  metrics=['acc'])
history = model_lstm.fit(X_train, y_train,
                        epochs=config.epochs,
                        batch_size=config.batch_size,
                        validation_data=(X_test, y_test),
                        callbacks=[WandbCallback()])

The model achieves about 77% accuracy on the test set, but over 97% accuracy on the training set. Not much better than the previous model.

Bidirectional Recurrent Neural Network

Next, I use a Bidirectional Recurrent Neural Network, which splits the neurons of a regular RNN into two directions: one for the positive time direction (forward states) and one for the negative time direction (backward states). With this structure, the output layer can draw on information from both past and future states simultaneously.

from keras.layers import Bidirectional

embedding_dim = config.embedding_dim

# Create another model and wrap Bidirectional layer around LSTM layer
model_birnn = Sequential()
model_birnn.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model_birnn.add(Bidirectional(LSTM(64,return_sequences=True)))
model_birnn.add(Bidirectional(LSTM(32)))
model_birnn.add(Dense(1, activation=config.activation))
model_birnn.summary()
model_birnn.compile(optimizer=config.optimizer,
                   loss='binary_crossentropy',
                    metrics=['acc'])
history = model_birnn.fit(X_train, y_train,
                         epochs=config.epochs,
                         batch_size=config.batch_size,
                         validation_data=(X_test, y_test),
                         callbacks=[WandbCallback()])

The model achieves about 76% accuracy on the test set, but over 97% accuracy on the training set. It seems we have hit diminishing returns.

Comparison

Let's compare the performance of these models in Weights & Biases. In the images below:

  1. The run Vanilla-Feedforward-NN is the Vanilla Feedforward Neural Network model.
  2. The run LSTM is the Long Short Term Memory Network model.
  3. The run Bidirectional-RNN is the Bidirectional Recurrent Neural Network model.

As seen above, the feedforward model has the highest accuracy on the training set, followed by the Bidirectional RNN and the LSTM.

The results on the test set show that the feedforward model still has the highest accuracy. The LSTM model does better than the Bidirectional model this time.

Visualize Predictions Live

Project Overview

  1. Check out the project page to see your results in the shared project.
  2. Press 'option+space' to expand the runs table, comparing all the results from everyone who has tried this script.
  3. Click on the name of a run to dive in deeper to that single run on its own run page.

Visualize Performance

Click through to a single run to see more details about that run. For example, on this run page you can see the performance metrics I logged when I ran this script.
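The WandbCallback used above logs training and validation metrics automatically; if you want extra numbers in the same run, you can log them yourself, for example:

# Log a custom metric to the current run (e.g. the acc computed from model.evaluate above)
wandb.log({'test_accuracy': acc})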

Review Code

The overview tab picks up a link to the code. In this case, it's a link to the Google Colab. If you're running a script from a git repo, we'll pick up the SHA of the latest git commit and give you a link to that version of the code in your own GitHub repo.

Visualize System Metrics

The System tab on the runs page lets you visualize how resource efficient your model was. It lets you monitor the GPU, memory, CPU, disk, and network usage in one spot.

More About Weights & Biases

Here are some more resources that you can use to learn about W&B:

  1. Documentation - Python docs
  2. Gallery - example reports in W&B
  3. Articles - blog posts and tutorials
  4. Community - join our Slack community forum
Join our mailing list to get the latest machine learning updates.