Today you are a machine learning engineer, a member of the Birdwatch at Twitter.
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, here a tweet contains hate speech if it has a racist or sexist sentiment associated with it. In other words, we need classify racist or sexist tweets from other tweets.
A labelled dataset of 31,962 tweets (late 2017 to early 2018) is provided in the form of a compressed csv file with each line storing a tweet id, its label, and the tweet. Label '1' denotes the tweet is racist/sexist while label '0' denotes the tweet is not racist/sexist.
We will first approach the problem in a traditional way: clean the raw text using simple regex (regular expression), extract features, build naive Bayes models to classify tweets; then we build a deep learning model and explain our deep learning model with LIME.
By the end of this lesson, you will be able to:
Start with dependencies.
Most modules are pre-installed in Colab, however, we need to update gensim
to its recent version and install lime
.
!pip install -U -q gensim==4.2.0 lime
|████████████████████████████████| 24.1 MB 45.2 MB/s |████████████████████████████████| 275 kB 62.7 MB/s Building wheel for lime (setup.py) ... done
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
pandas.read_csv
to load the tweets in tweets.csv.gz
and save the pd.DataFrame
into raw
. Make sure the path points to where the data is located in your Google Drive. raw = pd.read_csv("/content/drive/My Drive/Colab Notebooks/dat/nlp/dat/tweets.csv.gz")
print(raw.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 31962 entries, 0 to 31961 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 31962 non-null int64 1 label 31962 non-null int64 2 tweet 31962 non-null object dtypes: int64(2), object(1) memory usage: 749.2+ KB None
label
and tweet
columns. Hint: one option is to use sample()
followed by groupby
. # Colab includes an extension that renders pandas dataframes into interactive displays that can be filtered, sorted, and explored dynamically.
from google.colab import data_table
data_table.enable_dataframe_formatter()
# YOUR CODE HERE
raw[['label','tweet']].groupby('label').sample(5)
label | tweet | |
---|---|---|
27691 | 0 | so cool! i love the new #youtube 4 #gamers and... |
25568 | 0 | we open mid-july but are taking bookings now! ... |
1050 | 0 | @user we can do no great things, only small t... |
19517 | 0 | @user 80-yr-old hindu man #gokaldas beaten up... |
14280 | 0 | happy 6th bihday junior ððâ¤â¤ð i h... |
19161 | 1 | @user @user @user @user @user why would i ta... |
20915 | 1 | #newyear 'wish list' of cretin #carlpaladino,... |
23279 | 1 | this is sooooo or may be just funny |
11308 | 1 | you might be a libtard if... #libtard #sjw #l... |
4708 | 1 | #lgbti #poc need to speak up against the that... |
len_tweet
in raw
. # YOUR CODE HERE
raw['len_tweet'] = raw['tweet'].apply(lambda x:len(x))
raw.head()
id | label | tweet | len_tweet | |
---|---|---|---|---|
0 | 1 | 0 | @user when a father is dysfunctional and is s... | 102 |
1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... | 122 |
2 | 3 | 0 | bihday your majesty | 21 |
3 | 4 | 0 | #model i love u take with u all the time in ... | 86 |
4 | 5 | 0 | factsguide: society now #motivation | 39 |
len_tweet
for each label? Hint: use groupby
and describe
. pd.set_option("display.precision", 1)
raw.groupby('label')['len_tweet'].describe()
# YOUR CODE HERE
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
label | ||||||||
0 | 29720.0 | 84.3 | 29.6 | 11.0 | 62.0 | 88.0 | 107.0 | 274.0 |
1 | 2242.0 | 90.2 | 27.4 | 12.0 | 69.0 | 96.0 | 111.0 | 152.0 |
Note it is again an imbalanced dataset: the ratio of non hate speech to hate speech is roughly 13:1.
Clean the tweets.
We use re
to perform basic text manipulations. Specically, remove anonymized user handle, remove numbers and special characters except hashtags.
import re
raw.sample(5, random_state = 203)
id | label | tweet | len_tweet | |
---|---|---|---|---|
790 | 791 | 1 | @user and you keep telling that only aryans ar... | 109 |
21928 | 21929 | 0 | @user what makes you ? | 25 |
25642 | 25643 | 0 | â #nzd/usd extends rbnz-led rally, hits fre... | 101 |
20436 | 20437 | 0 | i'm on a mission to ride all of the animals! ... | 91 |
22552 | 22553 | 0 | the color of a human skin matters a lot to the... | 88 |
remove user handles from the text in tweet
, or anything directly following the symbols @
, and save the resulting tweets to a new column tidy_tweet
in raw
.
Hint: you can use re.sub
on individual text and apply
a simple lambda function for the series raw['tweet']
.
raw['tidy_tweet'] = raw['tweet'].apply(lambda tweet: re.sub(r'(@[\w]+)', '', tweet)) # YOUR CODE HERE
raw.sample(5, random_state=203)
id | label | tweet | len_tweet | tidy_tweet | |
---|---|---|---|---|---|
790 | 791 | 1 | @user and you keep telling that only aryans ar... | 109 | and you keep telling that only aryans are all... |
21928 | 21929 | 0 | @user what makes you ? | 25 | what makes you ? |
25642 | 25643 | 0 | â #nzd/usd extends rbnz-led rally, hits fre... | 101 | â #nzd/usd extends rbnz-led rally, hits fre... |
20436 | 20437 | 0 | i'm on a mission to ride all of the animals! ... | 91 | i'm on a mission to ride all of the animals! ... |
22552 | 22553 | 0 | the color of a human skin matters a lot to the... | 88 | the color of a human skin matters a lot to the... |
remove non-alphabetic characters yet keep symbols #
from tidy_tweet
and save the result in tidy_tweet
. In other words, keep all 26 letters and #
.
Note: in some applications, punctuations, emojis, or whether the word is in all caps can be of use. You shall decide whether to extract such features for the application and perform error analysis to gain insight.
raw['tidy_tweet'] = raw['tidy_tweet'].apply(lambda tweet: re.sub('[^A-Za-z0-9#\s ]+', '', tweet))# YOUR CODE HERE
raw.sample(5, random_state=203)
id | label | tweet | len_tweet | tidy_tweet | |
---|---|---|---|---|---|
790 | 791 | 1 | @user and you keep telling that only aryans ar... | 109 | and you keep telling that only aryans are all... |
21928 | 21929 | 0 | @user what makes you ? | 25 | what makes you |
25642 | 25643 | 0 | â #nzd/usd extends rbnz-led rally, hits fre... | 101 | #nzdusd extends rbnzled rally hits fresh 1ye... |
20436 | 20437 | 0 | i'm on a mission to ride all of the animals! ... | 91 | im on a mission to ride all of the animals #... |
22552 | 22553 | 0 | the color of a human skin matters a lot to the... | 88 | the color of a human skin matters a lot to the... |
remove words that is shorter than 4 characters from the processed tweets.
For example,
i m on a mission to ride all of the animals #teamchanlv #vegas #lasvegas #funtimes
will be reduced to
mission ride animals #teamchanlv #vegas #lasvegas #funtimes
raw['tidy_tweet'] = raw['tidy_tweet'].apply(lambda tweet:re.sub(r'\b\w{1,3}\b', '', tweet)) # YOUR CODE HERE
raw.sample(5, random_state=203)
id | label | tweet | len_tweet | tidy_tweet | |
---|---|---|---|---|---|
790 | 791 | 1 | @user and you keep telling that only aryans ar... | 109 | keep telling that only aryans allowed rap... |
21928 | 21929 | 0 | @user what makes you ? | 25 | what makes |
25642 | 25643 | 0 | â #nzd/usd extends rbnz-led rally, hits fre... | 101 | #nzdusd extends rbnzled rally hits fresh 1ye... |
20436 | 20437 | 0 | i'm on a mission to ride all of the animals! ... | 91 | mission ride animals #teamchanlv #veg... |
22552 | 22553 | 0 | the color of a human skin matters a lot to the... | 88 | color human skin matters system when c... |
Remove stop words and text normalization.
We will use stopwords
collection and SnowballStemmer
in nltk
for this task. Before doing so, we need to tokenize the tweets. Tokens are individual terms or words, and tokenization is simply to split a string of text into tokens. You can use str.split()
on individual text and apply
a simple lambda function for the series raw['tidy_tweet']
and save the result into tokenized_tweet
.
Check out some methods for the built-in type str
here.
tokenized_tweet = raw['tidy_tweet'].apply(lambda t: t.split()) # YOUR CODE HERE
tokenized_tweet.head()
0 [when, father, dysfunctional, selfish, drags, ... 1 [thanks, #lyft, credit, cant, cause, they, don... 2 [bihday, your, majesty] 3 [#model, love, take, with, time] 4 [factsguide, society, #motivation] Name: tidy_tweet, dtype: object
Extract stop words and remove them from the tokens.
Note: depending on the task / industry, it is highly recommended that one curate custom stop words.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
tokenized_tweet = tokenized_tweet.apply(lambda x: [token for token in x if token not in stop_words]) # YOUR CODE HERE
assert any(word in tokenized_tweet for word in stop_words) == False
SnowballStemmer
, set the language
to be "english"; see how to. from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')# YOUR CODE HERE
tokenized_tweet = tokenized_tweet.apply(lambda x: ' '.join(x))
tokenized_tweet.head()
0 father dysfunctional selfish drags kids dysfun... 1 thanks #lyft credit cant cause dont offer whee... 2 bihday majesty 3 #model love take time 4 factsguide society #motivation Name: tidy_tweet, dtype: object
tokenized_tweet
back together and save them in raw['tidy_tweet']
. Use str.join()
and apply
.raw['tidy_tweet'] = tokenized_tweet.apply(lambda t:t.join(t))# YOUR CODE HERE
raw.sample(1, random_state=203)
id | label | tweet | len_tweet | tidy_tweet | |
---|---|---|---|---|---|
790 | 791 | 1 | @user and you keep telling that only aryans ar... | 109 | kkeep telling aryans allowed rape women youre ... |
In this task, we want to gain a general idea of what the common words were and how hashtags were used in tweets. We will create wordclouds and extract the top hashtags used in each label.
Before doing so, out of caution of possible data leakage, split the raw['tidy_tweet']
into training and test datasets in a stratified fashion, set the test size at .25 and random state as 42.
Save the results into X_train
, X_test
, y_train
, and y_test
.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
raw['tidy_tweet'], raw.label,
test_size=0.25, random_state=42, stratify=raw.label)
assert X_train.shape == y_train.shape == (23971, )
assert X_test.shape == y_test.shape == (7991,)
A word cloud is a cluster of words depicted in different sizes. The bigger the word appears, the more often it appears in the given text. It can offer an easy visual presentation to reveal the theme of a topic.
Function plot_wordcloud
is provided to plot 50 most frequent words from the given text in the shape of twitter's logo. You may need to replace the image path accordingly.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
def plot_wordcloud(text:str) -> None:
'''
Plot a wordcloud of top 50 words from the input text
masked by twitter logo
'''
mask = np.array(Image.open('/content/drive/My Drive/Colab Notebooks/dat/nlp/img/twitter-mask.png')) # REPLACE w/ YOUR FILE PATH
wordcloud = WordCloud(
background_color='white',
random_state=42,
max_words=50,
max_font_size=80,
mask = mask).generate(text)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Visualize the wordcloud.
That the function expects one long string. Stitch all tidy tweets from training set and save the single string to all_words
, then visualize the wordcloud for all the words.
all_words = X_train.str.cat(sep = ' ')# YOUR CODE HERE
plot_wordcloud(all_words)
Visualize the wordcloud just for the text from the tweets identified as hate speech.
Similarly, you need to stitch all the tidy tweets in training set that were identified as hate speech. Save the long string to negative_words
.
negative_words = ''.join(X_train[y_train==1]) # YOUR CODE HERE
plot_wordcloud(negative_words)
Hashtag is an feature for tweets and we would like to inspect if hashtags provide information for our classification task.
Function hashtag_extract
is provided to extract hashtags from an iterable (list or series) and return the hashtags in a list.
def hashtag_extract(x) -> list:
"""
extract hastags from an iterable (list or series) and
return the hashtags in a list.
"""
hashtags = []
# Loop over the words in the tweet
for i in x:
ht = re.findall(r"#(\w+)", i)
hashtags.append(ht)
return hashtags
HT_regular = hashtag_extract(X_train[y_train == 0]) # YOUR CODE HERE
len(HT_regular)
22290
assert type(HT_regular) == list
assert type(HT_regular[0]) == list # nested list
HT_negative = hashtag_extract(X_train[y_train == 1])# YOUR CODE HERE
HT_regular
and HT_negative
are nested lists, use the following trick to un-nest both listsHT_regular = sum(HT_regular,[])
HT_negative = sum(HT_negative,[])
HT_negative[0]
'democracy'
assert type(HT_regular) == type(HT_negative) == list
assert type(HT_regular[0]) == type(HT_negative[0]) == str
top_hashtags
that takes a list of hashtags and return the top n
hashtag keyword and its frequency. from typing import List, Tuple
from collections import Counter
def top_hashtags(hashtags:List[str], n=10) -> List[Tuple[str, int]]:
''' Function to return the top n hashtags '''
# YOUR CODE HERE
c = Counter(hashtags)
sorted_c = sorted(c.items(),key=lambda kv:kv[1], reverse = True)
return sorted_c[:n]
# YOUR CODE HERE
top_hashtags(HT_regular)
[('love', 67022), ('smile', 23494), ('healthy', 19185), ('life', 17773), ('cute', 16660), ('summer', 16549), ('blog', 16403), ('gold', 14687), ('thankful', 14423), ('positive', 14401)]
# YOUR CODE HERE
top_hashtags(HT_negative)
[('trump', 5731), ('allahsoil', 4834), ('retweet', 2536), ('liberal', 2515), ('libtard', 2400), ('politics', 2025), ('black', 1809), ('brexit', 1537), ('hate', 1390), ('tampa', 1376)]
Discuss: are these hashtags making sense? should we include them as features or should we strip the # before tokenizing (that is, treat "#love" the same as "love")? why and why not?
YOUR ANSWER HERE
Note that almost all the machine learning related Python modules expects numerical presentation of data; thus we need to transform our text first.We will experiment bag of words, tf-idf, and word2vec.
Convert the collection of text documents to a matrix of token counts.
Check the official documentation.
Create an instance of CountVectorizer
named bow_vectorizer
, set max_features
to be MAX_FEATURES
.
Learn the vocabulary dictionary and return document-term matrix and save it to bow_train
. Use .fit_transform
.
from sklearn.feature_extraction.text import CountVectorizer
MAX_FEATURES = 1000
bow_vectorizer = CountVectorizer(max_features = MAX_FEATURES) # YOUR CODE HERE
bow_train = bow_vectorizer.fit_transform(X_train) # YOUR CODE HERE
assert bow_train.shape == (X_train.shape[0], MAX_FEATURES)
bow_train
. Hint: .toarray()
.# YOUR CODE HERE
bow_train[0:3].toarray()
array([[0, 0, 0, ..., 0, 0, 0], [9, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]])
from scipy.sparse.csr import csr_matrix
assert type(bow_train) == csr_matrix
Similarly, convert the collection of text documents to a matrix of TF-IDF features.
Create an instance of TfidfVectorizer
named tfidf_vectorizer
, set max_features
to be MAX_FEATURES
.
Learn the vocabulary and idf, return document-term matrix and save it to tfidf_train
.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES, stop_words='english') # YOUR CODE HERE
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
assert type(tfidf_train) == csr_matrix
assert tfidf_train.shape == bow_train.shape == (X_train.shape[0], MAX_FEATURES)
Extract word embedding using Word2Vec. We will use gensim
for this task.
The Word2Vec model takes either a list of lists of tokens or an iterable that streams the sentences directly from disk/network. Here, we tokenize the tidy tweets in X_train
and save the list (pd.series
) of lists of tokens to tokenized_tweet
.
tokenized_tweet = pd.Series(X_train)# YOUR CODE HERE
assert tokenized_tweet.shape == X_train.shape
tokenized_tweet.head()
1036 llike spread peanut butter white bread #little... 2380 wwatching made america simpson 30for30 interes... 31605 ffrancis underwood seen leaving marseille #noj... 23437 ## #enjoy #music #today #free #apps #free #mus... 2669 ##juicing experience #notsobad #healthyliving ... Name: tidy_tweet, dtype: object
Import Word2Vec
from gensim.models
; see doc.
Create a skip-gram Word2Vec
instance named w2v
that learns on the tokenized_tweet
, with vector_size
set at MAX_FEATURES
, and other parameters are provided.
from gensim.models import Word2Vec
w2v = Word2Vec(
sentences=tokenized_tweet,
vector_size= MAX_FEATURES,# YOUR CODE HERE
window=5, min_count=2, sg = 1,
hs = 0, negative = 10, workers= 2,
seed = 34)
%%time
# YOUR CODE HERE
w2v.train(tokenized_tweet,total_examples=1, epochs=20)
most_similar
. Hint: print the type of w2v
and w2v.wv
.# YOUR CODE HERE
print(type(w2v))
print(type(w2v.wv))
w2v.wv.most_similar(positive=['love'])
Discuss: how does it calculate the similarities?
It uses Cosine similarity. Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
Discuss: do you think Word2Vec is supervised or unsupervised?
Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered.
Engineer features.
For each tweet, we calculate the average of embeddings (function word_vector
) and then apply it to every tidy tweet in X_train
(use function tokens_to_array
).
Both functions are provided, inspect the code and save the features in w2v_train
.
from gensim.models.keyedvectors import KeyedVectors
def word_vector(tokens:list, size:int, keyed_vec:KeyedVectors= w2v.wv):
vec = np.zeros(size).reshape((1, size))
count = 0
for word in tokens:
try:
vec += keyed_vec[word].reshape((1, size))
count += 1
except KeyError:
# handling the case where the token is not in vocabulary
continue
if count != 0:
vec /= count
return vec
def tokens_to_array(tokens:list, size:int, keyed_vec:KeyedVectors= w2v.wv):
array = np.zeros((len(tokens), size))
for i in range(len(tokens)):
array[i,:] = word_vector(tokens.iloc[i], size, keyed_vec=keyed_vec)
return array
w2v_train = tokens_to_array(X_train, size=MAX_FEATURES)# YOUR CODE HERE
assert w2v_train.shape == (X_train.shape[0], MAX_FEATURES)
X_test
using the bag of words approach; use bow_vectorizer
X_test
using the tf-idf approach; use tfidf_vectorizer
X_test
using Word2Vec embeddings; you need to first tokenized the tidy tweets in X_test
, then convert the tokens to array of shape (X_test.shape[0], MAX_FEATURES)
.bow_test = bow_vectorizer.transform(X_test) # YOUR CODE HERE
tfidf_test = tfidf_vectorizer.transform(X_test) # YOUR CODE HERE
tokenized_tweet_test = X_test.str.split() # YOUR CODE HERE
w2v_test = tokens_to_array(X_test, size=MAX_FEATURES) # YOUR CODE HERE
assert bow_test.shape == tfidf_test.shape == w2v_test.shape == (X_test.shape[0], MAX_FEATURES)
In this task, you will build naive Bayes, another ref, classifiers to identify the hate speech tweets using different sets of features from the last task, and evaluate their performances.
In the era of deep learning, naive Bayes are useful due to its simplicity and reasonable performance, especially if there is not much training data available. A common interview question is "Why is naive Bayes naive?".
We will use multi-variate Bernoulli naive Bayes BernoulliNB
; try other flavors of naive Bayes if time permits. Code is pretty straightforward.
BernoulliNB
for modeling and classification_report
for reporting performance. from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
Create an instance of BernoulliNB
named BNBmodel
.
We can use it for all three feature sets.
# YOUR CODE HERE
BNBmodel = BernoulliNB()
# YOUR CODE HERE (train the model)
BNBmodel.fit(bow_train, y_train)
# YOUR CODE HERE (report)
print(classification_report(BNBmodel.predict(bow_test), y_test))
Similarly, train the model using tf-idf features and print the performance report.
Is the performance expected? Why or why not?
# YOUR CODE HERE
BNBmodel2 = BernoulliNB()
BNBmodel2.fit(tfidf_train, y_train)
print(classification_report(BNBmodel2.predict(tfidf_test), y_test))
# YOUR CODE HERE
BNBmodel3 = BernoulliNB()
BNBmodel3.fit(w2v_train, y_train)
print(classification_report(BNBmodel3.predict(w2v_test), y_test))
Discuss the differences in performance using tf-idf vs skim-gram embedding.
YOUR ANSWER HERE
Examine a few tweets where the model(s) failed. What other features would you include in the next iteration?
YOUR ANSWER HERE
In this task, you will build a bidirectional LSTM (BiLSTM) model to detect tweets identified as hate speech, and visualize the embedding layer using Tensorboard projector.
Why BiLSTM? LSTM, at its core, preserves information from inputs that has already passed through it using the hidden state. Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past. BiLSTMs run inputs in both ways, one from past to future and one from future to past and show very good results as they can understand context better ref.
Tokenizing and padding.
As LSTM expects every sentence to be of the same length, in addition to Tokenizer
with a given number of vocabulary VOCAB_SIZE
, we need to pad shorter tweets with 0s until the length is MAX_LEN
and truncate longer tweets to be exact MAX_LEN
long.
Function tokenize_pad_sequences
is provided except that you need to supply correct num_words
and filters
; do NOT filter #
.
We feed the processed tidy_tweet
to tokenize_pad_sequences
, but one can perform the preprocessing steps in Tokenizer
and apply it directly on the raw tweets.
VOCAB_SIZE = 25000
MAX_LEN = 50
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
def tokenize_pad_sequences(text):
'''
tokenize the input text into sequences of integers and then
pad each sequence to the same length
'''
# Text tokenization
tokenizer = Tokenizer(
num_words=VOCAB_SIZE,# YOUR CODE HERE
filters='[^a-zA-Z#\s]',# YOUR CODE HERE
lower=True, split=' ', oov_token='oov')
tokenizer.fit_on_texts(text)
# Transforms text to a sequence of integers
X = tokenizer.texts_to_sequences(text)
# Pad sequences to the same length
X = pad_sequences(X, padding='post', maxlen=MAX_LEN)
return X, tokenizer
print('Before Tokenization & Padding \n', raw['tidy_tweet'][0])
X, tokenizer = tokenize_pad_sequences(raw['tidy_tweet'])
print('After Tokenization & Padding \n', X[0])
y = raw['label'].values
X
into training and testing datasets, save 25% for testing. Then split training dataset into training and validation datasets, with 20% for validation. Set both random_state
to be 42. Both splits shall be done by stratification.X_train, X_test, y_train, y_test = train_test_split(
# YOUR CODE HERE
X, raw.label,
test_size=0.25, random_state=42, stratify=raw.label
)
X_train, X_val, y_train, y_val = train_test_split(
# YOUR CODE HERE
X_train, y_train,
test_size=0.20, random_state=42, stratify=y_train
)
print('Train Set ->', X_train.shape, y_train.shape)
print('Validation Set ->', X_val.shape, y_val.shape)
print('Test Set ->', X_test.shape, y_test.shape)
Now build a sequential model:
return_sequences=True
in LSTMfrom keras.models import Sequential
# YOUR CODE HERE (layer imports)
from tensorflow.keras import layers
EMBEDDING_DIM = 16
model = Sequential([
# YOUR CODE HERE
tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LEN), # embedding layer
tf.keras.layers.Bidirectional(layers.LSTM(32, return_sequences=True)), #bLSTM layer
tf.keras.layers.GlobalAveragePooling1D(), # pooling layer
tf.keras.layers.Dropout(0.20), # dropout layer
tf.keras.layers.Dense(32, activation="relu"), # ReLu layer
tf.keras.layers.Dense(1, activation="sigmoid") # classification layer
])
model.summary()
Compile the model.
Fill in a proper loss function and use adam as the optimizer. For metrics, include precision and recall in the metrics, in addition to accuracy.
from keras.metrics import Precision, Recall
model.compile(
loss= "binary_crossentropy", # YOUR CODE HERE
optimizer='adam',
metrics= ["accuracy", # YOUR CODE HERE
"precision",
"recall"
]
)
EPOCHS=10
BATCH_SIZE = 32
history = model.fit(X_train, y_train,
validation_data= [x_val, y_val]# YOUR CODE HERE
batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=2)
Function plot_graphs
is provided below to visualize how the performance of model progresses as a function of epoch.
Visualize accuracy and loss.
def plot_graphs(history, metric):
fig, ax = plt.subplots()
plt.plot(history.history[metric])
plt.plot(history.history['val_'+metric], '')
ax.set_xticks(range(EPOCHS))
plt.xlabel("Epochs")
plt.ylabel(metric)
plt.legend([metric, 'val_'+metric])
# YOUR CODE HERE
plot_graphs(history,"accuracy")
# YOUR CODE HERE
plot_graphs(history,"loss")
The model starts to overfit in a couple of epochs. Consider using early stopping to stop training when a monitored metric has stopped improving.
What can we do to tame overfitting?
using cross-validation, training with more data, or stopping earlier (as suggested).
# YOUR CODE HERE
model_pred = pd.cut(pd.DataFrame(model.predict(X_test))[0], bins=[0,.5,1], labels=[0,1])
print(classification_report(y_test, model_pred))
Discuss: how does the BiLSTM model improve the classification over naive Bayes?
YOUR ANSWER HERE
# # NB using tf-idf
# precision recall f1-score support
# 0 0.96 0.97 0.97 7430
# 1 0.55 0.48 0.51 561
# accuracy 0.94 7991
# macro avg 0.75 0.72 0.74 7991
# weighted avg 0.93 0.94 0.93 7991
# # NB using word2vec
# precision recall f1-score support
# 0 0.98 0.85 0.91 7430
# 1 0.29 0.82 0.43 561
# accuracy 0.85 7991
# macro avg 0.64 0.83 0.67 7991
# weighted avg 0.94 0.85 0.88 7991
Visualize embedding using Embedding Projector in Tensorboard. The setup for Tensorboard can be tricky, most code is provided.
TensorBoard reads tensors and metadata from the logs of your tensorflow projects. The path to the log directory is specified with log_dir below.
In order to load the data into Tensorboard, we need to save a training checkpoint to that directory, along with metadata that allows for visualization of a specific layer of interest in the model.
Load the TensorBoard notebook extension and import projector
from tensorboard.plugins
.
%load_ext tensorboard
from tensorboard.plugins import projector
rm -rf /logs/
import os
log_dir='/logs/tweets-example/'
if not os.path.exists(log_dir):
os.makedirs(log_dir)
VOCAB_SIZE
most frequent words in the vocabulary as metadata.tsv
.with open(os.path.join(log_dir, 'metadata.tsv'), "w") as f:
i = 0
for label in tokenizer.word_index.keys():
if label == 'oov':
continue # skip oov
f.write("{}\n".format(label))
if i > VOCAB_SIZE:
break
i += 1
weights = tf.Variable(model.layers[0].get_weights()[0][1:]) # `embeddings` has a shape of (num_vocab, embedding_dim)
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
/.ATTRIBUTES/VARIABLE_VALUE
.embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)
ls /logs/tweets-example/
Now run Tensorboard against on log data we just saved.
You may need to run this cell twice to see the projector correctly. Use Chrome for least friction.
%tensorboard --logdir /logs/tweets-example/
The TensorBoard Projector can be a great tool for interpreting and visualzing embedding. The dashboard allows users to search for specific terms, and highlights words that are adjacent to each other in the embedding (low-dimensional) space. Try a few word in the Search box and see if the embeddings make sense.
Lastly let's try to understnad predictions by BiLSTM using a model agnostic approach -- Local interpretable model-agnostic explanations (LIME)
from lime.lime_text import LimeTextExplainer
LimeTextExplainer
, call it explanier
. explainer = LimeTextExplainer(class_names=['no', 'yes'], random_state=2)
explain_instance
expects the classifier_fn
to be a function, we provide the function predict_proba
as below. def predict_proba(arr):
processed = tokenizer.texts_to_sequences(arr)
processed = pad_sequences(processed, padding='post', maxlen=MAX_LEN)
pred = model.predict(processed)
r = []
for i in pred:
temp = i[0]
r.append(np.array([1-temp,temp]))
return np.array(r)
Read about explain_instance
.
Create an instance named exp
to explain the 16399th tidy tweet from the original dataset, i.e., raw.tidy_tweet.iloc[16399]
.
idx = 16399
exp = explainer.explain_instance(
# YOUR CODE HERE
raw.tidy_tweet.iloc[idx],
num_features=6)
exp.show_in_notebook(text=raw.tidy_tweet.iloc[idx])
# YOUR CODE HERE
Jolt down your observations in explaining the model.
YOUR ANSWER HERE
gensim
in buiding an embedding layers in Tensorflow, here's how-to.Answers to additional questions
How does the Naive Bayes Classifier work? What is Posterior Probability?
The NBC uses Bayes Theorem to predict the probability that the data point belongs to each class. Bayes theorem predicts based on what has already occured, or the posterior probability. What is the difference between stemming and lemmatization in NLP?
Stemming simply removes characters from the ends of words, which can lead to incorrect meanings (e.g., univers for both universal and universe). Lemmatization converts words to a base format while considering the context, which allows the meaning of the words to be kept with the base form. It is interesting to note that in cases where a the same word is used in different context, and therefore has a different meaning, the same word can have differen lemmas. What is Word2Vec and how does it work?
Word2Vec is a predictive deep learning based model which attempts to calculate and extract the relationships between words based on their occurances within a body of text. It works by converting the words into vector representations. Mathematical functions, such as cosine similarity, can be applied to the word vector representations to extract their similarity. When to use GRU over LSTM?
GRU stands for gated recurrent units and LSTM stands for long short-term memory. These are both types of recurrent neural networks. GRUs should be used when you have limited computing resources or need to get your answer faster. LSTMs are more accurate, especially on larger/longer datasets. The main differences are that the GRU does not have an output gate and the input and forget gates are combined into a single update gate.