Detect hate speech

Today you are a machine learning engineer on the Birdwatch team at Twitter.

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, here a tweet contains hate speech if it has a racist or sexist sentiment associated with it. In other words, we need to separate racist or sexist tweets from all other tweets.

A labelled dataset of 31,962 tweets (late 2017 to early 2018) is provided in the form of a compressed CSV file, with each line storing a tweet id, its label, and the tweet. Label '1' denotes that the tweet is racist/sexist, while label '0' denotes that it is not.

We will first approach the problem in a traditional way: clean the raw text using simple regexes (regular expressions), extract features, and build naive Bayes models to classify tweets; then we will build a deep learning model and explain it with LIME.

Learning Objectives

By the end of this lesson, you will be able to:

Task 1. Data Preprocessing

  1. Start with dependencies.

    Most modules are pre-installed in Colab; however, we need to update gensim to a recent version and install lime.
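For example, in a Colab cell (you may need to restart the runtime after the upgrade):

```python
# upgrade gensim and install lime in the Colab runtime
!pip install -U gensim lime
```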

  1. Connect Colab to your Google Drive.
  1. Use pandas.read_csv to load the tweets in tweets.csv.gz and save the pd.DataFrame into raw. Make sure the path points to where the data is located in your Google Drive.
  1. Sample 5 random tweets from the dataset for each label and display label and tweet columns. Hint: one option is to use sample() followed by groupby.
  1. The tweets are in English and all words should already be in lowercase. Now calculate the number of characters in each tweet and assign the values to a new column len_tweet in raw.
  1. What are the summary statistics of len_tweet for each label? Hint: use groupby and describe.
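A minimal sketch of the steps above (mount Drive, load the data, sample per label, and summarize len_tweet), assuming the columns are named label and tweet; the Drive path is hypothetical:

```python
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

# hypothetical path -- point this at wherever tweets.csv.gz lives in your Drive
raw = pd.read_csv('/content/drive/MyDrive/tweets.csv.gz')

# 5 random tweets per label: shuffle, then take the first 5 rows of each group
print(raw.sample(frac=1, random_state=42).groupby('label').head(5)[['label', 'tweet']])

# number of characters per tweet, then per-label summary statistics
raw['len_tweet'] = raw['tweet'].str.len()
print(raw.groupby('label')['len_tweet'].describe())
```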

Note that this is again an imbalanced dataset: the ratio of non-hate speech to hate speech is roughly 13:1.

  1. Clean the tweets.

    We use re to perform basic text manipulation. Specifically, we remove anonymized user handles, numbers, and special characters except hashtags. A combined sketch of these steps follows the example below.

  1. Remove user handles, i.e., anything directly following the symbol @, from the text in tweet, and save the resulting tweets to a new column tidy_tweet in raw.

    Hint: you can use re.sub on individual text and apply a simple lambda function to the series raw['tweet'].

  1. Remove non-alphabetic characters, except the symbol #, from tidy_tweet and save the result back in tidy_tweet. In other words, keep only the 26 letters and #.

    Note: in some applications, punctuation, emojis, or whether a word is in all caps can be useful. You should decide whether to extract such features for your application and perform error analysis to gain insight.

  1. Remove words that are shorter than 4 characters from the processed tweets.

    For example,

    i m on a mission to ride all of the animals #teamchanlv #vegas #lasvegas #funtimes

    will be reduced to

    mission ride animals #teamchanlv #vegas #lasvegas #funtimes
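A combined sketch of the three cleaning steps above; clean_tweet is a hypothetical helper, and the exact regexes may differ from the intended solution:

```python
import re

def clean_tweet(text):
    text = re.sub(r'@\w+', '', text)             # remove user handles (@...)
    text = re.sub(r'[^a-zA-Z#]', ' ', text)      # keep letters and '#', replace everything else with a space
    return ' '.join(w for w in text.split() if len(w) >= 4)  # drop words shorter than 4 characters

raw['tidy_tweet'] = raw['tweet'].apply(clean_tweet)
```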

  1. Remove stop words and normalize the text.

    We will use the stopwords collection and SnowballStemmer from nltk for this task. Before doing so, we need to tokenize the tweets. Tokens are individual terms or words, and tokenization simply splits a string of text into tokens. You can use str.split() on individual text, apply a simple lambda function to the series raw['tidy_tweet'], and save the result into tokenized_tweet. A sketch covering these steps appears after the last step of this task.

    Check out some methods for the built-in type str here.

  1. Extract stop words and remove them from the tokens.

    Note: depending on the task / industry, it is highly recommended that one curate a custom stop word list.

  1. Create a new instance of a language-specific SnowballStemmer with the language set to "english"; see how to.
  1. Lastly, let's stitch these tokens in tokenized_tweet back together and save them in raw['tidy_tweet']. Use str.join() and apply.
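A sketch of the tokenize / stop word removal / stemming / re-join steps above, assuming the default English stop word list from nltk:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

# tokenize, drop stop words, stem, then stitch the tokens back together
tokenized_tweet = raw['tidy_tweet'].apply(lambda t: t.split())
tokenized_tweet = tokenized_tweet.apply(
    lambda tokens: [stemmer.stem(w) for w in tokens if w not in stop_words])
raw['tidy_tweet'] = tokenized_tweet.apply(lambda tokens: ' '.join(tokens))
```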

Task 2. Wordcloud and Hashtag

In this task, we want to gain a general idea of what the common words were and how hashtags were used in tweets. We will create wordclouds and extract the top hashtags used in each label.

  1. Before doing so, to guard against possible data leakage, split raw['tidy_tweet'] (and the corresponding labels) into training and test datasets in a stratified fashion; set the test size to 0.25 and the random state to 42.

    Save the results into X_train, X_test, y_train, and y_test.
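A sketch of the stratified split, assuming the label column is named label:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    raw['tidy_tweet'], raw['label'],
    test_size=0.25, random_state=42, stratify=raw['label'])
```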

  1. A word cloud is a cluster of words depicted in different sizes. The bigger a word appears, the more often it occurs in the given text. It can offer an easy visual presentation that reveals the theme of a topic.

    Function plot_wordcloud is provided to plot the 50 most frequent words from the given text in the shape of Twitter's logo. You may need to replace the image path accordingly.

  1. Visualize the wordcloud.

    Note that the function expects one long string. Stitch together all tidy tweets from the training set and save the single string to all_words, then visualize the wordcloud for all the words.

  1. Visualize the wordcloud just for the text from the tweets identified as hate speech.

    Similarly, you need to stitch together all the tidy tweets in the training set that were identified as hate speech. Save the long string to negative_words.

  1. Hashtags are a defining feature of tweets, and we would like to inspect whether they provide information for our classification task.

    Function hashtag_extract is provided to extract hashtags from an iterable (list or series) and return the hashtags in a list.

  1. Extract hashtags from non-hate speech tweets and save them to HT_regular.
  1. Now extract hashtags from hate speech tweets and save them to HT_negative.
  1. Both HT_regular and HT_negative are nested lists; use the following trick to un-nest (flatten) both lists.
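For example, one common flattening idiom (the trick provided in the notebook may differ):

```python
# un-nest a list of lists into a single flat list
HT_regular = sum(HT_regular, [])
HT_negative = sum(HT_negative, [])
```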
  1. Complete the function top_hashtags that takes a list of hashtags and returns the top n hashtag keywords and their frequencies.
  1. Apply the function to the hashtag lists from the non-hate speech tweets and the hate speech tweets.
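A possible completion of top_hashtags and its application, using collections.Counter:

```python
from collections import Counter

def top_hashtags(hashtags, n=10):
    # return the n most frequent hashtags with their counts
    return Counter(hashtags).most_common(n)

print(top_hashtags(HT_regular, 10))
print(top_hashtags(HT_negative, 10))
```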
  1. Discuss: do these hashtags make sense? Should we include them as features, or should we strip the # before tokenizing (that is, treat "#love" the same as "love")? Why or why not?

       YOUR ANSWER HERE

Task 3. Features

Note that almost all machine learning related Python modules expect a numerical representation of the data; thus we need to transform our text first. We will experiment with bag of words, TF-IDF, and Word2Vec.

  1. Convert the collection of text documents to a matrix of token counts.

    Check the official documentation.

    Create an instance of CountVectorizer named bow_vectorizer with max_features set to MAX_FEATURES. Learn the vocabulary dictionary, return the document-term matrix, and save it to bow_train. Use .fit_transform.

  1. Print the first three rows from bow_train. Hint: .toarray().
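A sketch of the bag-of-words features; the value of MAX_FEATURES is an assumption (use the constant defined in the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

MAX_FEATURES = 1000  # assumed value

bow_vectorizer = CountVectorizer(max_features=MAX_FEATURES)
bow_train = bow_vectorizer.fit_transform(X_train)   # learn the vocabulary, return the document-term matrix

print(bow_train[:3].toarray())                      # first three rows as a dense array
```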
  1. Similarly, convert the collection of text documents to a matrix of TF-IDF features.

    Create an instance of TfidfVectorizer named tfidf_vectorizer, set max_features to be MAX_FEATURES.

    Learn the vocabulary and IDF, return the document-term matrix, and save it to tfidf_train.
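Similarly, a sketch for the TF-IDF features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)   # learn the vocabulary and IDF, return the document-term matrix
```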

  1. Extract word embedding using Word2Vec. We will use gensim for this task.

    The Word2Vec model takes either a list of lists of tokens or an iterable that streams the sentences directly from disk/network. Here, we tokenize the tidy tweets in X_train and save the list (pd.Series) of lists of tokens to tokenized_tweet.

  1. Import Word2Vec from gensim.models; see doc.

    Create a skip-gram Word2Vec instance named w2v that learns on tokenized_tweet, with vector_size set to MAX_FEATURES; the other parameters are provided.

  1. Train the skip-gram model, set the epochs at 20.
  1. Let's see how the model performs. Specify a word (e.g., 'dinner' or 'trump') and print out the 10 most similar words from our tweets in the training set. Use most_similar. Hint: print the type of w2v and w2v.wv.
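A sketch of the Word2Vec steps above; window, min_count, workers, and seed are illustrative choices, not the provided parameters:

```python
from gensim.models import Word2Vec

# tokenize the tidy tweets in the training set
tokenized_tweet = X_train.apply(lambda t: t.split())

# skip-gram model (sg=1)
w2v = Word2Vec(vector_size=MAX_FEATURES, sg=1, window=5, min_count=2, workers=2, seed=42)
w2v.build_vocab(tokenized_tweet)
w2v.train(tokenized_tweet, total_examples=w2v.corpus_count, epochs=20)

print(type(w2v), type(w2v.wv))                  # the model vs. the KeyedVectors it wraps
print(w2v.wv.most_similar('dinner', topn=10))   # assumes 'dinner' survived preprocessing and min_count
```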
  1. Discuss: how does it calculate the similarities?

    It uses cosine similarity. Cosine similarity measures the similarity between two vectors of an inner product space. It is the cosine of the angle between the two vectors and indicates whether they point in roughly the same direction.
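For instance, a minimal sketch of the formula applied to two learned vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# assumes both words are in the learned vocabulary
print(cosine_similarity(w2v.wv['dinner'], w2v.wv['lunch']))
```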

  1. Discuss: do you think Word2Vec is supervised or unsupervised?

    Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered.

  1. Engineer features.

    For each tweet, we calculate the average of its word embeddings (function word_vector) and then apply this to every tidy tweet in X_train (function tokens_to_array). Both functions are provided; inspect the code and save the features in w2v_train.
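A rough sketch of what the provided helpers might look like; the actual implementations in the notebook may differ:

```python
import numpy as np

def word_vector(tokens, size):
    # average the embeddings of the tokens that are in the Word2Vec vocabulary
    vec, count = np.zeros(size), 0
    for word in tokens:
        if word in w2v.wv:
            vec += w2v.wv[word]
            count += 1
    return vec / count if count else vec

def tokens_to_array(tweets, size):
    return np.vstack([word_vector(tweet.split(), size) for tweet in tweets])

w2v_train = tokens_to_array(X_train, MAX_FEATURES)
```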

  1. Prepare test data before modeling for each approach:
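For example, transforming the held-out tweets with the already-fitted vectorizers (never refit on test data):

```python
bow_test = bow_vectorizer.transform(X_test)
tfidf_test = tfidf_vectorizer.transform(X_test)
w2v_test = tokens_to_array(X_test, MAX_FEATURES)
```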

Task 4. Naive Bayes classifiers

In this task, you will build naive Bayes classifiers (another ref) to identify the hate speech tweets using the different sets of features from the last task, and evaluate their performance.

In the era of deep learning, naive Bayes classifiers remain useful due to their simplicity and reasonable performance, especially when there is not much training data available. A common interview question is "Why is naive Bayes naive?".

We will use the multivariate Bernoulli naive Bayes BernoulliNB; try other flavors of naive Bayes if time permits. The code is straightforward; a combined sketch follows the training steps below.

  1. Import BernoulliNB for modeling and classification_report for reporting performance.
  1. Create an instance of BernoulliNB named BNBmodel.

    We can use it for all three feature sets.

  1. Train the multivariate Bernoulli naive Bayes using bag of words features and print the performance report.
  1. Similarly, train the model using tf-idf features and print the performance report.

    Is the performance expected? Why or why not?

  1. Finally, train the model using Word2Vec embeddings and report the performance.
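A combined sketch of this task, reusing the feature names from the earlier sketches (note that BernoulliNB binarizes the continuous Word2Vec features at 0 by default):

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

BNBmodel = BernoulliNB()

# refit the same estimator on each feature set and report test performance
for name, (features_train, features_test) in {
        'bag of words': (bow_train, bow_test),
        'tf-idf': (tfidf_train, tfidf_test),
        'word2vec (averaged embeddings)': (w2v_train, w2v_test)}.items():
    BNBmodel.fit(features_train, y_train)
    print(name)
    print(classification_report(y_test, BNBmodel.predict(features_test)))
```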
  1. Discuss the differences in performance using TF-IDF vs. skip-gram embeddings.

    YOUR ANSWER HERE

  1. Examine a few tweets where the model(s) failed. What other features would you include in the next iteration?

    YOUR ANSWER HERE

Task 5. Bidirectional LSTM

In this task, you will build a bidirectional LSTM (BiLSTM) model to detect tweets identified as hate speech, and visualize the embedding layer using the TensorBoard projector.

Why BiLSTM? An LSTM, at its core, preserves information from inputs that have already passed through it using the hidden state. A unidirectional LSTM only preserves information from the past because the only inputs it has seen are from the past. A BiLSTM runs inputs in both directions, one from past to future and one from future to past, and shows very good results because it can understand context better (ref).

  1. Tokenizing and padding.

    As the LSTM expects every sequence to be of the same length, in addition to using Tokenizer with a vocabulary size of VOCAB_SIZE, we need to pad shorter tweets with 0s until their length is MAX_LEN and truncate longer tweets to be exactly MAX_LEN long.

    Function tokenize_pad_sequences is provided, except that you need to supply the correct num_words and filters; do NOT filter out #.

    We feed the processed tidy_tweet to tokenize_pad_sequences, but one could also perform the preprocessing steps in Tokenizer and apply it directly to the raw tweets.
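A sketch of what tokenize_pad_sequences might look like; VOCAB_SIZE, MAX_LEN, the 'post' padding/truncating choices, and returning the tokenizer are assumptions, and the provided function may differ:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 5000   # assumed values; use the constants defined in the notebook
MAX_LEN = 50

def tokenize_pad_sequences(texts, num_words=VOCAB_SIZE, max_len=MAX_LEN):
    # note: '#' is removed from Keras' default filter list so hashtags survive
    tokenizer = Tokenizer(num_words=num_words,
                          filters='!"$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    return pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post'), tokenizer

X, tokenizer = tokenize_pad_sequences(raw['tidy_tweet'])
```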

  1. Let's split X into training and test datasets, saving 25% for testing. Then split the training dataset into training and validation datasets, with 20% for validation. Set random_state to 42 for both. Both splits should be stratified.
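A sketch of the two stratified splits, assuming y comes from raw['label']:

```python
from sklearn.model_selection import train_test_split

y = raw['label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42, stratify=y_train)
```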
  1. Now build a sequential model:

    • an embedding layer
    • a bidirectional LSTM with 32 units and set return_sequences=True in LSTM
    • a global average pooling operation for temporal data
    • a dropout layer with 20% rate
    • a dense layer of 32 units and set the activation function to be ReLU
    • a dense layer of 1 unit and set the proper activation function for classification
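A sketch of the architecture just described; EMBEDDING_DIM is an assumed size, not a value given in the lab:

```python
import tensorflow as tf
from tensorflow.keras import layers

EMBEDDING_DIM = 64   # assumed embedding size

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),   # binary classification
])
```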
  1. Compile the model.

    Fill in a proper loss function and use adam as the optimizer. For metrics, include precision and recall in addition to accuracy.
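For example:

```python
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy',
                       tf.keras.metrics.Precision(name='precision'),
                       tf.keras.metrics.Recall(name='recall')])
```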

  1. Train the model for 10 epochs on training dataset with a validation set.
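A sketch of the training call; the batch size is an assumption:

```python
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=10, batch_size=32)
```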
  1. Function plot_graphs is provided below to visualize how the performance of the model progresses as a function of the epoch.

    Visualize accuracy and loss.

  1. The model starts to overfit in a couple of epochs. Consider using early stopping to stop training when a monitored metric has stopped improving.

    What can we do to tame overfitting?

    Options include using cross-validation, training with more data, adding regularization (e.g., dropout), or stopping earlier (as suggested).
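For example, early stopping with Keras (the patience and the monitored metric are illustrative choices):

```python
from tensorflow.keras.callbacks import EarlyStopping

# stop once validation loss stops improving and keep the best weights seen
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=10, batch_size=32, callbacks=[early_stop])
```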
  1. Print the classification report of the model on test dataset.
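For example, thresholding the predicted probabilities at 0.5:

```python
from sklearn.metrics import classification_report

y_pred = (model.predict(X_test) >= 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred))
```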
  1. Discuss: how does the BiLSTM model improve the classification over naive Bayes?

    YOUR ANSWER HERE

  1. Visualize the embedding using the Embedding Projector in TensorBoard. The setup for TensorBoard can be tricky; most of the code is provided (a combined sketch follows the final step of this task).

    TensorBoard reads tensors and metadata from the logs of your TensorFlow projects. The path to the log directory is specified with log_dir below.

    In order to load the data into TensorBoard, we need to save a training checkpoint to that directory, along with metadata that allows for visualization of a specific layer of interest in the model.

    Load the TensorBoard notebook extension and import projector from tensorboard.plugins.

  1. Clear any logs from previous runs, if any.
  1. Set up a logs directory, so TensorBoard knows where to look for data.
  1. Save the first VOCAB_SIZE most frequent words in the vocabulary as metadata.tsv.
  1. Save the weights we want to analyze as a variable. Note that the first value represents any unknown word, which is not in the metadata, so we will remove it here.
  1. Create a checkpoint from the embedding; the filename and key are the name of the tensor.
  1. Set up config.
  1. The name of the tensor will be suffixed by /.ATTRIBUTES/VARIABLE_VALUE.
  1. Verify that the following files exist under the current directory.
  1. Now run TensorBoard on the log data we just saved.

    You may need to run this cell twice to see the projector correctly. Use Chrome for the least friction.
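A combined sketch of the projector setup above, meant for Colab cells (the %-magics are notebook commands) and closely following the standard TensorBoard embedding-projector recipe; the log directory name and the assumption that the embedding layer is model.layers[0] are mine, not the lab's:

```python
%load_ext tensorboard

import os
import shutil
import tensorflow as tf
from tensorboard.plugins import projector

# clear old logs and set up a fresh log directory
log_dir = 'logs/embedding'
shutil.rmtree('logs', ignore_errors=True)
os.makedirs(log_dir, exist_ok=True)

# embedding weights; row 0 corresponds to padding / unknown words and has no metadata entry
weights = model.layers[0].get_weights()[0]

# write one metadata line per remaining embedding row (most frequent words first)
words = [w for w, i in sorted(tokenizer.word_index.items(), key=lambda kv: kv[1]) if i < weights.shape[0]]
with open(os.path.join(log_dir, 'metadata.tsv'), 'w') as f:
    f.write('\n'.join(words) + '\n')

# save the weights (minus row 0) as a checkpointed variable named "embedding"
embedding_var = tf.Variable(weights[1:])
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(os.path.join(log_dir, 'embedding.ckpt'))

# point the projector config at the checkpointed tensor and the metadata file
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'embedding/.ATTRIBUTES/VARIABLE_VALUE'
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)

%tensorboard --logdir logs/embedding
```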

The TensorBoard Projector can be a great tool for interpreting and visualizing embeddings. The dashboard allows users to search for specific terms, and highlights words that are adjacent to each other in the embedding (low-dimensional) space. Try a few words in the Search box and see if the embeddings make sense.

Task 6. Interpretation

Lastly, let's try to understand the predictions made by the BiLSTM using a model-agnostic approach -- Local Interpretable Model-agnostic Explanations (LIME).

  1. Import LimeTextExplainer from the lime_text module in the lime package.
  1. Create an instance of LimeTextExplainer and call it explainer.
  1. The method explain_instance expects classifier_fn to be a function; we provide the function predict_proba below.
  1. Read about explain_instance.

    Create an instance named exp to explain the tidy tweet at index 16399 of the original dataset, i.e., raw.tidy_tweet.iloc[16399].
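A sketch of the LIME pieces, reusing the tokenizer and MAX_LEN from the Task 5 sketch; the provided predict_proba may differ in details:

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=['not hate speech', 'hate speech'])

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities
    seqs = pad_sequences(tokenizer.texts_to_sequences(texts),
                         maxlen=MAX_LEN, padding='post', truncating='post')
    p_pos = model.predict(seqs)            # probability of the positive (hate speech) class
    return np.hstack([1 - p_pos, p_pos])

exp = explainer.explain_instance(raw.tidy_tweet.iloc[16399], predict_proba, num_features=10)
exp.show_in_notebook()
```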

  1. Pick another random tweet and generate explanations for the prediction.
  1. Jot down your observations from explaining the model.

    YOUR ANSWER HERE

Acknowledgement & Reference

Answers to additional questions

How does the Naive Bayes Classifier work? What is Posterior Probability?

The naive Bayes classifier uses Bayes' theorem to predict the probability that a data point belongs to each class. The posterior probability is the probability of a class given the observed data; Bayes' theorem computes it from the prior probability of the class and the likelihood of the data under that class.

What is the difference between stemming and lemmatization in NLP?

Stemming simply removes characters from the ends of words, which can lead to incorrect meanings (e.g., univers for both universal and universe). Lemmatization converts words to a base form while considering the context, which allows the meaning of the words to be kept with the base form. It is interesting to note that when the same word is used in different contexts, and therefore has different meanings, it can have different lemmas.
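A small nltk illustration of the difference; it assumes the WordNet data has been downloaded:

```python
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('universal'), stemmer.stem('universe'))   # univers univers
print(lemmatizer.lemmatize('leaves', pos='n'))               # leaf (noun context)
print(lemmatizer.lemmatize('leaves', pos='v'))               # leave (verb context)
```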

What is Word2Vec and how does it work?

Word2Vec is a predictive, neural network based model that attempts to capture the relationships between words based on their co-occurrences within a body of text. It works by converting words into vector representations. Mathematical functions, such as cosine similarity, can be applied to the word vectors to measure their similarity.

When to use GRU over LSTM?

GRU stands for gated recurrent unit and LSTM stands for long short-term memory; both are types of recurrent neural networks. GRUs should be used when you have limited computing resources or need answers faster. LSTMs tend to be more accurate, especially on larger/longer sequences. The main differences are that the GRU does not have an output gate and that its input and forget gates are combined into a single update gate.