Subscription Prediction with PySpark and MLlib

Learning Objectives

By the end of this session, you will be able to load and explore data with PySpark, prepare features with MLlib transformers, and train and evaluate classification models to predict term-deposit subscriptions.

Part 1: Data Loader

We are using a dataset from the UCI Machine Learning Repository.

  1. Use wget to download the dataset. Then use ls to verify that the bank.zip file was downloaded.
  1. Unzip the file and use ls to see the files.
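
If you're working in a Jupyter or Colab notebook, a cell like the following is one way to do this (the `!` prefix runs shell commands, and the URL is an assumption about where the UCI bank marketing data lives; check the course materials for the exact link):

```python
# Download the dataset (URL is an assumption; adjust if your course provides a different link).
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
!ls            # verify bank.zip was downloaded

# Unzip and list the extracted files.
!unzip -o bank.zip
!ls
```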

Part 2: Exploring The Data

We will use the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (Yes/No) to a term deposit.

  1. Load in the data and look at the columns.
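
A minimal sketch of this step, assuming a SparkSession named `spark` is available and the unzipped file is `bank.csv` (the header row and `;` delimiter are assumptions about how the file is formatted):

```python
# Read the CSV into a Spark DataFrame, letting Spark infer the column types.
df = spark.read.csv('bank.csv', header=True, inferSchema=True, sep=';')

df.printSchema()   # column names and inferred types
print(df.columns)  # just the column names
```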

Here are the columns you should see:

  1. Take a peek at the first five observations. Use the .show() method.
  1. To get a prettier result, it can be nice to use pandas to display the smaller DataFrame. Use the Spark .take() method to get the first 5 rows, then convert them to a pandas DataFrame. Don't forget to pass along the column names. You should see the same result as above, but in a more aesthetically appealing format.
  1. How many datapoints are there in the dataset? Use the .count() method.
  1. Use the .describe() method to see summary statistics on the features.

    Note that the result of .describe() is a Spark DataFrame, so the contents won't be displayed. It only has 5 rows, so you can just convert the whole thing to a pandas DataFrame with .toPandas().

  1. The above result includes the categorical columns, which don't have useful summary statistics. Let's limit the output to just the numeric features.

    numeric_features is defined below to contain the column names of the numeric features.

    Use the .select() method to select only the numeric features from the DataFrame and then get the summary statistics on the resulting DataFrame as we did above.

  1. Run the following code to look at the correlations between the numeric features. What do you see? (A combined sketch of these exploration steps appears just below.)
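
Here is a combined sketch of the exploration steps in this list, including the correlation check. It assumes `df` was loaded as above and that `numeric_features` matches the notebook's own definition:

```python
import pandas as pd

# Peek at the first five rows.
df.show(5)

# Prettier display: take 5 rows and hand them to pandas along with the column names.
pd.DataFrame(df.take(5), columns=df.columns)

# How many datapoints are there?
print(df.count())

# numeric_features is defined in the notebook; one common definition is:
# numeric_features = [t[0] for t in df.dtypes if t[1] == 'int']
df.select(numeric_features).describe().toPandas()

# Pairwise correlations between the numeric features.
numeric_data = df.select(numeric_features).toPandas()
print(numeric_data.corr())
```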

There aren't any highly correlated numeric variables, so we will keep them all for the model. However, the day and month columns are not really useful, so we will remove these two columns.

  1. Use the .drop() method to drop the month and day columns.

    Note that this method returns a new DataFrame, so save that result as df.

    Use the .printSchema() method to verify that df now has the correct columns.
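
A short sketch of this step:

```python
# .drop() returns a new DataFrame, so reassign the result to df.
df = df.drop('month', 'day')
df.printSchema()   # month and day should no longer appear
```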

Part 3: Preparing Data for Machine Learning

What follows is something analogous to a data-loading pipeline in TensorFlow: we're going to chain together some transformations that will convert our categorical variables into a one-hot format more amenable to training a machine learning model. The next code cell only sets this up; it doesn't yet run these transformations on our data.

The process includes category indexing, one-hot encoding, and VectorAssembler, a feature transformer that merges multiple columns into a single vector column.

The code is adapted from Databricks' documentation. It indexes each categorical column using StringIndexer, then converts the indexed categories into one-hot encoded variables; the resulting binary vectors are appended to the end of each row. We use StringIndexer again to encode our labels as label indices. Finally, we use VectorAssembler to combine all the feature columns into a single vector column.

  1. Complete the code by filling in the assignment of assembler. Use VectorAssembler, passing assemblerInputs as inputCols and naming the outputCol "features".
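
For reference, here is a sketch of how these stages are typically wired together. The list of categorical columns and the label column name 'deposit' are assumptions; use the names defined in the notebook's own cell:

```python
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Assumed categorical input columns; substitute the notebook's own list.
categoricalColumns = ['job', 'marital', 'education', 'default',
                      'housing', 'loan', 'contact', 'poutcome']

stages = []
for categoricalCol in categoricalColumns:
    # Index each categorical column, then one-hot encode the index.
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()],
                            outputCols=[categoricalCol + 'classVec'])
    stages += [stringIndexer, encoder]

# Encode the label column (name assumed) as a numeric label index.
label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]

# Merge the one-hot vectors and the numeric columns into a single feature vector.
assemblerInputs = [c + 'classVec' for c in categoricalColumns] + numeric_features
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol='features')
stages += [assembler]
```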

Part 4: Pipeline

We use Pipeline to chain multiple Transformers and Estimators together to specify our machine learning workflow. A Pipeline’s stages are specified as an ordered array.

  1. Fit a pipeline on df.
  1. Use pipelineModel to transform df and assign the result to the variable transformed_df.
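
A sketch of these two steps, assuming `stages` is the list of transformers built in Part 3:

```python
from pyspark.ml import Pipeline

# Chain the transformation stages, fit them on df, then apply them to df.
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
transformed_df = pipelineModel.transform(df)
```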

From the transformation, we'd like to take the label and features columns as well as the original columns from df.

  1. Use the .select() method to pull these columns from transformed_df and reassign the resulting DataFrame to df.
  1. View the first five rows of the df DataFrame, using either of the methods we used in Part 2:
    • .show() method
    • .take() method and convert result to a Pandas DataFrame
  1. Randomly split the dataset into training and test sets, with 70% of the data in the training set and the remaining 30% in the test set. (A sketch of these steps appears after this list.)

    Hint: Call the .randomSplit() method.

  1. What are the sizes of the training and test sets?
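
A sketch of these steps (the `original_cols` helper name and the seed value are illustrative choices, not requirements):

```python
# Keep the new label and features columns plus the original columns.
original_cols = df.columns
df = transformed_df.select(['label', 'features'] + original_cols)
df.show(5)

# 70/30 split; a fixed seed makes the split reproducible.
train, test = df.randomSplit([0.7, 0.3], seed=2018)
print('Training set size:', train.count())
print('Test set size:', test.count())
```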

Part 5: Logistic Regression Model

  1. Fit a LogisticRegression with featuresCol as "features", labelCol as "label" and a maxIter of 10.
  1. We can obtain the coefficients by using LogisticRegressionModel’s attributes. Look at the following plot of the beta coefficients.
  1. Use the .transform() method to make predictions and save them as predictions.
  1. View the first 10 rows of the predictions DataFrame.
  1. What is the area under the ROC curve?

    You can find it with the evaluator's .evaluate() method.
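
A sketch of the modelling steps above, assuming `train` and `test` come from the split in Part 4:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Fit logistic regression on the training set.
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lrModel = lr.fit(train)

# The fitted beta coefficients are available as a model attribute.
print(lrModel.coefficients)

# Predict on the test set and look at the first 10 rows.
predictions = lrModel.transform(test)
predictions.select('label', 'prediction', 'probability').show(10)

# Area under the ROC curve (the evaluator's default metric).
evaluator = BinaryClassificationEvaluator()
print('Test AUC:', evaluator.evaluate(predictions))
```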

OPTIONAL: Hyperparameter Tuning a Gradient-Boosted Tree Classifier

  1. Fit and make predictions using GBTClassifier. The syntax will match what we did above with LogisticRegression.
  1. Run some cross validation to compare different parameters.

    Note that it can take a while because it's training many gradient-boosted trees. Give it at least 10 minutes to complete.
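
A sketch of what this could look like, reusing `train`, `test`, and `evaluator` from the sketches above; the parameter grid is a small, hypothetical example:

```python
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Fit a gradient-boosted tree classifier with the same column conventions.
gbt = GBTClassifier(featuresCol='features', labelCol='label', maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
print('GBT test AUC:', evaluator.evaluate(predictions))

# Cross-validate over a small (hypothetical) parameter grid.
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=5)
cvModel = cv.fit(train)      # this is the slow part
cvPredictions = cvModel.transform(test)
print('Cross-validated GBT test AUC:', evaluator.evaluate(cvPredictions))
```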

Acknowledgements

This notebook is adapted from Machine Learning with PySpark and MLlib