Fuel efficiency prediction

Provided with the classic Auto MPG dataset, we will predict the fuel efficiency of late-1970s and early-1980s automobiles, leveraging features such as cylinders, displacement, horsepower, and weight.

It is a very small dataset with only a few features. We will first build a linear model and a neural network and evaluate their performance, and then leverage an automated machine learning (AutoML) library called TPOT to see how it can be used to search over many ML model architectures.

Learning Objectives

By the end of this session, you will be able to build and evaluate a linear regression model and a deep neural network with tf.keras, and leverage the AutoML library TPOT to search over many ML model architectures.

Note: Kaggle's State of Data Science and Machine Learning 2021 shows that the most commonly used algorithms were linear and logistic regressions, followed closely by decision trees, random forests, and gradient boosting machines (are you surprised?). Multilayer perceptrons, or artificial neural networks, are not yet popular tools for tabular/structured data; see the papers Deep Neural Networks and Tabular Data: A Survey and Tabular Data: Deep Learning is Not All You Need for more technical reasons. For this assignment, the main purpose is for you to get familiar with the basic building blocks of constructing neural networks before we dive into more specialized neural network architectures.

Task 1. Data: Auto MPG dataset

  1. The dataset is available from the UCI Machine Learning Repository. First download and import the dataset using pandas:
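    One way to do this, following the TensorFlow regression tutorial this notebook adapts (the UCI URL and column names below come from that tutorial):

    ```python
    import pandas as pd

    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
    column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                    'Weight', 'Acceleration', 'Model Year', 'Origin']

    # The raw file is whitespace-separated and marks unknown values with '?'
    raw_dataset = pd.read_csv(url, names=column_names,
                              na_values='?', comment='\t',
                              sep=' ', skipinitialspace=True)
    dataset = raw_dataset.copy()
    dataset.tail()
    ```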
  1. The dataset contains a few unknown values; we drop those rows to keep this initial tutorial simple. Use pd.DataFrame.dropna():
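    For example:

    ```python
    # Only a handful of rows (with missing Horsepower) are dropped
    dataset = dataset.dropna()
    ```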
  1. The "Origin" column is categorical, not numeric. So the next step is to one-hot encode the values in the column with pd.get_dummies.
  1. Split the data into training and test sets. To reduce module-import overhead, use pd.DataFrame.sample() instead of sklearn.model_selection.train_test_split() to set aside 80% of the data as train_dataset; set the random state to 0 for reproducibility.

    Then use pd.DataFrame.drop() to obtain the test_dataset.
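    For example:

    ```python
    # 80% of the rows for training; the remaining rows (by index) for testing
    train_dataset = dataset.sample(frac=0.8, random_state=0)
    test_dataset = dataset.drop(train_dataset.index)
    ```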

  1. Review the joint distribution of a few pairs of columns from the training set.
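    One way is seaborn's pairplot; the columns below are just one reasonable subset:

    ```python
    import seaborn as sns

    sns.pairplot(train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']],
                 diag_kind='kde')
    ```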

    The top row suggests that the fuel efficiency (MPG) is a function of all the other parameters. The other rows indicate they are functions of each other.

Let's also check the overall statistics. Note how each feature covers a very different range:
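For example:

```python
train_dataset.describe().transpose()
```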

  1. Split features from labels

    Separate the target value—the "label"—from the features. This label is the value that you will train the model to predict.
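    A minimal sketch using pd.DataFrame.pop():

    ```python
    train_features = train_dataset.copy()
    test_features = test_dataset.copy()

    # pop() removes the 'MPG' column and returns it as the label series
    train_labels = train_features.pop('MPG')
    test_labels = test_features.pop('MPG')
    ```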

Task 2. Normalization Layer

It is good practice to normalize features that use different scales and ranges. Although a model might converge without feature normalization, normalization makes training much more stable.

Similar to scikit-learn, tensorflow.keras offers a list of preprocessing layers so that you can build and export models that are truly end-to-end.

  1. The Normalization layer (tf.keras.layers.Normalization) is a clean and simple way to add feature normalization into your model. The first step is to create the layer:
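    For example (the imports are reused in the tasks that follow):

    ```python
    import numpy as np
    import tensorflow as tf

    normalizer = tf.keras.layers.Normalization(axis=-1)
    ```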
  1. Then, fit the state of the preprocessing layer to the data by calling Normalization.adapt:
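    For example:

    ```python
    # Compute each feature's mean and variance from the training data
    # (cast to float32 so the one-hot bool columns become 0.0/1.0)
    normalizer.adapt(np.array(train_features, dtype='float32'))
    ```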

We can see the feature mean and variance are stored in the layer:
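One way to inspect them:

```python
print('Means:', normalizer.mean.numpy())
print('Variances:', normalizer.variance.numpy())
```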

When the layer is called, it returns the input data, with each feature independently normalized:
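A quick check on the first training example:

```python
first = np.array(train_features[:1], dtype='float32')
print('First example:', first)
print('Normalized:', normalizer(first).numpy())
```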

Task 3. Linear regression

Before building a deep neural network model, start with linear regression using all the features.

Training a model with tf.keras typically starts by defining the model architecture. Use a tf.keras.Sequential model, which represents a sequence of steps.

There are two steps in this multivariate linear regression model:

  1. Normalize the input features using the normalization layer defined above.
  2. Apply a linear transformation ($y = mx + b$) to produce one output using tf.keras.layers.Dense.

The number of inputs can either be set by the input_shape argument, or automatically when the model is run for the first time.

  1. Build the Keras Sequential model:
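    For example:

    ```python
    linear_model = tf.keras.Sequential([
        normalizer,
        tf.keras.layers.Dense(units=1)   # one output: predicted MPG
    ])
    ```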
  1. This model will predict 'MPG' from all features in train_features. Run the untrained model on the first 10 data points / rows using Model.predict(). The output won't be good, but notice that it has the expected shape of (10, 1):
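    For example:

    ```python
    # Predictions from an untrained model; only the shape (10, 1) matters here
    linear_model.predict(train_features[:10])
    ```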
  1. When you call the model, its weight matrices will be built—check that the kernel weights (the $m$ in $y = mx + b$) have a shape of (9, 1):
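    For example:

    ```python
    # layers[0] is the normalizer; layers[1] is the Dense layer
    linear_model.layers[1].kernel   # shape (9, 1)
    ```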
  1. Once the model is built, configure the training procedure using the Keras Model.compile method. The most important arguments to compile are the loss and the optimizer, since these define what will be optimized and how (here, using tf.keras.optimizers.Adam).

    Here's a list of built-in loss functions in tf.keras.losses. For regression tasks, common loss functions include mean squared error (MSE) and mean absolute error (MAE). Here, MAE is preferred because it makes the model more robust to outliers.

    For optimizers, gradient descent (check the video Gradient Descent, Step-by-Step for a refresher) is the preferred way to optimize neural networks and many other machine learning algorithms. Read An overview of gradient descent optimization algorithms for several popular gradient descent variants. Here, we use the popular tf.keras.optimizers.Adam and set the learning rate to 0.1 for faster learning.
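    Putting it together:

    ```python
    linear_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
        loss='mean_absolute_error')
    ```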

  1. Use Keras Model.fit to execute the training for 100 epochs; set verbose to 0 to suppress logging, and keep 20% of the data for validation:
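    For example:

    ```python
    history = linear_model.fit(
        train_features,
        train_labels,
        epochs=100,
        verbose=0,             # suppress per-epoch logging
        validation_split=0.2)  # hold out 20% of the training data
    ```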
  1. Visualize the model's training progress using the stats stored in the history object:

Use plot_loss(history) provided to visualize the progression in loss function for training and validation data sets.
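The provided helper is along these lines (a sketch; the exact axis limits may differ in your notebook):

```python
import matplotlib.pyplot as plt

def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Error [MPG]')
    plt.legend()
    plt.grid(True)

plot_loss(history)
```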

  1. Collect the results on the test set for later comparison using Model.evaluate():
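    For example, stored in a dict for Task 5:

    ```python
    test_results = {}
    test_results['linear_model'] = linear_model.evaluate(
        test_features, test_labels, verbose=0)
    ```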

Task 4. Regression with a deep neural network (DNN)

You just implemented a linear model for multiple inputs. Now, you are ready to implement multiple-input DNN models.

The code is very similar except the model is expanded to include some "hidden" non-linear layers. The name "hidden" here just means not directly connected to the inputs or outputs.

  1. Include the model and compile method in the build_and_compile_model function below.
  1. Create a DNN model with normalizer (defined earlier) as the normalization layer:
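    One possible completion, mirroring the original tutorial (two hidden ReLU layers of 64 units each are an assumption, not the only valid choice):

    ```python
    def build_and_compile_model(norm):
        model = tf.keras.Sequential([
            norm,                                          # normalization layer
            tf.keras.layers.Dense(64, activation='relu'),  # hidden layer 1
            tf.keras.layers.Dense(64, activation='relu'),  # hidden layer 2
            tf.keras.layers.Dense(1)                       # single regression output
        ])
        model.compile(loss='mean_absolute_error',
                      optimizer=tf.keras.optimizers.Adam(0.001))
        return model

    dnn_model = build_and_compile_model(normalizer)
    ```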
  1. Inspect the model using Model.summary(). This model has quite a few more trainable parameters than the linear model:
  1. Train the model with Keras Model.fit:
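    For example:

    ```python
    history = dnn_model.fit(
        train_features,
        train_labels,
        validation_split=0.2,
        verbose=0,
        epochs=100)
    ```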
  1. Visualize the model's training progress using the stats stored in the history object.

Do you think the DNN model is overfitting? What gives it away?

From the loss curves, we can see that the training error keeps decreasing while the validation error flattens out after about 40 epochs, which suggests a variance (overfitting) problem. To address it, we could increase the dataset size, simplify the model, or reduce the number of features.

  1. Let's save the results for later comparison.
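    For example:

    ```python
    test_results['dnn_model'] = dnn_model.evaluate(
        test_features, test_labels, verbose=0)
    ```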

Task 5. Make predictions

  1. Since both models have been trained, we can review their test set performance:
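    For example, as a small table:

    ```python
    pd.DataFrame(test_results, index=['Mean absolute error [MPG]']).T
    ```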

These results match the validation error observed during training.

  1. We can now make predictions with the dnn_model on the test set using Keras Model.predict and review the loss. Use .flatten() to turn the (n, 1) output into a 1-D array.
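    A sketch, including the usual predicted-vs-true scatter plot:

    ```python
    test_predictions = dnn_model.predict(test_features).flatten()

    # Points on the diagonal are perfect predictions
    plt.axes(aspect='equal')
    plt.scatter(test_labels, test_predictions)
    plt.xlabel('True Values [MPG]')
    plt.ylabel('Predictions [MPG]')
    lims = [0, 50]
    plt.xlim(lims)
    plt.ylim(lims)
    plt.plot(lims, lims)
    ```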
  1. It appears that the model predicts reasonably well. Now, check the error distribution:
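    For example:

    ```python
    error = test_predictions - test_labels
    plt.hist(error, bins=25)
    plt.xlabel('Prediction Error [MPG]')
    plt.ylabel('Count')
    ```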
  1. Save it for later use with Model.save:
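    A sketch; the '.keras' file extension is an assumption (older TF versions also accept a plain directory path for the SavedModel format):

    ```python
    dnn_model.save('dnn_model.keras')
    ```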
  1. Reload the model with tf.keras.models.load_model; it gives identical output:
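    For example:

    ```python
    reloaded = tf.keras.models.load_model('dnn_model.keras')
    test_results['reloaded'] = reloaded.evaluate(
        test_features, test_labels, verbose=0)
    ```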

Task 6. Nonlinearity

We mentioned that the relu activation function introduces non-linearity; let's visualize it. However, with six numerical features and one categorical feature, it is impossible to plot all the dimensions on a 2D plot; we need to simplify and isolate the relationship.

Note: in this task, the code is provided; the focus is on understanding.

  1. We focus on the relationship between feature Displacement and target MPG.

    To do so, create a new dataset of the same size as train_features, but with all other features set at their median values; then vary Displacement between 0 and 500.

  1. Create a plotting function to a) visualize the real Displacement vs. MPG values from the training dataset in a scatter plot, and b) overlay the predicted MPG as Displacement varies from 0 to 500, holding all other features constant.
  1. Visualize predicted MPG using the linear model.
  1. Visualize predicted MPG using the neural network model. Do you see an improvement/non-linearity from the linear model?
  1. What are the other activation functions? Check the list of activations.

    Optional. Modify the DNN model with a different activation function, and fit it on the data; does it perform better?

  1. Overfitting is a common problem for DNN models; how should we deal with it? Check Regularizers on tf.keras. What other techniques have been invented for neural networks?

We can reduce overfitting by training the network on more examples, or by changing the complexity of the network, i.e., by constraining

  1. the network structure (the number of weights), or
  2. the network parameters (the values of the weights).

In practice, this means techniques such as regularization, dropout, and early stopping.

Task 7. AutoML - TPOT

  1. Instantiate and train a TPOT AutoML regressor.

    The parameters are set fairly arbitrarily (if time permits, you should experiment with different sets of parameters after reading what each parameter does). Use these parameter values:

    generations: 10

    population_size: 40

    scoring: negative mean absolute error; read more in scoring functions in TPOT

    verbosity: 2 (so you can see each generation's performance)
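    A sketch with those values (random_state=0 is an assumption for reproducibility; the feature/label names follow the earlier tasks):

    ```python
    from tpot import TPOTRegressor

    tpot = TPOTRegressor(generations=10,
                         population_size=40,
                         scoring='neg_mean_absolute_error',
                         verbosity=2,
                         random_state=0)  # assumption, for reproducibility
    tpot.fit(train_features, train_labels)
    tpot.export('tpot_mpg_pipeline.py')
    ```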

    The final line will create a Python script tpot_mpg_pipeline.py with the code to create the optimal model found by TPOT.

  1. Examine the model pipeline that the TPOT regressor offers. If you see any model, function, or class that is not familiar, look it up!

    Note: There is randomness in the way TPOT searches, so it's possible you won't get exactly the same result as your classmates.

  1. Optional. Take the appropriate lines (e.g., updating the path to the data and the variable names) from tpot_mpg_pipeline.py to build a model on our training set and make predictions on the test set. Save the predictions as y_pred and compute an appropriate evaluation metric. You may find that, for this simple dataset, the neural network we built outperforms the tree-based model, but note that this conclusion cannot be generalized to all tabular data.

Additional Resources

Acknowledgement and Copyright

Acknowledgement

This notebook is adapted from the TensorFlow/Keras tutorial on regression.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

MIT License

Copyright (c) 2017 François Chollet

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.