Object Detection

Today you are a machine learning engineer on the Spatial Perception Team (SPT) at Apple. The goal is to leverage an existing object detector model to automatically detect dogs, an application of transfer learning.

The idea is that you have access to a model that you or one of your colleagues has already trained on a large and diverse set of images, but the model is not very specific for your task (dog detection). We'll do transfer learning by fine-tuning this existing model on a small dataset of dog images.

This is only a small part of the end product, the Visual Look Up feature released in iOS 15: snap a picture, identify whether an object belonging to one of five categories (art, landmarks, nature, books, and pets) is present, and, if so, highlight it with an object symbol and identify it further, e.g., the species of a plant. Examples are shown below (picture credits).

Learning Objectives

At the end of this session, you will be able to

• load a pretrained object detection model and examine an image dataset with bounding-box annotations,
• fine-tune the pretrained model on a small, task-specific dataset (transfer learning), and
• implement and use the Intersection over Union (IoU) metric to evaluate bounding-box predictions.

Task 1: Setup

  1. We will run this notebook via Google Colab to take advantage of its free GPU computing power and avoid installation pain.

    It is, however, not entirely hassle-free; for the time being, follow these three steps to ensure the notebook runs without errors (this will look familiar if you have run the demo notebook):

    • Step 1: Runtime > Disconnect and delete runtime.
    • Step 2: Make sure the runtime type is GPU: Runtime > Change runtime type.
    • Step 3: Run the following cell to install TensorFlow 2.8, as well as the compatible GPU-accelerated libraries; the solution is based on this issue. Depending on your internet connection, it could take a few minutes to finish.

Use the NVIDIA System Management Interface to check the GPU device you are on.
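
A minimal sketch of that check in Colab (the leading `!` runs a shell command):

    # show the GPU name, driver version, and current memory usage
    !nvidia-smi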

  1. Mount your Google Drive to ensure access to where you store the images.
  1. Clone the tensorflow models repository so we can make use of a model that has already been trained.
  1. Install the dependencies of the pretrained models. This can take a few minutes. A sketch of these three steps is shown below.
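
A minimal sketch of these three steps (the exact dependency-install commands in the notebook may differ):

    # step 1: mount Google Drive (will prompt for authorization)
    from google.colab import drive
    drive.mount('/content/drive')

    # step 2: clone the TensorFlow models repository
    !git clone --depth 1 https://github.com/tensorflow/models

    # step 3: install the Object Detection API and its dependencies
    %cd models/research
    !protoc object_detection/protos/*.proto --python_out=.
    !cp object_detection/packages/tf2/setup.py .
    !pip install .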

Task 2: Load and examine data

  1. Make sure to upload the folder dog_dataset containing all images and annotation files to your Google Drive, so that it is available at /content/drive/My Drive/fourthbrain/dog_dataset.

    Retrieve a list containing the names of the training images in the directory given by train_image_dir. Use os.listdir and store the result in image_names; e.g.,

    image_names[0] # dog_004.jpg
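
    A minimal sketch (train_image_dir is assumed to be defined earlier in the notebook):

        import os

        # file names of the training images (the order from os.listdir is arbitrary)
        image_names = os.listdir(train_image_dir)
        image_names[0]  # e.g., 'dog_004.jpg'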
    
  1. Function load_image_into_numpy_array() below is provided to load an image given its path.

    Examine the code and use it to read all the training images and store them in list train_images_np.
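
    A minimal sketch of the loading loop (load_image_into_numpy_array and image_names as above):

        # read every training image into a numpy array
        train_images_np = [
            load_image_into_numpy_array(os.path.join(train_image_dir, name))
            for name in image_names
        ]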

  1. What is the shape of each image (height, width, and number of color channels)?
  1. Use the following code to draw all 10 images.
  1. Recall our task is to detect where the dog is in the image.

    The object detection model we'll use spits out bounding boxes where it thinks there might be an object. For each bounding box, it also predicts which class the object belongs to, along with an associated "score", or confidence, for that prediction.

    The format for specifying bounding boxes, both in our training data and in the model outputs, is [$y_{\min}$, $x_{\min}$, $y_{\max}$, $x_{\max}$], where $y$ is the vertical position and $x$ is the horizontal position. For example, a bounding box of [0.1, 0.15, 0.8, 0.9] is one whose top-left corner is at $(y, x) = (0.1, 0.15)$ and whose bottom-right corner is at $(0.8, 0.9)$; note that in image coordinates, $y$ increases downward.

    These values range between 0 and 1 and are agnostic to the size of the image; to convert them to pixels, multiply the $x$ values by the image width and the $y$ values by the image height (in pixels), as in the sketch below.
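
    A quick sketch of the conversion (the 640x640 size here is just an assumption for illustration):

        # normalized [ymin, xmin, ymax, xmax] -> pixel coordinates
        ymin, xmin, ymax, xmax = 0.1, 0.15, 0.8, 0.9
        height, width = 640, 640  # image size in pixels (assumed)
        box_pixels = [ymin * height, xmin * width, ymax * height, xmax * width]
        # [64.0, 96.0, 512.0, 576.0]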

    We will use function visualize_boxes_and_labels_on_image_array in the object detection visualization utilities to display the bounding boxes overlaid on an image.
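
    A rough sketch of that call (category_index, which maps class ids to display names, and the boxes/classes/scores arguments are assumptions; the notebook's exact arguments may differ):

        from object_detection.utils import visualization_utils as viz_utils

        # draws boxes and labels in place on the numpy image
        viz_utils.visualize_boxes_and_labels_on_image_array(
            image_np, boxes, classes, scores, category_index,
            use_normalized_coordinates=True, min_score_thresh=0.8)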

    First, though, let's read annotations (including the ground-truth bounding boxes) for these training images.

  1. Use function read_content() given below to read in all the annotations and save them in a list gt_boxes.
  1. Inspect and run the following code to create the target tensors for our model to use in training.
  1. Call function visualize_boxes_and_labels_on_image_array() to plot these bounding boxes on their respective images. Look at the resulting images, which show the dog correctly identified in each image.
  1. Follow a similar process to load the test images and save them in list test_images_np, except that each image is expected in NHWC format with N set to 1. Hint: use np.expand_dims (see the sketch after this step).

    Check Table 1. Parameters defining a convolution for what each letter means in format such as NHWC.
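
    A minimal sketch of the NHWC reshaping (test_image_dir is a placeholder name for where the test images live):

        import numpy as np

        # each image becomes a (1, H, W, C) array, i.e., a batch of size 1
        test_images_np = [
            np.expand_dims(
                load_image_into_numpy_array(os.path.join(test_image_dir, name)), axis=0)
            for name in os.listdir(test_image_dir)
        ]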

  1. Follow the same process as above to load in the bounding box annotations for the test images. Save them in t_gt_boxes.

Task 3: Load and run object detection model

  1. Download the checkpoint and put it into models/research/object_detection/test_data/.

    This model is an SSD (Single Shot MultiBox Detector) object detection model with a ResNet-50 backbone and a feature pyramid network, i.e., a RetinaNet. You can choose other models, such as YOLO, Faster R-CNN, etc.

    Note. For simplicity, a number of things in this notebook are hardcoded for the specific RetinaNet architecture at hand, including assuming that the image size will always be 640x640.

    Another note. TensorFlow Hub is now the repository for trained machine learning models (see the TensorFlow Hub Object Detection Colab tutorial); however, it seems that TensorFlow Hub models are not fine-tunable; see issue.
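
    A sketch of the download step, assuming the SSD ResNet-50 FPN 640x640 checkpoint from the TF2 Detection Model Zoo (the exact archive name may differ):

        # download and unpack the pretrained checkpoint, then move it into place
        !wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
        !tar -xf ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
        !mv ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint models/research/object_detection/test_data/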

  1. Run the code below to load the pretrained model.

    One reason it is fairly complex is that the pretrained model requires backward compatibility with TensorFlow 1.0, and there is some added complexity in managing all the different pretrained models in this particular repository. For documentation on the simpler and cleaner semantics of saving and loading model checkpoints in TensorFlow 2.0, see this documentation.

  1. Use the following code to make some predictions before fine-tuning. By default, the model will generate 100 candidate objects, each with an associated score and predicted class.

    We utilize the detect() function, which wraps the preprocessing, prediction, and postprocessing steps with the tf.function decorator so that this computation enjoys faster performance (see this tutorial for more context).
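
    A sketch of the pattern (detection_model is the model loaded above; the preprocess/predict/postprocess signatures follow the TF Object Detection API):

        import tensorflow as tf

        @tf.function
        def detect(input_tensor):
            """Run the model on a (1, H, W, C) float tensor and return detections."""
            preprocessed_image, shapes = detection_model.preprocess(input_tensor)
            prediction_dict = detection_model.predict(preprocessed_image, shapes)
            return detection_model.postprocess(prediction_dict, shapes)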

  1. You can see that it makes some bad predictions (though the correct prediction is one of the possibilities). Let's just display its most confident prediction.

Task 4: Implement computation of the Intersection over Union (IoU) metric

  1. The Intersection over Union is one way to measure how good a bounding box prediction is.

    Complete the bounding_box_iou() function below. You will need to complete these steps in the code (a sketch follows this list):

    • Determine the coordinates of the intersection rectangle.
    • Compute the area of the intersection rectangle.
    • Compute the area of both the prediction and ground-truth rectangles.
    • Compute and return the intersection over union by taking the intersection area and dividing it by the sum of the prediction and ground-truth areas minus the intersection area.
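
    A minimal sketch of these steps, assuming both boxes use the [ymin, xmin, ymax, xmax] format described in Task 2 (the notebook's argument names may differ):

        def bounding_box_iou(box_a, box_b):
            """IoU of two boxes in [ymin, xmin, ymax, xmax] format."""
            # coordinates of the intersection rectangle
            ymin = max(box_a[0], box_b[0])
            xmin = max(box_a[1], box_b[1])
            ymax = min(box_a[2], box_b[2])
            xmax = min(box_a[3], box_b[3])
            # area of the intersection (zero if the boxes do not overlap)
            intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
            # areas of the prediction and ground-truth rectangles
            area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
            area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
            # IoU = intersection / union
            return intersection / (area_a + area_b - intersection)
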
  1. Use the compute_best_iou_and_score() function to find the best IoU and scores of the test images.

    Remember the ground-truth boxes for the test images are stored in t_gt_boxes, the predicted bounding boxes are stored in pre_ft_bb_preds, and the prediction scores per bounding box are stored in pre_ft_scores_preds.

Task 5: Fine-tune the model

  1. The variables that we can train are located in the .trainable_variables attribute of the model detection_model. How many variables are there?
  1. Print the names of all the trainable variables.
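
    A quick sketch of both checks (detection_model is the model loaded in Task 3):

        # number of trainable variables
        print(len(detection_model.trainable_variables))
        # names of all trainable variables
        for variable in detection_model.trainable_variables:
            print(variable.name)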

We're going to fine-tune the WeightSharedConvolutionalBoxPredictor layer only. Don't worry about why specifically this layer for the purposes of this tutorial. When you fine-tune your own models, picking which parts to fine-tune is a combination of the inductive bias you impose and the result of hyperparameter optimization.

  1. Complete the get_model_train_step_function() function (a sketch follows this list).

    • Use the model's .predict() method to generate predictions and save as prediction_dict.
    • Use the model's .loss() method and save as losses_dict.
    • Make total_loss equal to the sum of localization_loss and classification_loss from the losses dictionary.
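
    A sketch of the core of the train step, following the TF Object Detection API's eager fine-tuning tutorial (batch_size and the 640x640 image size are assumptions from this notebook; the loss dictionary keys are those used by the API):

        import tensorflow as tf

        def get_model_train_step_function(model, optimizer, vars_to_fine_tune):
            """Return a tf.function that runs one training step."""

            @tf.function
            def train_step_fn(image_tensors, groundtruth_boxes_list, groundtruth_classes_list):
                shapes = tf.constant(batch_size * [[640, 640, 3]], dtype=tf.int32)
                model.provide_groundtruth(
                    groundtruth_boxes_list=groundtruth_boxes_list,
                    groundtruth_classes_list=groundtruth_classes_list)
                with tf.GradientTape() as tape:
                    preprocessed_images = tf.concat(
                        [model.preprocess(t)[0] for t in image_tensors], axis=0)
                    # predictions and losses, as in the steps above
                    prediction_dict = model.predict(preprocessed_images, shapes)
                    losses_dict = model.loss(prediction_dict, shapes)
                    total_loss = (losses_dict['Loss/localization_loss']
                                  + losses_dict['Loss/classification_loss'])
                gradients = tape.gradient(total_loss, vars_to_fine_tune)
                optimizer.apply_gradients(zip(gradients, vars_to_fine_tune))
                return total_loss

            return train_step_fn
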
  1. Set the tuning parameters. We give reasonable values, but feel free to adjust them and see how it affects the convergence of the model.
  1. Complete the following code by filling in the optimizer and train_step_fn (see the sketch after this list).

    • Instantiate an SGD optimizer using the learning rate, and a momentum of 0.9 and save it as optimizer.
    • Call the get_model_train_step_function() function to create train_step_fn.
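
    A minimal sketch (learning_rate comes from the tuning-parameter step above; to_fine_tune is a placeholder name for the list of variables selected for fine-tuning):

        # SGD with momentum, then build the train step
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
        train_step_fn = get_model_train_step_function(
            detection_model, optimizer, to_fine_tune)
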
  1. Run the following code to do the fine-tuning.
  1. Now look at the bounding boxes for our fine-tuned model.
  1. Compute the mean IoU and scores for the fine-tuned model following the same steps as in Task 4. The performance should be much better than the original model's.

References & Acknowledgements