Creation of baseline model for GRI

This notebook loads data and formats it for ingestion by several initial models. Baseline performance is established for a prediction of future Glycemic Risk Index (GRI). Specifically, users are classified by whether their GRI is above 40 in a future 2-week period, based on the previous 2 weeks of daily-aggregated CGM data. Note that GRI has been correlated to clinicians' ratings of glycemia in this paper.

Import statements

Data loading

Quick look at variables we have in user info file

Generate additional variables from user info

Create a variable for the length of Glooko app use, create has_other_count (boolean), and one-hot encode APPLICATION_NAME and OS (to separate those categorical variables into individual booleans). Note that APPLICATION_NAME is the application used to upload glucose data, while OS is the operating system used for the upload.
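The derived variables described above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's own code: only APPLICATION_NAME, OS, and has_other_count come from the text, while FIRST_UPLOAD, LAST_UPLOAD, OTHER_COUNT, and days_of_use are hypothetical names. It uses `pandas.get_dummies` for the one-hot step; the notebook may use a different encoder.

```python
import pandas as pd

# Hypothetical user-info rows; FIRST_UPLOAD, LAST_UPLOAD and OTHER_COUNT
# are assumed column names used only for illustration.
user_info = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "FIRST_UPLOAD": pd.to_datetime(["2021-01-01", "2021-06-15", "2022-01-01"]),
        "LAST_UPLOAD": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-03-01"]),
        "OTHER_COUNT": [0, 3, 1],
        "APPLICATION_NAME": ["app_a", "app_b", "app_a"],
        "OS": ["ios", "android", "ios"],
    }
).set_index("ID")

# Length of app use in days (assumed definition: last upload minus first upload).
user_info["days_of_use"] = (user_info["LAST_UPLOAD"] - user_info["FIRST_UPLOAD"]).dt.days

# Boolean flag: does the user have any "other" count at all?
user_info["has_other_count"] = user_info["OTHER_COUNT"] > 0

# One boolean column per category value, e.g. OS_ios and OS_android.
encoded = pd.get_dummies(user_info, columns=["APPLICATION_NAME", "OS"], dtype=bool)
print(encoded.columns.tolist())
```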

Quick look at variables we have in daily stats file

Generate additional (time series) variables from daily stats

Note that some users don't have 100% active time. (And some users have over 100% active time.) For this analysis we will assume the data we do have is representative of the entire day. Under this assumption, if 5 out of 100 readings for the day are below 70, we take the user to have spent 5% of the entire day below range. In this section we will change the units of the ABOVE* and BELOW* variables from 'counts' to '% of time'. Note that a typical glucose monitor records readings every five minutes. Therefore, an active time of 1.0 (100%) is equal to 288 readings (24 hr * 60 min / hr * 1 reading / 5 min = 288 readings).
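The unit change can be sketched as below, assuming the number of readings in a day is the active time multiplied by 288. The column names ACTIVE_TIME, BELOW_70, and ABOVE_180 are hypothetical stand-ins for the daily stats columns.

```python
import pandas as pd

READINGS_PER_DAY = 24 * 60 // 5  # one reading every 5 minutes -> 288

# Hypothetical daily stats: the second row has only 100 readings for the day.
daily = pd.DataFrame(
    {
        "ACTIVE_TIME": [1.0, 100 / 288],
        "BELOW_70": [12, 5],
        "ABOVE_180": [72, 20],
    }
)

# Number of readings actually recorded that day.
n_readings = daily["ACTIVE_TIME"] * READINGS_PER_DAY

# Counts -> percent of time, assuming recorded readings represent the whole day.
count_cols = ["BELOW_70", "ABOVE_180"]
pct = daily[count_cols].div(n_readings, axis=0) * 100
print(pct.round(2))
```

Dividing by the recorded readings rather than by 288 is what encodes the "representative of the entire day" assumption: the second row reproduces the 5-out-of-100 example from the text.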

This is what the above/below count variables look like.

The output above shows the variables after the units have been changed. The variables ABOVE* and BELOW* are affected.

Now we will create the Glycemic Risk Index (GRI) variable. The GRI is defined by the formula below. Note that a larger value indicates higher glycemic risk, which is associated with negative health outcomes.
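For reference, the published GRI definition weights the percent of time in four glucose ranges and caps the result at 100: GRI = 3.0 * VLow + 2.4 * Low + 1.6 * VHigh + 0.8 * High, where VLow is below 54 mg/dL, Low is 54 to below 70 mg/dL, High is above 180 up to 250 mg/dL, and VHigh is above 250 mg/dL. The function below is a small sketch of that formula; the function and argument names are illustrative, and how these ranges map onto the notebook's ABOVE* and BELOW* columns is not specified here.

```python
def glycemic_risk_index(very_low, low, high, very_high):
    """Return the GRI from percent-of-time values in non-overlapping ranges.

    very_low:  % of time below 54 mg/dL
    low:       % of time from 54 to below 70 mg/dL
    high:      % of time above 180 and up to 250 mg/dL
    very_high: % of time above 250 mg/dL
    """
    gri = 3.0 * very_low + 2.4 * low + 1.6 * very_high + 0.8 * high
    # The index is capped at 100.
    return min(gri, 100.0)


# Example: 1% very low, 3% low, 20% high, 5% very high.
print(glycemic_risk_index(very_low=1, low=3, high=20, very_high=5))
```

If the available columns are cumulative (for example, a "below 70" count that also includes readings below 54), the narrower range has to be subtracted out before applying the weights.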

We also need to create the test (prediction) variables. We will use the average GRI for the last 14 days and the average GRI for the next (future) 14 days. Note that the GRI varies considerably from day to day, which is why we use average values to get a sense of either 'improvement' or 'worsening' of glycemic risk on a longer time scale.
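One way to build the two averages is a per-user rolling mean, shifted backwards for the future window. The sketch below assumes the past window includes the current day and the future window starts the day after; GRI_avg2wk appears later in the notebook, while GRI_avg2wk_future and the synthetic data are hypothetical.

```python
import pandas as pd

WINDOW = 14  # days in each averaging window

# Synthetic single-user series: GRI simply increases by 1 each day.
gri = pd.DataFrame(
    {
        "ID": [1] * 42,
        "DATE": pd.date_range("2022-01-01", periods=42, freq="D"),
        "GRI": list(range(42)),
    }
)

by_user = gri.sort_values(["ID", "DATE"]).groupby("ID")["GRI"]

# Past average: the current day plus the 13 days before it.
gri["GRI_avg2wk"] = by_user.transform(lambda s: s.rolling(WINDOW).mean())

# Future average: the 14 days after the current day, obtained by shifting
# the rolling mean back by one full window.
gri["GRI_avg2wk_future"] = by_user.transform(
    lambda s: s.rolling(WINDOW).mean().shift(-WINDOW)
)

print(gri.iloc[[13, 27]])
```

Rows without a full window on either side come out as NaN, so only dates with 14 days of history and 14 days of follow-up are usable.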

The figure above shows how the GRI variables (daily, and 2-week past and future averages) change over a month-long period.

The figure below shows that for each of the days listed, every user in the dataset (a total of 4130 users) has data.

The figure above shows the distribution of future 2-week average GRI for every user on a specific date. As shown below, the mean GRI is 39 and median is 36.

Also include variables from the user info data file

The index of X is the user id. We will join the data together on ID.

Convert daily variables into structured data format

In this section, we take past data from the last 2 weeks and place it into new variables all in the same row.
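A common way to do this reshaping is to add one lagged column per past day and then keep a single row per user at a reference date. The sketch below uses 3 lags instead of 14 to stay short; the GRI_lag* names, the reference date, and the data are hypothetical.

```python
import pandas as pd

N_LAGS = 3  # the notebook uses the last 2 weeks; 3 keeps the example small

daily = pd.DataFrame(
    {
        "ID": [1] * 5 + [2] * 5,
        "DATE": list(pd.date_range("2022-01-01", periods=5, freq="D")) * 2,
        "GRI": [10, 20, 30, 40, 50, 5, 6, 7, 8, 9],
    }
).sort_values(["ID", "DATE"])

# For each user, shift the series so that day t-k sits on the row of day t.
lagged = {
    f"GRI_lag{k}": daily.groupby("ID")["GRI"].shift(k) for k in range(1, N_LAGS + 1)
}
wide = daily.assign(**lagged)

# Keep one row per user at the chosen reference date, indexed by user id.
reference_date = pd.Timestamp("2022-01-05")
X = wide[wide["DATE"] == reference_date].set_index("ID")
print(X)
```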

The insulin variables have NaNs. For now we will drop them and replace them with a variable indicating whether insulin data was recorded. In the future, we would like to use this data because it contains interesting information.
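The replacement can be sketched as below. The INSULIN_* column names and has_insulin_data are hypothetical, and the flag is assumed to mean "at least one insulin value was recorded".

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "GRI_avg2wk": [30.0, 55.0, 41.0],
        "INSULIN_BASAL": [12.0, np.nan, np.nan],
        "INSULIN_BOLUS": [8.0, np.nan, 4.0],
    }
)

# Assumed naming convention: insulin columns share a common prefix.
insulin_cols = [c for c in X.columns if c.startswith("INSULIN")]

# True when the user has at least one recorded insulin value.
X["has_insulin_data"] = X[insulin_cols].notna().any(axis=1)

# Drop the raw insulin columns, which contain the NaNs.
X = X.drop(columns=insulin_cols)
print(X)
```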

Create the prediction variable

Split data into train and test sets

View feature correlation

And we will remove highly correlated features
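A typical way to remove highly correlated features is to scan the upper triangle of the absolute correlation matrix and drop the later column of each pair above a cutoff. The helper name and the 0.95 threshold below are assumptions, not the notebook's settings.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.95):
    """Drop each column whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is seen once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


features = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
        "c": [5, 1, 4, 2, 3],
    }
)
reduced, dropped = drop_correlated(features)
print(dropped)
```

The correlation should be computed on the training set only, and the same columns then dropped from the test set.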

Create a couple functions for visualizing results

Check the performance of a super basic model

Here we assume users with GRI below 40 will stay below 40 and users with GRI above 40 will stay above 40 (i.e., no change in GRI in the future).
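This "no change" baseline needs no fitting: the predicted class is simply whether the past 2-week average is above 40. A minimal sketch on made-up values (GRI_avg2wk_future is a hypothetical column name):

```python
import pandas as pd

THRESHOLD = 40

df = pd.DataFrame(
    {
        "GRI_avg2wk": [25, 38, 42, 60, 39, 45],
        "GRI_avg2wk_future": [27, 43, 44, 55, 35, 38],
    }
)

y_true = df["GRI_avg2wk_future"] > THRESHOLD
y_pred = df["GRI_avg2wk"] > THRESHOLD  # persistence: predict the class stays the same

accuracy = (y_pred == y_true).mean()
crossed = (y_pred != y_true).mean()  # fraction of users who crossed the threshold
print(f"accuracy={accuracy:.3f}, crossed={crossed:.3f}")
```

By construction, the accuracy of this baseline is one minus the fraction of users who cross the threshold, so it sets the bar that any fitted model has to beat.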

Only about 14% of users cross the GRI = 40 threshold between the past and future 2-week averages. This is a small fraction of the total number of users, indicating that most users stay in a similar GRI range. It also indicates that we should expect to get most of our ML models' predictive power out of this single variable (GRI_avg2wk).

Fit a Random Forest Classifier

Fit a Logistic Regression Model

The default parameters for the random forest classifier and the logistic regression model give similar performance.
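A comparison of this kind can be sketched with scikit-learn as below. The data is synthetic (one informative "past GRI" column plus noise columns), so the scores say nothing about the real dataset; `random_state` and `max_iter` are set only for reproducibility and convergence and are otherwise departures from the defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: future GRI is past GRI plus noise.
rng = np.random.default_rng(0)
n = 400
past = rng.uniform(10, 80, n)
future = past + rng.normal(0, 8, n)
X = np.column_stack([past, rng.normal(size=(n, 4))])  # 4 uninformative columns
y = (future > 40).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its accuracy on the held-out set.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
print(scores)
```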

K-Fold validation and recursive feature elimination

As expected, most of the predictive power is derived from the first feature -- the average GRI over the past 2 weeks. This isn't surprising, since very few users cross the GRI = 40 threshold in the test dataset, as mentioned above. Although we have many features that correspond to previous days' data, none of these features appears to give a significant amount of predictive power on its own. However, if a small amount of predictive power could be gained by combining multiple features, we might expect a gradient boosted algorithm to perform a little better. If there is no significant predictive power in these other features, then a gradient boosted classifier may not provide a better prediction.
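Recursive feature elimination with k-fold validation can be sketched with scikit-learn's `RFECV`, shown here on synthetic data where only the first column is informative. The estimator, number of folds, and scoring metric are assumptions, not necessarily the notebook's settings, and the features are scaled up front only to keep the example short.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: column 0 plays the role of the past 2-week GRI average.
rng = np.random.default_rng(0)
n = 300
past = rng.uniform(10, 80, n)
X = np.column_stack([past, rng.normal(size=(n, 5))])
y = (past + rng.normal(0, 8, n) > 40).astype(int)

# Scale so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,  # remove one feature per iteration
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X_scaled, y)

print("features kept:", selector.n_features_)
print("ranking (1 = kept):", selector.ranking_)
```

A ranking of 1 marks the features retained at the best cross-validated score; larger numbers indicate features eliminated earlier.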

Gradient Boosted Classifier

The gradient boosted classifier does not provide any gain over use of logistic regression or random forest classifier ML models. This suggests that the ability to predict future GRI above 40 may be a trivial problem, largely solved by looking at past GRI, or that we have not found features which contain the information needed to create such a prediction.

A quick look at how much GRI changes

We'll look at the users that crossed the threshold of GRI = 40, and at users overall.

As shown above, for the 1726 users (out of 4130) that cross the GRI = 40 threshold, the average change is small (1.4). However, some users do have large changes in GRI! It will be interesting to try to capture something about those users. Below, we plot a similar histogram of the change in GRI but for all users.

Using autoML to find an optimized model

Using TPOT to find an optimized model

Concluding remarks

  1. Glycemic risk depends on many factors and can vary widely over short time periods because the glucose measurements underlying the GRI calculation are highly variable.
  2. Several base ML models were created to predict whether the average Glycemic Risk Index (GRI) for a future 2-week period would be above 40, based on the prior 2 weeks of CGM data. These models all gave an accuracy in the low to mid-80% range.
  3. TPOT was also used to find an optimal model (runtime ~1 hour) via autoML algorithms.

Future work