EDA with Walmart Sales Data

Business Objectives

Walmart, the retail giant that operates a chain of hypermarkets, wants to understand its weekly sales data, especially the impact of holidays and big events on weekly sales; specifically, the Super Bowl, Labor Day, Thanksgiving, and Christmas. In addition, Walmart wants to consider the effect of different macroeconomic/external factors.

Learning Objectives

At the end of this session, you will know how to

  1. Manipulate data of different types using pandas
  2. Visualize data with matplotlib and seaborn to extract insights
  3. Build a pipeline to preprocess data and fit a simple model using sklearn

Note: if you see code that's unfamiliar to you, look up the documentation and try to understand what it does.

Data Overview

Task I: Load Data

Built on top of numpy, pandas is one of the essential tools in machine learning. Its rich feature set is used for exploring, cleaning, visualizing, and transforming data. We need to import the library to access these features.

Use pd.read_csv to read train_comb.csv that contains weekly sales, metadata, and macroeconomic features from three stores into a pd.DataFrame.
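A minimal sketch of this step, assuming train_comb.csv sits in the working directory and the DataFrame is named data, as in the checks below:

```
import pandas as pd

# Load weekly sales, metadata, and macroeconomic features for three stores
data = pd.read_csv("train_comb.csv")
```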

Verify that the data is loaded correctly by running data.head(3) to see the first few rows ( AVOID printing out the entire DataFrame, i.e., data or print(data); it may seem harmless for a small dataset, but it can crash your kernel when the dataset is big and slow down the initial data exploration ).

Look at the output to get an idea of what each column is and then write a few sentences describing what you notice about the data. You can also use data.sample(3) to draw random samples from the data (hints: number of rows and columns, any missing values? data types of the elements? date ranges of the data collected? etc.).

  1. MarkDown1 to MarkDown5 all contain null values; possibly most of these values are null.
  2. IsHoliday is a binary value.
  3. CPI values have only subtle differences; they may need normalization.

Acceptable responses include the number of rows and columns in the dataset, the data types of the elements, how many NaNs there are (and perhaps which columns and/or rows tend to have them), the range of values in each column or other descriptive statistics, some commentary on what this data represents, any initial concerns about how you think we should model this data, or any other commentary you would like to add.

Use .shape to inspect the size of the data: sample size and number of features.

Expected Output (30990, 16)

For the following task, we focus on Store 1 only.

Retrieve the data from department 9 ( a random choice ) at store 1:
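One way to do the filtering, assuming the combined DataFrame is called data and using a hypothetical name store1_dept9 for the subset:

```
# Boolean masks select Store 1 and Department 9 only
store1_dept9 = data[(data["Store"] == 1) & (data["Dept"] == 9)]

store1_dept9.head()
store1_dept9.shape   # expected: (143, 16)
```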

Verify the result using .head(), .shape.

Expected Output (143, 16)

Visualize one full year of sales. The data comes with dates sorted, but we can make sure of that and then visualize the first 52 data points.
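A sketch of the sort-and-plot step, continuing with the hypothetical store1_dept9 subset from above:

```
import matplotlib.pyplot as plt

# Ensure chronological order, then take the first 52 weeks (one year of weekly data)
store1_dept9 = store1_dept9.sort_values("Date")
first_year = store1_dept9.head(52)

plt.figure(figsize=(10, 4))
plt.plot(first_year["Weekly_Sales"].values)
plt.xlabel("Week index")
plt.ylabel("Weekly_Sales")
plt.show()
```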

Do you have any hypotheses about the holidays' impact on the sales?

Sales spiked around holidays, e.g., Halloween, Christmas, and New Year.

For the rest of the notebook, we focus on the sales data from Store 1, which is stored in the DataFrame df and saved in train_store1.csv. Let's read in the data.

Extract week, month, and year information from the raw Date column to better manipulate the weekly data later. Pandas comes with powerful features to make this step easy. Reference: tutorial.

First, use .dtypes to check the datatype of the Date column. What's the difference between df[['Date']] and df['Date']?

Expected Output
```
Date    object
dtype: object
```
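A sketch of the conversion and extraction, assuming the Store 1 data is in df as described above (whether week follows the ISO calendar depends on the original notebook):

```
# Convert the Date strings to datetime, then derive week, month, and year
df["Date"] = pd.to_datetime(df["Date"])

df["week"] = df["Date"].dt.isocalendar().week.astype(int)
df["month"] = df["Date"].dt.month
df["year"] = df["Date"].dt.year
```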

Verify that the Date column's datatype has changed as expected:

Verify that now there are 19 columns in df:

The last step before we look deeper into the features is to split the dataset into training and testing sets. Discuss: why do we want to perform EDA only on the training data, not the entire dataset? Shouldn't more data be better?

  1. To prevent data leakage. If we inspect the test data as well, the decisions we make while modeling will be influenced by it, so the test data effectively becomes part of the training data and no longer gives an unbiased estimate of performance.

The answer should mention data leakage and/or overfitting.

Split the data into training (80%) and test dataset (20%). Use function train_test_split from scikit-learn ( a popular library for machine learning in Python ), and set random_state to be 42 for reproducibility ( this is not the best way to do train-test-split due to the temporal nature of the data, however, we will ignore it for now ).
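A minimal sketch of the split, using the df_train / df_test names referenced later:

```
from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed for reproducibility
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

print("Original set --->", df.shape)
print("Training set --->", df_train.shape)
print("Testing set  --->", df_test.shape)
```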

Expected Output
```
Original set ---> (10244, 19)
Training set ---> (8195, 19)
Testing set  ---> (2049, 19)
```

Task II: Target, Features, and Distributions

We inspected the datatype of the column Date; now find the datatypes of all columns in df_train using .dtypes:

Expected Output
```
Store                    int64
Dept                     int64
Date            datetime64[ns]
Weekly_Sales           float64
IsHoliday                 bool
Temperature            float64
Fuel_Price             float64
MarkDown1              float64
MarkDown2              float64
MarkDown3              float64
MarkDown4              float64
MarkDown5              float64
CPI                    float64
Unemployment           float64
Type                    object
Size                     int64
week                     int64
month                    int64
year                     int64
dtype: object
```

Summary statistics provide you with a general understanding of the data. Use the method .describe(). By default it reports the count, mean, standard deviation, min, quartiles, and max for numerical features; for categorical features (pass include='object' or include='all') it reports the count, number of unique values, the mode, and its frequency.

Expected Output
```
        Store      Dept  Weekly_Sales  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5       CPI  Unemployment        Size      week     month      year
count  8,195.00  8,195.00      8,195.00     8,195.00    8,195.00   2,931.00   2,424.00   2,878.00   2,931.00   2,931.00  8,195.00      8,195.00    8,195.00  8,195.00  8,195.00  8,195.00
mean       1.00     44.65     21,865.28        68.19        3.22   8,045.43   2,961.55   1,236.83   3,683.59   5,023.69    216.00          7.61  151,315.00     25.89      6.47  2,010.96
std        0.00     29.95     27,970.00        14.16        0.43   6,484.49   8,032.30   7,830.99   5,849.69   3,303.07      4.33          0.38        0.00     14.19      3.25      0.80
min        1.00      1.00       -863.00        35.40        2.51     410.31       0.50       0.25       8.00     554.92    210.34          6.57  151,315.00      1.00      1.00  2,010.00
25%        1.00     20.00      3,502.09        57.79        2.76   4,039.39      40.48       6.00     577.14   3,127.88    211.57          7.35  151,315.00     14.00      4.00  2,010.00
50%        1.00     38.00     10,357.32        69.64        3.29   6,154.14     137.86      30.23   1,822.55   4,325.19    215.46          7.79  151,315.00     26.00      6.00  2,011.00
75%        1.00     72.00     31,647.36        80.48        3.59  10,121.97   1,569.00     101.64   3,639.42   6,222.25    220.64          7.84  151,315.00     38.00      9.00  2,012.00
max        1.00     99.00    203,670.47        91.65        3.91  34,577.06  46,011.38  55,805.51  32,403.87  20,475.32    223.44          8.11  151,315.00     52.00     12.00  2,012.0
```

Inspect the output, what are some of your observations?

  1. About two thirds of the values in MarkDown1 to MarkDown5 are missing.
  2. Weekly sales fluctuate widely.
  3. Weekly sales contain negative values.
  4. The Date column is absent from the output (by default .describe() only summarizes numeric columns).
  5. The data range from 2010 to 2012.

Are there any missing values? Use .isna() and .sum() to show the number of missing values from each column.
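One way to count them:

```
# Number of missing values in each column of the training data
df_train.isna().sum()
```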

Expected Output
```
Store              0
Dept               0
Date               0
Weekly_Sales       0
IsHoliday          0
Temperature        0
Fuel_Price         0
MarkDown1       5264
MarkDown2       5771
MarkDown3       5317
MarkDown4       5264
MarkDown5       5264
CPI                0
Unemployment       0
Type               0
Size               0
week               0
month              0
year               0
dtype: int64
```

What do you think the target variable is in this problem? Assign the column name to target for later use.

Visualize the distribution of the target variable using distplot() from the seaborn library ( Why seaborn? Check out a comparison between Matplotlib and Seaborn here ). Is there anything you observe here that the output from .describe does not make obvious? Does it follow a normal distribution?
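A sketch of the plot; note that distplot() is deprecated in recent seaborn releases, so histplot() with a KDE overlay is shown here as a stand-in:

```
import seaborn as sns
import matplotlib.pyplot as plt

target = "Weekly_Sales"   # the assumed target column

sns.histplot(df_train[target], kde=True)
plt.show()
```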

Notice that there exist nonpositive weekly sales. How many rows have weekly sales that are negative or zero?

What percentage of the data do these negative and zero sales make up?
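One way to answer both questions, assuming target is "Weekly_Sales":

```
# Rows with zero or negative weekly sales, as a count and as a share of the training data
nonpositive = df_train[df_train[target] <= 0]
print(len(nonpositive), "rows")
print(f"{len(nonpositive) / len(df_train):.2%} of the training data")
```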

After you communicate your findings, the stakeholders confirm that you can remove these data entries for now, and they launch an investigation by analysts and data engineers.

Now remove them from the training dataset.
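A minimal way to drop them:

```
# Keep only strictly positive weekly sales
df_train = df_train[df_train[target] > 0].copy()
```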

Let's move on to features.

Though almost all the features come as numerical, should they all be treated as numerical features? Let's inspect the number of unique values:
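For example:

```
# Number of distinct values per column; few unique values suggests a categorical feature
df_train.nunique()
```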

Temperature, CPI, Unemployment, and Fuel_Price are continuous. These tie to the second business objective, so let us put these four into a list and store it in external_factors. From earlier, we noticed that the MarkDown1-5 columns contain some missing values; we will treat them in a later task.

Visualize Temperature in a box plot. What do you think is the advantage of a box plot over a histogram? You can use pd.DataFrame.boxplot(), set the figure size to (6, 4), and turn off the grid.
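A sketch of the box plot:

```
# Box plot of Temperature: median, quartiles, and potential outliers at a glance
df_train.boxplot(column="Temperature", figsize=(6, 4), grid=False)
plt.show()
```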

Let's visualize all four numerical features in both density plots and box plots; note any observations.
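A possible loop over the four external factors:

```
# Density plot and box plot side by side for each external factor
for col in external_factors:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.kdeplot(df_train[col], ax=axes[0])
    df_train.boxplot(column=col, grid=False, ax=axes[1])
    fig.suptitle(col)
    plt.show()
```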

We will investigate the impacts from external factors later. Now let's scan through the other features.

Store, Type, and Size each have only one unique value and offer no information, so we can safely ignore them.

We extracted year, month, and week from Date, so Date is redundant; but it is easy to find the date range of the training dataset using Date:
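For example:

```
# Earliest and latest dates in the training data
df_train["Date"].min(), df_train["Date"].max()
```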

Our training data ranges from 5th of February 2010 to 26th of October 2012.

It makes more sense to treat year, month, and week as categorical (more precisely, ordinal); the boolean feature IsHoliday can also be considered categorical, as can Dept. Let's put these column names into a list categoricalFeatures.

For the categorical features, we are more interested in the frequency of each value. Use pd.Series.value_counts to see how many rows have IsHoliday equal to true and false, respectively ( data imbalance is the norm ).
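For example:

```
# How many holiday vs. non-holiday weeks are in the training data
df_train["IsHoliday"].value_counts()
```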

Expected Output
```
False    7586
True      596
Name: IsHoliday, dtype: int64
```

Visualize the distribution of month; use sns.countplot().
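A sketch of the count plot:

```
# Number of training rows per month
sns.countplot(x="month", data=df_train)
plt.show()
```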

Discuss with your teammate: there is less data in 2012 than in the previous two years. Did sales drop from previous years? Does it affect what we see in the plots for month and week? Does the plot below clarify it to some degree?

Task III: Impact from Holidays

The first business objective is to understand the impact of holidays on weekly sales.

There is a flag provided for us: IsHoliday. Let's calculate the average weekly sales for holiday weeks and non-holiday weeks, respectively. Use .groupby and .mean(). Are holiday sales higher?
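For example:

```
# Average weekly sales for non-holiday (False) and holiday (True) weeks
df_train.groupby("IsHoliday")["Weekly_Sales"].mean()
```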

Expected Output
```
IsHoliday
False    21,756.05
True     23,737.05
Name: Weekly_Sales, dtype: float64
```

But we would like to understand it at a more granular level (remember Simpson's paradox?). To save some time, the holiday date mappings have been identified for the training data.

We create one flag for each holiday to help you analyze weekly sales by holiday type, as sketched below.
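A hedged sketch of how such flags might be built; the dates below are placeholders (following the Kaggle holiday-week definition) and should be replaced by the date mapping provided in the notebook:

```
# Placeholder holiday-week dates; substitute the mapping provided for the training data
holiday_weeks = {
    "superbowl":    ["2010-02-12", "2011-02-11", "2012-02-10"],
    "laborday":     ["2010-09-10", "2011-09-09", "2012-09-07"],
    "thanksgiving": ["2010-11-26", "2011-11-25"],
    "christmas":    ["2010-12-31", "2011-12-30"],
}

# One boolean flag column per holiday
for name, dates in holiday_weeks.items():
    df_train[name] = df_train["Date"].isin(pd.to_datetime(dates))
```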

Run the next cell to see 1) how many weekly sales fell on Christmas (does it make sense? what did we not account for?) 2) what is the average weekly sales stratified by whether it is Christmas week or not?

Perform the same for the other three holidays:
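A sketch that repeats the Christmas summary for all four flags:

```
# Row count and average weekly sales stratified by each holiday flag
for flag in ["superbowl", "laborday", "thanksgiving", "christmas"]:
    summary = df_train.groupby(flag)["Weekly_Sales"].agg(
        count="count", avg_weekly_sales="mean")
    print(summary, "\n")
```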

Expected Output
```
              count  avg_weekly_sales
superbowl
False          8001         21,845.80
True            181         24,311.98

              count  avg_weekly_sales
laborday
False          8007         21,884.35
True            175         22,632.78

              count  avg_weekly_sales
thanksgiving
False          8067         21,813.97
True            115         27,959.84

              count  avg_weekly_sales
christmas
False          8057         21,921.06
True            125         20,565.56
```

Without hypothesis testing and by eyeballing alone, it seems like the Super Bowl and Thanksgiving have a positive impact on the weekly sales for Store 1 in this training dataset. Discuss with your teammate: are you surprised that sales at Walmart do not go up during Christmas? The holiday effect, if causal, is strongest during Thanksgiving weeks; is this something you expected?

We have been ignoring Dept; let's take a look at the plot below showing the weekly sales by department in 2011.

Dept 72 has unusually high weekly sales during the holiday week, but we will need more data to understand whether this is a data issue, an outlier, or a special event.

Task IV: Visualize Relationship between Macroeconomic & External Factors and Sales

By eyeballing, do you find strong evidence that these external factors are correlated with Walmart's weekly sales? Do you think a line plot is an appropriate choice for this?
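One alternative worth trying is a scatter plot of each external factor against the target, sketched here:

```
# Scatter plots often show feature-target relationships more directly than line plots
for col in external_factors:
    sns.scatterplot(x=col, y=target, data=df_train)
    plt.show()
```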

Lastly, we calculate the Spearman correlations among the target and the external factors and verify that there is no strong monotonic correlation between the target variable and these features.
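A sketch of the correlation matrix and heatmap:

```
# Spearman rank correlation matrix of the target and external factors
corr = df_train[[target] + external_factors].corr(method="spearman")

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```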

The heatmap provides insight into the correlations among these features. During feature selection, we can eliminate highly correlated features: in this case, the correlation coefficient between CPI and Fuel_Price is 0.7, and between Unemployment and CPI it is -0.6, so we can keep only one feature from each pair for training. Including highly correlated features is redundant.

Task V: Feature Engineering

"Feature Engineering encapsulates various data engineering techniques such as selecting relevant features, handling missing data, encoding the data, and normalizing it. It is one of the most crucial tasks and plays a major role in determining the outcome of a model." Ref.

One part of feature engineering is to create new features from the given data, as the thanksgiving column earlier was derived from Date. Common techniques for tabular data include adding summary statistics of the numerical features such as the mean and standard deviation, creating new features from interactions of multiple features, etc. In this task, however, we will work on handling missing data, normalizing numerical features, and encoding categorical features.

First, missing data. Missing value treatment is crucial, yet not trivial. Read Tackling Missing Value in Dataset for a detailed explanation. Features with nulls or wrong values (e.g., negative fuel price) need to be imputed or removed.

From earlier steps, we observed that only the markdown columns contain missing values, yet we do not have more information on what those values represent.

For each column, find out what percentage of the data is missing.
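One way to compute this:

```
# Share of missing values in each markdown column
markdown_cols = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]
for col in markdown_cols:
    print(f"{col}: {df_train[col].isna().mean():.0%} is missing")
```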

Expected Output
```
MarkDown1: 64% is missing
MarkDown2: 70% is missing
MarkDown3: 65% is missing
MarkDown4: 64% is missing
MarkDown5: 64% is missing
```

The majority of the markdown fields are missing. This is where, again, we need to communicate with the stakeholders to understand what the data measures and how it was collected, and then determine our strategy from there. Here, for simplicity, we impute all missing values with the median of each column. Use .fillna() to impute the missing values.
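A minimal sketch of the imputation:

```
# Impute missing markdown values with each column's median (computed on the training data)
df_train[markdown_cols] = df_train[markdown_cols].fillna(df_train[markdown_cols].median())
```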

Visualize the distributions of those markdown fields after imputation. Are they normal? They are not normal distributions.

What distribution do these data follow? We can explore this with the fitter library.
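A sketch using the fitter package (assuming it is installed; API as documented by the project), shown here for MarkDown1:

```
from fitter import Fitter

# Compare a few candidate distributions for MarkDown1
f = Fitter(df_train["MarkDown1"].values,
           distributions=["norm", "lognorm", "expon", "gamma"])
f.fit()
f.summary()          # ranks the candidates by goodness of fit
print(f.get_best())  # parameters of the best-fitting distribution
```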

Note that missing values are different from outliers. Outliers, on the other hand, are feature values that are rare in nature. They can unnecessarily skew the data and cause problems for modeling. Outlier treatment involves removing or imputing such values. One popular approach to identifying outliers is the IQR rule; that is, data points that lie more than 1.5 times the IQR above Q3 (third quartile) or below Q1 (first quartile) are flagged as outliers. Read Detecting and Treating Outliers. We will leave it as an optional exercise for you to identify outliers using the IQR rule and replace them with the median; a sketch follows below.
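A sketch of the IQR rule applied to the target:

```
# Flag Weekly_Sales values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df_train["Weekly_Sales"].quantile(0.25)
q3 = df_train["Weekly_Sales"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (df_train["Weekly_Sales"] < lower) | (df_train["Weekly_Sales"] > upper)
filtered = df_train[~is_outlier]   # data without the flagged outliers
outliers = df_train[is_outlier]    # flagged rows to investigate further
print(is_outlier.sum(), "outliers flagged")
```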

We can check the distribution of the filtered data, which excludes the outliers.

Then we can check the distributions of the outliers; we need to investigate them against business goals or domain knowledge to understand them better.

Conclusion 1: outliers and departments. The IQR-based outlier analysis above tells us which departments have these weekly sales outliers; the data from departments 95, 92, 90, and 38 contribute to them.

Conclusion 2: outlier seasonality. There is no obvious seasonality in these outliers, and most of them come from non-holiday weeks. The counts show little variation from year to year.

Now let's see how we normalize the data. For numerical features, this means scaling the features to a similar range. This step is crucial for machine learning algorithms that calculate distances between data points (e.g., read The Importance of Feature Scaling).

For this task, of the external features, let's keep Temperature since it is the most linearly correlated with the target variable, though the correlation is very weak and negative ( feature selection ). In addition, we include one markdown field. Since neither seems to follow a normal distribution, it is safer to use MinMaxScaler from sklearn.preprocessing to transform the features by scaling each to a given range ( see the discussion on Normalization vs Standardization ).

Instantiate a MinMaxScaler and fit using df_train_num:

Now transform training data df_train_num and store the resulting nparray in train_norm:

Verify that both columns now have minimum 0 and maximum 1.
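A combined sketch of the fit, transform, and sanity check, using the df_train_num and train_norm names from the text:

```
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the numerical training features only
scaler = MinMaxScaler()
scaler.fit(df_train_num)

# Transform: each column is rescaled to the [0, 1] range
train_norm = scaler.transform(df_train_num)
print(train_norm.min(axis=0), train_norm.max(axis=0))   # expect all 0s and all 1s
```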

Let's turn to categorical features. Most, if not all, Python packages for modeling do not accept strings as input; thus encoding categorical values as numerical values is a necessary step. Here, let's apply one-hot encoding on Dept and IsHoliday:

Transform the categorical features using the one-hot encoder ohe.
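A sketch of the encoding step; train_cat is a hypothetical name for the result:

```
from sklearn.preprocessing import OneHotEncoder

# One-hot encode Dept and IsHoliday; ignore categories unseen during fit
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(df_train[["Dept", "IsHoliday"]])

# The encoder returns a sparse matrix; densify it for the hstack step below
train_cat = ohe.transform(df_train[["Dept", "IsHoliday"]]).toarray()
print(train_cat.shape)
```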

The number of columns explodes from 2 to 79.

Lastly, we merge the processed numerical features with the processed categorical features using numpy's hstack:
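A minimal sketch; train_processed is a hypothetical name for the merged array:

```
import numpy as np

# Place the scaled numerical features next to the one-hot encoded categorical features
train_processed = np.hstack([train_norm, train_cat])
print(train_processed.shape)
```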

What about the test data? Yes, you need to apply the same treatments. We will spare ourselves some copy-paste-edit and see how this can be done when we introduce pipelines next.

Task VI: Pipeline

Even with fewer than 20 features in our dataset, there are many possible ways to preprocess the data. There is no one-size-fits-all approach; often you will find yourself experimenting with many combinations to achieve better modeling performance: Should I apply normalization or standardization? Do I remove the outliers or should I impute them? Do I impute the missing values with the median, the mean, or 0? The answer to many of these questions is "It depends." (Have you heard of Graduate Student Descent?) That means trial and error, and it is not efficient to produce a new notebook every time you need to try something slightly different; you will get lost quickly. Pipeline is one useful tool here.

Not only does Pipeline help streamline the process and keep the code modular, it also reduces the possibility of introducing errors/bugs. In this task, we build the pipeline following the strategies used in the last task, run a simple linear regression model, and print out the model's performance. Note that there is minimal code for you to implement; the key is to understand each step.

To avoid confusion, let's read the data again directly from train_store1.csv.

Separating the target y from the features X:
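For example, assuming the re-read DataFrame is called df and the target is Weekly_Sales:

```
# y holds the target, X everything else
y = df["Weekly_Sales"]
X = df.drop(columns=["Weekly_Sales"])
```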

Import Pipeline from submodule sklearn.pipeline

Now we build a transformer for numerical features following two steps: impute the missing values with the feature median (use SimpleImputer), followed by normalization (use MinMaxScaler)
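A sketch, mirroring the transformer that appears later in the notebook:

```
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Median imputation followed by min-max scaling for the numerical columns
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("normalization", MinMaxScaler()),
])
```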

For categorical features, we apply one hot encoding OneHotEncoder ( there are many other options; see Scikit-learn documentation ):
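A minimal sketch:

```
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns; ignore unseen categories at transform time
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
```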

Piece together the numeric_transformer and categorical_transformer using ColumnTransformer:
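A sketch, where numeric_features and categorical_features are assumed to list the chosen column names:

```
from sklearn.compose import ColumnTransformer

# Route numerical and categorical columns to their respective transformers
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
```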

Lastly, let's append the regression model to preprocessing pipeline to complete a full prediction pipeline.
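A sketch of the full pipeline with a simple linear regression model:

```
from sklearn.linear_model import LinearRegression

# Preprocessing and the regression model chained into a single estimator
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LinearRegression()),
])
```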

The pipeline has been built! The rest is to fit it on the training data and evaluate it on the test data.

Let's run the prediction.
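A sketch, assuming X and y were split into X_train/X_test and y_train/y_test as in the final cell:

```
# Fit on the training split and report R^2 on the held-out test split
model.fit(X_train, y_train)
print("model score: %.3f" % model.score(X_test, y_test))
```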

Optional: Discuss what type of Feature Selection strategy you would use to select the features.

Can we predict the weekly sales with the given data?

Weekly sales do not correlate strongly with the numerical features, so we may try a tree-based method, here a random forest, to predict the sales.

```
from sklearn.ensemble import RandomForestRegressor

numeric_features_2 = ['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
                      'MarkDown3', 'MarkDown4', 'MarkDown5', 'Unemployment']

numeric_transformer_2 = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("normalization", MinMaxScaler()),
])

preprocessor_2 = ColumnTransformer(transformers=[
    ("num", numeric_transformer_2, numeric_features_2),
    ("cat", categorical_transformer, categorical_features),
])

# Weekly_Sales is continuous, so we need a regressor (not a classifier);
# deprecated/invalid arguments such as criterion='mse' are dropped here
model_2 = Pipeline(steps=[
    ("preprocessor", preprocessor_2),
    ("model", RandomForestRegressor(n_estimators=80, n_jobs=1, random_state=None)),
])

model_2.fit(X_train, y_train)
print("model score: %.3f" % model_2.score(X_test, y_test))
```


References

  1. Original dataset from Kaggle: Walmart sales forecast datasets
  2. Notebook: cracking the Walmart sales forecasting challenge