Walmart, the retail giant that operates a chain of hypermarkets, wants to understand its weekly sales data, especially the impact of holidays and big events on weekly sales: specifically, the Super Bowl, Labor Day, Thanksgiving, and Christmas. In addition, Walmart wants to consider the effect of different macroeconomic/external factors.
At the end of this session, you will know how to use pandas, matplotlib, and seaborn to explore and visualize the data, and sklearn to extract insights.
Note: if you see code that's unfamiliar to you, look up the documentation and try to understand what it does.
The original sales data were collected from 45 stores across the United States; for this session, you will first inspect data from three stores and later focus on just Store 1.
Each store is of a certain type and size, and has multiple departments.
The dataset has a temporal component; we mostly ignore this in this session and will discuss time-series techniques later in the cohort.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # allow multiple outputs in a cell
import warnings
warnings.filterwarnings("ignore")
Built on top of numpy
, pandas
is one of the essential tools in machine learning. Its rich features are used for exploring, cleaning, visualizing, and transforming data. We need to import the library to access all of its features.
import pandas as pd
Use pd.read_csv
to read train_comb.csv
that contains weekly sales, metadata, and macroeconomic features from three stores into a pd.DataFrame
.
filepath = '../dat/train_comb.csv'
data = pd.read_csv(filepath)
Verify that the data is loaded correctly by running data.head(3)
to see the first few rows ( AVOID printing out the entire DataFrame, i.e., data
or print(data)
; it might be harmless for a small dataset, but it can crash your kernel when the dataset is big and slow down the initial data exploration process ).
data.head(3)
Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | A | 151315 |
1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | A | 151315 |
2 | 1 | 1 | 2010-02-19 | 41595.55 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | A | 151315 |
Look at the output to get an idea of what each column is and then write a few sentences describing what you notice about the data. You can also use data.sample(3)
to draw random samples from the data (hints: number of rows and columns, any missing values? data types of the elements? date ranges of the data collected? etc.).
Acceptable responses include the number of rows and columns in the dataset, the data types of the elements, how many NaNs there are (and perhaps which columns and/or rows tend to have them), the range of values in each column or other descriptive statistics, some commentary on what this data represents, any initial concerns about how you think we should model this data, or any other commentary you would like to add.
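If helpful, a quick way to gather most of these facts is sketched below, using only standard pandas calls on the DataFrame loaded above:
# Quick overview: shape, dtypes, non-null counts, and missing values per column
data.info()
data.isna().sum()
data['Date'].min(), data['Date'].max()  # date range; the ISO-formatted date strings sort correctly even before conversion to datetime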
Use .shape
to inspect the size of the data: sample size and number of features.
data.shape
(30990, 16)
For the following tasks, we focus on Store 1 only:
data_store1 = data.loc[data['Store']==1]
data_store1.shape
(10244, 16)
Retrieve the data from department 9 ( a random choice ) at store 1:
data_store1_dept9 = data_store1[data_store1.Dept == 9]
Verify the result using .head()
, .shape
.
data_store1_dept9.head()
data_store1_dept9.shape
Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1144 | 1 | 9 | 2010-02-05 | 16930.99 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | A | 151315 |
1145 | 1 | 9 | 2010-02-12 | 16562.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | A | 151315 |
1146 | 1 | 9 | 2010-02-19 | 15880.85 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | A | 151315 |
1147 | 1 | 9 | 2010-02-26 | 15175.52 | False | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | A | 151315 |
1148 | 1 | 9 | 2010-03-05 | 24064.70 | False | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | A | 151315 |
(143, 16)
Visualize one full year of sales. The data comes with dates already sorted, but we can make sure of that and then visualize the first 52 data points.
data_store1_dept9 = data_store1_dept9.sort_values('Date')
data_store1_dept9[['Date', 'Weekly_Sales']].iloc[:52]\
.set_index('Date').plot(rot=90);
Do you have any hypotheses about the holidays' impact on the sales?
Sales spiked around holidays, e.g., Halloween, Christmas, and New Year.
For the rest of the notebook, we focus on the sales data from Store 1, stored in the DataFrame df
and saved in train-store1.csv
. Let's read in the data.
df = pd.read_csv("../dat/train-store1.csv")
Extract week, month, and year information from the raw Date
column to better manipulate the weekly data later. Pandas comes with powerful features to make this step easy. Reference: tutorial.
First, use .dtypes
to check the datatype of the Date
column. What's the difference between df[['Date']]
and df['Date']
?
df[['Date']].dtypes
Date object dtype: object
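One way to see the difference for yourself (a small sketch):
# df['Date'] (single brackets) returns a Series; df[['Date']] (double brackets) returns a one-column DataFrame
type(df['Date'])
type(df[['Date']])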
df.Date=pd.to_datetime(df.Date)
Verify that the Date
column's datatype has changed as expected:
df[['Date']].dtypes
Date datetime64[ns] dtype: object
df['week'] = df.Date.dt.week
df['month'] = df.Date.dt.month
df['year'] = df.Date.dt.year
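If you are on a newer pandas release where Series.dt.week has been removed, an equivalent for the week column (assuming pandas >= 1.1) is:
# .dt.week was deprecated and later removed; isocalendar().week is the replacement
df['week'] = df.Date.dt.isocalendar().week.astype(int)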
Verify that now there are 19 columns in df
:
df.shape
(10244, 19)
Last step before we look deeper into the features is to split the data set into training and testing datasets. Discuss: why do we want to perform EDA only on the training data, not the entire dataset? Shouldn't it be the more the better?
The answer should mention data leakage and/or overfitting.
Split the data into training (80%) and test (20%) datasets. Use the function train_test_split
from scikit-learn
( a popular library for machine learning in Python ), and set random_state
to 42 for reproducibility ( this is not the best way to do a train-test split due to the temporal nature of the data; however, we will ignore that for now ).
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.20, random_state = 42)
print('Original set ---> ',df.shape,
'\nTraining set ---> ',df_train.shape,
'\nTesting set ---> ', df_test.shape)
Original set ---> (10244, 19) Training set ---> (8195, 19) Testing set ---> (2049, 19)
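As an aside, a time-aware split (a sketch only; we keep the random split above for this session, and the variable names below are hypothetical) would hold out the most recent weeks instead of a random sample:
# Hypothetical chronological split: train on earlier weeks, test on the most recent 20%
df_sorted = df.sort_values('Date')
cutoff = int(len(df_sorted) * 0.8)
df_train_time, df_test_time = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]
df_train_time.shape, df_test_time.shape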
We inspected the datatype of the column Date
; now find datatypes for all columns in df_train
using .dtypes
:
df_train.dtypes
Store int64 Dept int64 Date datetime64[ns] Weekly_Sales float64 IsHoliday bool Temperature float64 Fuel_Price float64 MarkDown1 float64 MarkDown2 float64 MarkDown3 float64 MarkDown4 float64 MarkDown5 float64 CPI float64 Unemployment float64 Type object Size int64 week int64 month int64 year int64 dtype: object
Summary statistics provide you with a general understanding of the data. Use the method .describe()
. By default it reports the count, mean, standard deviation, min, max, and quartiles for numerical features; pass include='all' (or include='object') to also get the count, number of unique values, and mode for categorical features.
pd.options.display.float_format = "{:,.2f}".format
df_train.describe()
Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | week | month | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 2,931.00 | 2,424.00 | 2,878.00 | 2,931.00 | 2,931.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 | 8,195.00 |
mean | 1.00 | 44.65 | 21,865.28 | 68.19 | 3.22 | 8,045.43 | 2,961.55 | 1,236.83 | 3,683.59 | 5,023.69 | 216.00 | 7.61 | 151,315.00 | 25.89 | 6.47 | 2,010.96 |
std | 0.00 | 29.95 | 27,970.00 | 14.16 | 0.43 | 6,484.49 | 8,032.30 | 7,830.99 | 5,849.69 | 3,303.07 | 4.33 | 0.38 | 0.00 | 14.19 | 3.25 | 0.80 |
min | 1.00 | 1.00 | -863.00 | 35.40 | 2.51 | 410.31 | 0.50 | 0.25 | 8.00 | 554.92 | 210.34 | 6.57 | 151,315.00 | 1.00 | 1.00 | 2,010.00 |
25% | 1.00 | 20.00 | 3,502.09 | 57.79 | 2.76 | 4,039.39 | 40.48 | 6.00 | 577.14 | 3,127.88 | 211.57 | 7.35 | 151,315.00 | 14.00 | 4.00 | 2,010.00 |
50% | 1.00 | 38.00 | 10,357.32 | 69.64 | 3.29 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 215.46 | 7.79 | 151,315.00 | 26.00 | 6.00 | 2,011.00 |
75% | 1.00 | 72.00 | 31,647.36 | 80.48 | 3.59 | 10,121.97 | 1,569.00 | 101.64 | 3,639.42 | 6,222.25 | 220.64 | 7.84 | 151,315.00 | 38.00 | 9.00 | 2,012.00 |
max | 1.00 | 99.00 | 203,670.47 | 91.65 | 3.91 | 34,577.06 | 46,011.38 | 55,805.51 | 32,403.87 | 20,475.32 | 223.44 | 8.11 | 151,315.00 | 52.00 | 12.00 | 2,012.00 |
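To also summarize the non-numeric columns (a small sketch), pass include to .describe():
# Summarize the object (string) columns: count, unique, top, freq
df_train.describe(include='object')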
Inspect the output. What are some of your observations?
Are there any missing values? Use .isna()
and .sum()
to show the number of missing values from each column.
df_train.isna().sum()
Store 0 Dept 0 Date 0 Weekly_Sales 0 IsHoliday 0 Temperature 0 Fuel_Price 0 MarkDown1 5264 MarkDown2 5771 MarkDown3 5317 MarkDown4 5264 MarkDown5 5264 CPI 0 Unemployment 0 Type 0 Size 0 week 0 month 0 year 0 dtype: int64
What do you think the target variable is in this problem? Assign the column name to target
for later use.
target = 'Weekly_Sales'
Visualize the distribution of the target variable using distplot()
from the library seaborn
( Why seaborn? Check out a comparison between Matplotlib and Seaborn here ). Is there anything you observe here that the output from .describe
does not make obvious? Does it follow a normal distribution?
import seaborn as sns
sns.distplot(df_train[target],bins=10);
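Note that distplot is deprecated in recent seaborn releases; on newer versions an equivalent plot (a sketch) is:
# histplot with a KDE overlay replaces distplot in seaborn >= 0.11
sns.histplot(df_train[target], bins=10, kde=True);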
Notice that there exist nonpositive weekly sales. How many rows have weekly sales that are negative or zero?
(df_train[target] <= 0).sum() # Expected Output: 13
13
What percentage of the rows have negative or zero sales?
(df_train[target] <= 0).sum()/df_train.shape[0]
# Expected Output: 0.0015863331299572911 or 0.16%
0.0015863331299572911
After communicating your findings, the stakeholders confirm that you can remove these data entries for now, and that they are launching an investigation with analysts and data engineers.
Now remove them from the training dataset.
mask = df_train[target] > 0
df_train = df_train[mask]
df_train.shape # Expected Output: (8182, 19)
(8182, 19)
Let's move on to features.
Though almost all the features come as numerical, should they all be treated as numerical features? Let's inspect the number of unique values:
[(col, df[col].nunique()) for col in df_train.columns]
[('Store', 1), ('Dept', 77), ('Date', 143), ('Weekly_Sales', 10042), ('IsHoliday', 2), ('Temperature', 143), ('Fuel_Price', 137), ('MarkDown1', 51), ('MarkDown2', 41), ('MarkDown3', 49), ('MarkDown4', 51), ('MarkDown5', 51), ('CPI', 143), ('Unemployment', 12), ('Type', 1), ('Size', 1), ('week', 52), ('month', 12), ('year', 3)]
Temperature
, CPI
, Unemployment
, and Fuel_Price
are continuous. These tie to the second business objective; let us put these four into a list and store it in external_factors
. From earlier, we noticed that the MarkDown
columns contain some missing values; we will treat them in a later task.
external_factors = ['Temperature','CPI','Unemployment', 'Fuel_Price']
Visualize Temperature in a box plot. What do you think is the advantage of a box plot over a histogram? You can use pd.DataFrame.boxplot()
, set the figure size to (6, 4), and turn off the grid.
df.boxplot(column=['Temperature'], grid=False, figsize=(6,4))
<AxesSubplot:>
df.boxplot(column=external_factors, grid=False, figsize=(6,4))
<AxesSubplot:>
# Expected Output:
Let's visualize all four numerical features in both density plot and box plot. Note any observations.
import matplotlib.pyplot as plt
print('\033[1mNumeric Features Distribution'.center(100))
figsize = (12, 4)
n=len(external_factors)
colors = ['g', 'b', 'r', 'y', 'k']
# histogram
plt.figure(figsize=figsize)
for i in range(len(external_factors)):
plt.subplot(1,n,i+1)
sns.distplot(df_train[external_factors[i]],
bins=10,
color = colors[i])
plt.tight_layout();
# boxplot
plt.figure(figsize=figsize)
for i in range(len(external_factors)):
plt.subplot(1,n,i+1)
df_train.boxplot(external_factors[i], grid=False)
plt.tight_layout();
Numeric Features Distribution
We will investigate the impacts from external factors later. Now let's scan through the other features.
Store
, Type
, and Size
each have only one unique value and therefore offer no information; we can safely ignore them.
We extracted year
, month
, and week
from Date
, thus Date
is redundant; but it is easy to find the date range in the training dataset using Date
:
df_train['Date'].min(), df_train['Date'].max() # Expected Output: (Timestamp('2010-02-05 00:00:00'), Timestamp('2012-10-26 00:00:00'))
(Timestamp('2010-02-05 00:00:00'), Timestamp('2012-10-26 00:00:00'))
Our training data ranges from 5th of February 2010 to 26th of October 2012.
It makes more sense to treat year
, month
, week
as categorical, more accurately ordinal; and the boolean feature IsHoliday
can be considered categorical, as can Dept
. Let's put these column names into a list categoricalFeatures
.
categoricalFeatures = ['year','month','week','IsHoliday', 'Dept']
For the categorical features, we are more interested in the frequency of each value. Use pd.Series.value_counts
to see how many rows have IsHoliday
true and false, respectively ( data imbalance is the norm ).
df_train.IsHoliday.value_counts()
False 7586 True 596 Name: IsHoliday, dtype: int64
Visualize the distribution of month
; use sns.countplot()
.
sns.countplot(x = 'month', data=df_train)
<AxesSubplot:xlabel='month', ylabel='count'>
#Visualising the categorical features
print('\033[1mVisualising Categorical Features:'.center(100))
plt.figure(figsize=(12,12))
for i in range(len(categoricalFeatures)):
plt.subplot(6,1,i+1)
sns.countplot(df_train[categoricalFeatures[i]])
plt.tight_layout();
Visualising Categorical Features:
Discuss with your teammate: there is less data in 2012 than in the previous two years. Did sales drop from the previous years? Does it affect what we see in the plots for month and week? Does the plot below clarify it to some degree?
plt.figure(figsize=(12, 6))
sns.lineplot(data=df_train, x="week", y="Weekly_Sales", style='year');
The first business objective is to understand the impact of holidays on weekly sales.
There is a flag provided for us: IsHoliday
. Let's calculate the average weekly sales for holiday weeks and non-holiday weeks, respectively. Use .groupby
and .mean()
. Are holiday sales higher?
#df_train.groupby(by=['IsHoliday']).mean()
df_train[['Weekly_Sales','IsHoliday']].groupby(by=['IsHoliday']).mean()
#df_train.groupby().mean()
df_train.groupby('IsHoliday')['Weekly_Sales'].mean()
Weekly_Sales | |
---|---|
IsHoliday | |
False | 21,756.05 |
True | 23,737.05 |
IsHoliday False 21,756.05 True 23,737.05 Name: Weekly_Sales, dtype: float64
But we would like to understand it at a more granular level (remember Simpson's paradox?). To save some time, the holiday date mappings have already been identified for the training data.
We create one flag for each holiday to help you analyze weekly sales by holiday type:
superbowl_mask = df_train['Date'].isin(['2010-02-12', '2011-02-11', '2012-02-10'])
laborday_mask = df_train['Date'].isin(['2010-09-10', '2011-09-09','2012-09-07'])
thanksgiving_mask = df_train['Date'].isin(['2010-11-26', '2011-11-25'])
christmas_mask = df_train['Date'].isin(['2010-12-31', '2011-12-30'])
df_train['superbowl'] = superbowl_mask
df_train['laborday'] = laborday_mask
df_train['thanksgiving'] =thanksgiving_mask
df_train['christmas'] = christmas_mask
Run the next cell to see 1) how many weekly sales records fall in the Christmas week (does it make sense? what did we not account for?) and 2) the average weekly sales stratified by whether it is a Christmas week or not:
df_train.groupby(['christmas'])\
.agg(count = ('christmas', 'size'),
avg_weekly_sales= ('Weekly_Sales','mean'))
count | avg_weekly_sales | |
---|---|---|
christmas | ||
False | 8057 | 21,921.06 |
True | 125 | 20,565.56 |
Perform the same for the other three holidays:
holidays = ['superbowl', 'laborday', 'thanksgiving', 'christmas']
for holiday in holidays:
summary_stats = df_train.groupby([holiday])\
.agg(count = (holiday, 'size'),
avg_weekly_sales= ('Weekly_Sales','mean'))
print(summary_stats)
print()
count avg_weekly_sales superbowl False 8001 21,845.80 True 181 24,311.98 count avg_weekly_sales laborday False 8007 21,884.35 True 175 22,632.78 count avg_weekly_sales thanksgiving False 8067 21,813.97 True 115 27,959.84 count avg_weekly_sales christmas False 8057 21,921.06 True 125 20,565.56
Without hypothesis testing and only by eyeballing, it seems like the Super Bowl and Thanksgiving have a positive impact on the weekly sales for Store 1 in this training dataset. Discuss with your teammate: are you surprised that sales at Walmart do not go up during Christmas? The holiday effect, if causal, shows up most during Thanksgiving weeks; is this something you expected?
We have been ignoring Dept
, let's take a look at the plot below showing the weekly sales by department in 2011.
plt.figure(figsize=(10,4))
sns.scatterplot(data=df_train[df_train.year==2011], x = 'Dept', y= target, hue='IsHoliday');
Dept 72 has unusually high weekly sales during the holiday week, but we will need more data to understand whether this is a data issue, an outlier, or a special event.
sns.lineplot(data=df_train, x="Fuel_Price", y="Weekly_Sales");
sns.lineplot(data=df_train, x="Temperature", y="Weekly_Sales");
sns.lineplot(data=df_train, x="CPI", y="Weekly_Sales");
sns.lineplot(data=df_train, x="Unemployment", y="Weekly_Sales");
By eyeballing, do you find strong evidence that those are correlated with Walmart's weekly sales? Do you think lineplot
is an appropriate plot for this?
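A scatter plot with a fitted trend line may make any relationship, or lack thereof, easier to judge; here is a sketch using seaborn's regplot:
# Scatter each weekly observation against the external factor, with a simple linear fit overlaid
fig, axes = plt.subplots(1, len(external_factors), figsize=(16, 4))
for ax, col in zip(axes, external_factors):
    sns.regplot(data=df_train, x=col, y='Weekly_Sales', scatter_kws={'alpha': 0.2}, line_kws={'color': 'r'}, ax=ax)
plt.tight_layout();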
Lastly, we calculate the Spearman correlations among the target and the external factors and verify that there is no strong monotonic correlation between the target variable and these features.
plt.figure(figsize=(6, 6))
df_train_reduced = df_train[[target] + external_factors]
corr = df_train_reduced.corr(method='spearman')
heatmap = sns.heatmap(corr.sort_values(by=target, ascending=False),
vmin=-1, vmax=1, annot=True, fmt='.1g', cmap='BrBG')
heatmap.set_title('Features Correlating with Weekly Sales', fontdict={'fontsize':12}, pad=16);
The heatmap provides insight into the correlations among these features. During feature selection, we can eliminate highly correlated features: in this case the correlation coefficient between CPI and Fuel_Price is 0.7, and between Unemployment and CPI it is -0.6. We can keep only one feature from each such pair for training, since including features with a high correlation coefficient is redundant.
"Feature Engineering encapsulates various data engineering techniques such as selecting relevant features, handling missing data, encoding the data, and normalizing it. It is one of the most crucial tasks and plays a major role in determining the outcome of a model." Ref.
One part of feature engineering is to create new features from the given data, as the thanksgiving
column earlier was derived from Date
. Common techniques for tabular data include adding summary statistics of the numerical features such as the mean and standard deviation, creating new features from the interaction of multiple features, etc. In this task, however, we will work on handling missing data, normalizing numerical features, and encoding categorical features.
First, missing data. Missing value treatment is crucial, yet not trivial. Take a read of Tackling Missing Value in Dataset for a detailed explanation. Features with nulls or wrong values (e.g., negative fuel price) need to be imputed or removed.
From earlier steps, we observed that only the markdown columns contain missing values, yet we do not have more information on what those values represent.
df_train.columns[df_train.isna().sum() != 0]
Index(['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], dtype='object')
For each column, find out what percentage of the data is missing:
md_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
for col in md_cols:
    perc_missing = df_train[col].isna().sum() / df_train[col].shape[0]  # fraction of rows missing in this column
    print(f'{col}: {perc_missing:.0%} is missing')
MarkDown1: 64% is missing MarkDown2: 70% is missing MarkDown3: 65% is missing MarkDown4: 64% is missing MarkDown5: 64% is missing
The majority of the markdown fields are missing. This is where, again, we need to communicate with the stakeholders to understand what the data measure and how the data were collected, and then determine our strategy from there. Here, for simplicity, we impute all missing values with the median of each column. Use .fillna()
to impute the missing values.
# Impute missing values with the column medians (works fine at this data size)
# On newer pandas versions, use df_train.median(numeric_only=True) to skip the non-numeric columns
df_train = df_train.fillna(df_train.median())
df_train.isna().sum()
Store 0 Dept 0 Date 0 Weekly_Sales 0 IsHoliday 0 Temperature 0 Fuel_Price 0 MarkDown1 0 MarkDown2 0 MarkDown3 0 MarkDown4 0 MarkDown5 0 CPI 0 Unemployment 0 Type 0 Size 0 week 0 month 0 year 0 superbowl 0 laborday 0 thanksgiving 0 christmas 0 dtype: int64
(df_train.isna().sum() != 0).sum() # sanity check: 0
0
Visualize the distributions of those markdown fields after imputation. Are they normal? They are not normal distributions.
plt.figure(figsize=figsize)
for i in range(len(md_cols)):
plt.subplot(1,len(md_cols),i+1)
sns.distplot(df_train[md_cols[i]],
hist_kws=dict(linewidth=2),
bins=10,
color = colors[i])
plt.tight_layout();
# Next: inspect the outliers (are they valid?) and assess their importance
What distribution do these data follow? We can explore this with the fitter library.
from fitter import Fitter, get_common_distributions, get_distributions
plt.figure(figsize=figsize)
for i in range(len(md_cols)):
plt.subplot(1,len(md_cols),i+1)
f = Fitter(df_train[md_cols[i]].values,
distributions=['gamma',
'lognorm',
"beta",
"burr",
"norm"])
f.fit()
f.summary()
plt.tight_layout();
#f = Fitter(df_train[md_cols[2]].values)
#f.fit()
#f.summary()
Note that missing values are different from outliers. Outliers, on the other hand, are feature values that are rare in nature. They can unnecessarily skew the data and cause problems for modeling. Outlier treatment involves removing or imputing such values. One popular approach to identify outliers is the IQR rule; that is, data points that lie more than 1.5 times the IQR above Q3 (third quartile) or below Q1 (first quartile) are outliers. Take a read of Detecting and Treating Outliers. We will leave it as an optional exercise for you to identify outliers using the IQR and replace them with the median.
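A sketch of that optional exercise, applied to Weekly_Sales only and on a copy (so the rest of the notebook is unaffected), might look like:
# Replace Weekly_Sales outliers (1.5 * IQR rule) with the column median
q1, q3 = df_train['Weekly_Sales'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df_train['Weekly_Sales'] < q1 - 1.5 * iqr) | (df_train['Weekly_Sales'] > q3 + 1.5 * iqr)
df_capped = df_train.copy()
df_capped.loc[outlier_mask, 'Weekly_Sales'] = df_train['Weekly_Sales'].median()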
# identify outliers using IQR
df_train.describe()
Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | week | month | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 | 8,182.00 |
mean | 1.00 | 44.66 | 21,900.35 | 68.19 | 3.22 | 6,828.42 | 973.07 | 454.66 | 2,487.43 | 4,575.73 | 216.00 | 7.61 | 151,315.00 | 25.91 | 6.47 | 2,010.96 |
std | 0.00 | 29.97 | 27,978.36 | 14.16 | 0.43 | 3,981.30 | 4,556.17 | 4,679.52 | 3,611.66 | 2,004.60 | 4.33 | 0.38 | 0.00 | 14.19 | 3.25 | 0.80 |
min | 1.00 | 1.00 | 0.02 | 35.40 | 2.51 | 410.31 | 0.50 | 0.25 | 8.00 | 554.92 | 210.34 | 6.57 | 151,315.00 | 1.00 | 1.00 | 2,010.00 |
25% | 1.00 | 20.00 | 3,527.13 | 57.79 | 2.76 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 211.57 | 7.35 | 151,315.00 | 14.00 | 4.00 | 2,010.00 |
50% | 1.00 | 38.00 | 10,373.65 | 69.64 | 3.29 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 215.46 | 7.79 | 151,315.00 | 26.00 | 6.00 | 2,011.00 |
75% | 1.00 | 72.00 | 31,666.22 | 80.48 | 3.59 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 220.64 | 7.84 | 151,315.00 | 38.00 | 9.00 | 2,012.00 |
max | 1.00 | 99.00 | 203,670.47 | 91.65 | 3.91 | 34,577.06 | 46,011.38 | 55,805.51 | 32,403.87 | 20,475.32 | 223.44 | 8.11 | 151,315.00 | 52.00 | 12.00 | 2,012.00 |
# Calculate Q1, Q3, and the IQR of Weekly_Sales, then keep only rows within 1.5 * IQR of the quartiles
Q1, Q3 = df['Weekly_Sales'].quantile([0.25,0.75])
IQR = Q3 - Q1
df_filtered = df_train.query('(@Q1 - 1.5 * @IQR) <= Weekly_Sales <= (@Q3 + 1.5 * @IQR)')
df_filtered.shape
df_train.shape
(7695, 23)
(8182, 23)
#outliers percentage
olp = 100* (df_train.shape[0] - df_filtered.shape[0])/df_train.shape[0]
print("outliers percentage is ", round(olp,2), "%")
outliers percentage is 5.95 %
df_train.join(df_filtered, rsuffix='_df_filtered').boxplot()
<AxesSubplot:>
We can check out the distribution of the filtered data, which no longer includes the outliers.
plt.figure(figsize=figsize)
for i in range(len(md_cols)):
plt.subplot(1,len(md_cols),i+1)
sns.distplot(df_filtered[md_cols[i]],
hist_kws=dict(linewidth=2),
bins=10,
color = colors[i])
plt.tight_layout();
Then we can check the distributions of the outliers; we need to investigate them against business goals or with domain knowledge to understand them better.
# Outliers: Weekly_Sales below the lower fence or above the upper fence (no rows fall below the lower fence here)
df_ole = df_train.query('Weekly_Sales < (@Q1 - 1.5 * @IQR) or Weekly_Sales > (@Q3 + 1.5 * @IQR)')
df_ole.describe()
Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | week | month | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 | 487.00 |
mean | 1.00 | 83.57 | 105,977.39 | 67.91 | 3.26 | 7,047.70 | 1,015.99 | 636.11 | 2,674.84 | 4,618.88 | 216.41 | 7.58 | 151,315.00 | 25.78 | 6.47 | 2,011.04 |
std | 0.00 | 20.80 | 26,097.23 | 14.07 | 0.42 | 4,479.65 | 4,370.99 | 5,626.98 | 3,958.21 | 2,087.97 | 4.33 | 0.39 | 0.00 | 14.32 | 3.27 | 0.79 |
min | 1.00 | 5.00 | 73,460.08 | 35.40 | 2.51 | 410.31 | 0.50 | 0.25 | 8.00 | 554.92 | 210.34 | 6.57 | 151,315.00 | 1.00 | 1.00 | 2,010.00 |
25% | 1.00 | 90.00 | 81,580.79 | 57.36 | 2.83 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 211.67 | 7.35 | 151,315.00 | 14.00 | 4.00 | 2,010.00 |
50% | 1.00 | 92.00 | 103,724.16 | 69.31 | 3.35 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 215.73 | 7.74 | 151,315.00 | 25.00 | 6.00 | 2,011.00 |
75% | 1.00 | 94.50 | 125,745.54 | 80.43 | 3.62 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 221.21 | 7.84 | 151,315.00 | 37.00 | 9.00 | 2,012.00 |
max | 1.00 | 95.00 | 203,670.47 | 91.65 | 3.91 | 34,577.06 | 46,011.38 | 55,805.51 | 32,403.87 | 20,475.32 | 223.44 | 8.11 | 151,315.00 | 52.00 | 12.00 | 2,012.00 |
sns.distplot(df_ole[target])
<AxesSubplot:xlabel='Weekly_Sales', ylabel='Density'>
sns.scatterplot(data=df_ole[df_ole.year==2011], x = 'Dept', y= target, hue='IsHoliday');
df_ole['Dept'].unique()
array([94, 90, 38, 95, 92, 93, 5, 72, 7, 91])
df_ole.groupby(by=['Dept']).sum().sort_values(by=['Store'], ascending=False)
Store | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | week | month | year | superbowl | laborday | thanksgiving | christmas | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dept | ||||||||||||||||||||
95 | 122 | 14,728,369.80 | 10 | 8,304.16 | 392.93 | 837,327.37 | 131,607.70 | 63,191.29 | 322,919.49 | 567,357.05 | 26,351.86 | 928.42 | 18460430 | 3200 | 799 | 245337 | 3 | 3 | 2 | 2 |
92 | 117 | 15,817,584.55 | 10 | 8,163.62 | 380.05 | 817,194.75 | 124,652.98 | 62,735.69 | 290,884.83 | 542,574.48 | 25,316.28 | 887.37 | 17703855 | 3093 | 769 | 235289 | 3 | 3 | 2 | 2 |
90 | 100 | 8,381,253.47 | 8 | 6,812.87 | 330.56 | 686,372.40 | 116,417.67 | 6,067.00 | 258,023.81 | 461,459.54 | 21,659.99 | 757.50 | 15131500 | 2503 | 630 | 201110 | 3 | 3 | 1 | 1 |
38 | 70 | 5,987,639.00 | 4 | 4,599.01 | 223.54 | 525,951.74 | 45,614.65 | 4,762.87 | 217,720.97 | 341,461.78 | 15,111.91 | 530.77 | 10592050 | 1548 | 398 | 140768 | 2 | 2 | 0 | 0 |
93 | 55 | 4,427,566.85 | 7 | 3,771.89 | 184.87 | 414,201.03 | 55,972.40 | 59,312.55 | 152,129.69 | 243,611.09 | 11,967.60 | 412.84 | 8322325 | 1525 | 385 | 110619 | 2 | 3 | 2 | 0 |
91 | 7 | 526,231.95 | 2 | 526.68 | 24.60 | 67,342.49 | 7,666.63 | 495.04 | 32,445.38 | 35,364.36 | 1,551.49 | 49.50 | 1059205 | 162 | 42 | 14084 | 1 | 1 | 0 | 0 |
72 | 6 | 794,329.92 | 3 | 305.77 | 17.09 | 27,752.23 | 689.92 | 56,561.13 | 7,323.10 | 20,595.11 | 1,283.32 | 47.62 | 907890 | 207 | 50 | 12062 | 1 | 0 | 2 | 0 |
94 | 5 | 400,797.20 | 0 | 324.37 | 17.05 | 34,489.89 | 11,613.71 | 130.72 | 15,699.51 | 20,703.42 | 1,075.27 | 38.26 | 756575 | 67 | 18 | 10055 | 0 | 0 | 0 | 0 |
7 | 3 | 382,197.99 | 0 | 150.13 | 8.87 | 15,033.64 | 316.20 | 695.16 | 3,670.00 | 11,389.81 | 642.17 | 23.54 | 453945 | 152 | 36 | 6031 | 0 | 0 | 0 | 0 |
5 | 2 | 165,016.24 | 1 | 112.47 | 6.12 | 6,564.45 | 235.86 | 55,835.74 | 1,830.55 | 4,880.11 | 429.87 | 15.70 | 302630 | 98 | 23 | 4021 | 0 | 0 | 1 | 0 |
--Conclusion 1: outliers and departments. The IQR-based outlier analysis above tells us which departments produce these weekly-sales outliers. Five departments contribute most of them: 95, 92, 90, 38, and 93.
#explore weekly sales and weeks
plt.figure(figsize=(12, 6))
sns.lineplot(data=df_ole, x="week", y="Weekly_Sales", style='year');
plt.figure(figsize=(12, 6))
sns.lineplot(data=df_ole, x="month", y="Weekly_Sales", style='year');
categoricalFeatures
['year', 'month', 'week', 'IsHoliday', 'Dept']
#Visualising the outlier data categorical features
print('\033[1mVisualising Outlier Categorical Features:'.center(100))
plt.figure(figsize=(12,12))
for i in range(len(categoricalFeatures)):
plt.subplot(6,1,i+1)
sns.countplot(df_ole[categoricalFeatures[i]])
plt.tight_layout();
Visualising Outlier Categorical Features:
--Conclusion 2: outlier seasons. There is no obvious seasonality in these outlier data, and most outliers originate from non-holiday weeks. The counts show little variation from year to year.
Now let's see how we normalize the data. For numerical features this means scaling the features to a similar range. This step is crucial for machine learning algorithms that calculate distances between data points (e.g., read The Importance of Feature Scaling).
For this task, of the external features, let's keep Temperature since it is the most correlated with the target variable, even though that correlation is weak and negative ( feature selection ). In addition, we include one markdown field. Since neither seems to follow a normal distribution, it is safer to use MinMaxScaler
from sklearn.preprocessing
to transform the features by scaling each one to a given range (see the discussion on Normalization vs Standardization).
from sklearn.preprocessing import MinMaxScaler
numericalFeatures = ['Temperature', 'MarkDown1']
df_train_num = df_train[numericalFeatures]
df_train_num.describe() # Check the summary statistics
Temperature | MarkDown1 | |
---|---|---|
count | 8,182.00 | 8,182.00 |
mean | 68.19 | 6,828.42 |
std | 14.16 | 3,981.30 |
min | 35.40 | 410.31 |
25% | 57.79 | 6,154.14 |
50% | 69.64 | 6,154.14 |
75% | 80.48 | 6,154.14 |
max | 91.65 | 34,577.06 |
Instantiate a MinMaxScaler and fit using df_train_num
:
scaler = MinMaxScaler()
print(scaler.fit(df_train_num))
MinMaxScaler()
Now transform training data df_train_num
and store the resulting nparray in train_norm
:
train_norm = scaler.transform(df_train_num)
Verify that both columns now have minimum 0 and maximum 1.
pd.DataFrame(train_norm, columns=df_train_num.columns).describe()
Temperature | MarkDown1 | |
---|---|---|
count | 8,182.00 | 8,182.00 |
mean | 0.58 | 0.19 |
std | 0.25 | 0.12 |
min | 0.00 | 0.00 |
25% | 0.40 | 0.17 |
50% | 0.61 | 0.17 |
75% | 0.80 | 0.17 |
max | 1.00 | 1.00 |
# Expected Output:
Let's turn to categorical features. Most, if not all, Python packages for modeling do not accept strings as input; thus encoding categorical values as numerical values is a necessary step. Here, let's apply one-hot encoding on Dept
and IsHoliday
:
from sklearn.preprocessing import OneHotEncoder
categoricalFeatures = ['Dept', 'IsHoliday']
df_train_cat = df_train[categoricalFeatures]
ohe = OneHotEncoder(handle_unknown='ignore',sparse = False).fit(df_train_cat)
Transform the categorical features using the one-hot encoder ohe
.
train_ohe = ohe.transform(df_train_cat)
train_ohe.shape, df_train_cat.shape # Expected Output: ((8182, 79), (8182, 2))
((8182, 79), (8182, 2))
The number of columns explodes from 2 to 79.
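If you want to see where those 79 columns come from (a small sketch; get_feature_names_out requires scikit-learn >= 1.0, older versions expose get_feature_names instead):
# 77 Dept categories + 2 IsHoliday categories = 79 one-hot columns
ohe_cols = ohe.get_feature_names_out(categoricalFeatures)
len(ohe_cols), ohe_cols[:3], ohe_cols[-2:]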
Lastly we merge the processed numerical features with the processed categorical features using hstack
in numpy
:
import numpy as np
X_train = np.hstack([train_norm, train_ohe])
X_train.shape # sanity check: (8182, 81)
(8182, 81)
What about the test data? Yes, you need to apply the same treatments. We will spare ourselves some copy + paste + edit and see how this can be done when we introduce pipeline
next.
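For reference, applying the fitted objects to the test data manually would look roughly like this (a sketch; we only transform, never re-fit, on the test set, and X_test_prepared is a hypothetical name):
# Impute with TRAINING medians, then reuse the scaler and encoder fitted on the training data
df_test_num = df_test[numericalFeatures].fillna(df_train[numericalFeatures].median())
test_norm = scaler.transform(df_test_num)
test_ohe = ohe.transform(df_test[categoricalFeatures])
X_test_prepared = np.hstack([test_norm, test_ohe])
X_test_prepared.shape  # 2049 test rows, 2 scaled + 79 one-hot columns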
Even with fewer than 20 features in our dataset, there are many, many possible ways to preprocess the data. There is no one-size-fits-all approach; often you will find yourself experimenting with many combinations to achieve better modelling performance: Should I apply normalization or standardization? Do I remove the outliers, or should I impute them? Do I impute the missing values with the median, the mean, or 0? The answer to many of these questions is "It depends." (Have you heard of Graduate Student Descent?) That means trial and error, and it is not efficient to produce a new notebook each time you need to try something slightly different; you will get lost quickly. Pipeline is one useful tool here.
Not only does Pipeline help streamline the process and keep the code modular, it also reduces the possibility of introducing errors/bugs. In this task, we build a pipeline following the strategies used in the last task, run a simple linear regression model, and print out the model's performance. Note that there is minimal code for you to implement; the key is to understand each step.
To avoid confusion, let's read the data again directly from train-store1.csv
.
df = pd.read_csv('../dat/train-store1.csv')
df.shape
(10244, 16)
df.describe()
Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10,244.00 | 10,244.00 | 10,244.00 | 10,244.00 | 10,244.00 | 3,657.00 | 3,015.00 | 3,588.00 | 3,657.00 | 3,657.00 | 10,244.00 | 10,244.00 | 10,244.00 |
mean | 1.00 | 44.39 | 21,710.54 | 68.22 | 3.22 | 8,086.73 | 2,967.16 | 1,245.40 | 3,755.21 | 5,027.75 | 216.00 | 7.61 | 151,315.00 |
std | 0.00 | 29.87 | 27,748.95 | 14.20 | 0.43 | 6,542.42 | 7,911.85 | 7,867.46 | 5,950.68 | 3,267.92 | 4.34 | 0.38 | 0.00 |
min | 1.00 | 1.00 | -863.00 | 35.40 | 2.51 | 410.31 | 0.50 | 0.25 | 8.00 | 554.92 | 210.34 | 6.57 | 151,315.00 |
25% | 1.00 | 20.00 | 3,465.62 | 57.79 | 2.76 | 4,039.39 | 40.48 | 6.00 | 577.14 | 3,127.88 | 211.53 | 7.35 | 151,315.00 |
50% | 1.00 | 38.00 | 10,289.38 | 69.64 | 3.29 | 6,154.14 | 137.86 | 30.23 | 1,822.55 | 4,325.19 | 215.46 | 7.79 | 151,315.00 |
75% | 1.00 | 72.00 | 31,452.96 | 80.48 | 3.59 | 10,121.97 | 1,569.00 | 101.64 | 3,750.59 | 6,222.25 | 220.64 | 7.84 | 151,315.00 |
max | 1.00 | 99.00 | 203,670.47 | 91.65 | 3.91 | 34,577.06 | 46,011.38 | 55,805.51 | 32,403.87 | 20,475.32 | 223.44 | 8.11 | 151,315.00 |
Separating the target y
from the features X
:
X, y = df.drop(columns=target), df[target]
Import Pipeline
from submodule sklearn.pipeline
from sklearn.pipeline import Pipeline
Now we build a transformer for numerical features following two steps: impute the missing values with the feature median (use SimpleImputer
), followed by normalization (use MinMaxScaler
)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
numeric_features = ['CPI', 'MarkDown1']
numeric_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="median")),
("normalization", MinMaxScaler())
])
For categorical features, we apply one hot encoding OneHotEncoder
( there are many other options; see Scikit-learn documentation ):
categorical_features = ['Dept', 'IsHoliday']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
Piece together the numeric_transformer
and categorical_transformer
using ColumnTransformer
:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
Lastly, let's append the regression model to the preprocessing pipeline to complete a full prediction pipeline.
from sklearn.linear_model import LinearRegression
model = Pipeline(
steps=[("preprocessor", preprocessor), ("model", LinearRegression())]
)
The pipeline has been built! The rest is to split the data, fit the pipeline on the training set, and score it on the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Let's fit the model and score it:
model.fit(X_train, y_train)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('normalization', MinMaxScaler())]), ['CPI', 'MarkDown1']), ('cat', OneHotEncoder(handle_unknown='ignore'), ['Dept', 'IsHoliday'])])), ('model', LinearRegression())])
print("model score: %.3f" % model.score(X_test, y_test))
model score: 0.949
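The .score of a regression pipeline returns R-squared; for an error measured in dollars you could additionally compute, for example:
# Mean absolute error and root mean squared error on the test set
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
y_pred = model.predict(X_test)
print("MAE : %.2f" % mean_absolute_error(y_test, y_pred))
print("RMSE: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))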
Optional: Discuss what type of Feature Selection strategy you would use to select the features.
Can we predict the weekly sales with the given data?
data = df.drop(columns=categorical_features)
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7fb3f9bd0880>
Weekly sales do not show strong correlations with the numerical features, so we may try a tree-based method, here a random forest, to predict the sales.
X.columns
Index(['Store', 'Dept', 'Date', 'IsHoliday', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Type', 'Size'], dtype='object')
from sklearn.ensemble import RandomForestRegressor
numeric_features_2 = ['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'Unemployment']
numeric_transformer_2 = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("normalization", MinMaxScaler())])
preprocessor_2 = ColumnTransformer(transformers=[("num", numeric_transformer_2, numeric_features_2), ("cat", categorical_transformer, categorical_features)])
# Weekly_Sales is continuous, so we use RandomForestRegressor rather than RandomForestClassifier
model_2 = Pipeline(steps=[("preprocessor", preprocessor_2), ("model", RandomForestRegressor(n_estimators=80))])
model_2.fit(X_train, y_train)
print("model score: %.3f" % model_2.score(X_test, y_test))