
Import modules, authenticate Google Drive access

1. Data exploration

1.1 Import data

1.1.1 Null values exploration

1.1.2 Unique values and conversion to numerical values

Some interesting observations

  1. Only ten months of data are present. Which months are missing?
  2. Six special days. Which holidays are represented here?
  3. 311 unique related-product values. Are these all new products? What are their purchase rates?
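As a minimal sketch of the conversion step, assuming pandas and a toy frame standing in for the session data (the column names and category labels here are assumptions modeled on the features discussed above):

```python
import pandas as pd

# Toy frame standing in for the session data; column names are assumptions.
df = pd.DataFrame({
    "Month": ["June", "Nov", "Dec", "June"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other", "Returning_Visitor"],
    "Weekend": [True, False, False, True],
    "Revenue": [False, True, False, True],
})

# Inspect unique values before converting.
uniques = {col: sorted(df[col].astype(str).unique()) for col in df.columns}

# Convert booleans and categories to numeric codes.
df["Weekend"] = df["Weekend"].astype(int)
df["Revenue"] = df["Revenue"].astype(int)
df["VisitorType"] = df["VisitorType"].astype("category").cat.codes
```

In the notebook the same mapping would be applied to the full DataFrame after the null-value check.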

1.1.3 Split data into train and test sets

Training data covers June to December; test data covers Feb.-March.
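The month-based split could be sketched like this, again with a toy frame in place of the real data (the month labels and column names are assumptions):

```python
import pandas as pd

# Toy stand-in; in the notebook the split is on the dataset's Month column.
df = pd.DataFrame({
    "Month": ["Feb", "Mar", "June", "Jul", "Nov", "Dec"],
    "Revenue": [0, 1, 0, 1, 1, 0],
})

train_months = ["June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
test_months = ["Feb", "Mar"]

train = df[df["Month"].isin(train_months)].reset_index(drop=True)
test = df[df["Month"].isin(test_months)].reset_index(drop=True)
```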

1.2 Use different feature types to describe the data.

1.3 Plot numeric feature distributions to check for anomalies or correlations.

1.4 Are there any correlations between revenue and other_factors or cat_features?

January and April data are missing

Do the same for test data

  1. Month shows peak patterns.
  2. Visitor type 1 has more sessions.
  3. Fewer weekend sessions.

1.5 Split data before feature engineering to avoid data leakage

2. Feature engineering

2.1 Visualize correlations, remove highly correlated features

Do the same for the test data
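One common way to do this step, sketched on synthetic data (the 0.9 threshold and column names are assumptions): compute the absolute correlation matrix, keep its upper triangle, and drop one column of each highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "c": rng.normal(size=200),
})

# Upper triangle of absolute correlations; drop one column of each pair > 0.9.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
```

The same `to_drop` list would then be applied to the test DataFrame so both splits keep identical columns.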

2.2 Visualization of selected features vs. revenue

Informational duration shows very few positive (Revenue = yes) samples. The other features show more or less the same pattern for the yes and no classes. This class imbalance should be taken into consideration before training models.

2.3 Scaling: normalize Administrative duration, Informational duration, and the other numerical features so they are on the same scale.

Do the same for test data
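A minimal scaling sketch with scikit-learn's StandardScaler on synthetic numbers (the shapes are placeholders for the duration columns): fit on the training split only, then reuse the fitted scaler for the test split, which is what "do the same for test data" amounts to.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))  # e.g. the duration columns
X_test = rng.normal(loc=50, scale=10, size=(20, 3))

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same transform reused on test data
```

Fitting on the training split only keeps test statistics out of the preprocessing, consistent with the leakage concern in section 1.5.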

3. Data subsampling
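A simple form of subsampling for the imbalanced Revenue classes is random undersampling of the majority class; here is a sketch on a toy frame (the 15/85 split and `random_state` are assumptions):

```python
import pandas as pd

# Toy imbalanced frame: 15 positive, 85 negative sessions.
df = pd.DataFrame({"x": range(100), "Revenue": [1] * 15 + [0] * 85})

minority = df[df["Revenue"] == 1]
majority = df[df["Revenue"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)
```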

4. Classification models

4.1 Logistic regression

Test with subsampling

Test without subsampling
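The logistic-regression baseline can be sketched on synthetic imbalanced data (make_classification stands in for the session features; the class weights and sizes are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the session features (~85% negative).
X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Note that accuracy alone is misleading here: a classifier predicting all-negative already scores about 0.85, which is why the ROC comparison in section 5 matters.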

4.2 SVM

SVM with sub-sampling

SVM without subsampling
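The SVM with/without subsampling comparison could look like this on synthetic data (the subsampling takes equal numbers of each class from the training set; all sizes are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, weights=[0.85], random_state=0)
X_tr, y_tr, X_te, y_te = X[:450], y[:450], X[450:], y[450:]

# Without subsampling: train on the full imbalanced training set.
svm_full = SVC().fit(X_tr, y_tr)
rec_full = recall_score(y_te, svm_full.predict(X_te))

# With subsampling: equal numbers of each class from the training set.
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0][: len(pos)]
idx = np.concatenate([pos, neg])
svm_sub = SVC().fit(X_tr[idx], y_tr[idx])
rec_sub = recall_score(y_te, svm_sub.predict(X_te))
```

Comparing recall on the positive class, rather than accuracy, shows what subsampling buys on the minority class.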

4.3 Random Forest and Feature importance

Test with subsampling
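Random Forest exposes per-feature importances directly; a sketch on synthetic features (sizes and parameters are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; rank features from most to least important.
ranking = np.argsort(rf.feature_importances_)[::-1]
```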

4.4 Conclusion:

5. Performance check for models

5.1 ROC test for three different classifiers with and without subsampling

Subsampling performance

Without subsampling performance

Training on the full imbalanced data gives a better result than training on the subsampled data.
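The three-classifier ROC comparison can be sketched via ROC AUC on synthetic data (model choices mirror sections 4.1-4.3; `probability=True` is needed for SVC to expose predicted probabilities):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "rf": RandomForestClassifier(random_state=0),
}
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
```

Running the same loop on the subsampled and full training sets gives the two AUC tables being compared here.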

5.2 How does the Random Forest n_estimators parameter affect performance?
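A sweep over n_estimators with cross-validation, sketched on synthetic data (the grid of values is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

scores = {}
for n in [10, 50, 100]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = cross_val_score(rf, X, y, cv=3).mean()
```

Plotting `scores` against `n` shows where adding trees stops paying off.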

5.3 Conclusion:

6. Clustering with KMeans, toward a semi-supervised model

6.1 Plot inertia against the number of clusters to determine the 'elbow'.
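The elbow computation, sketched on synthetic blobs (the k range and blob parameters are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia drops as k grows; the bend ("elbow") suggests a good cluster count.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)]
```

Plotting `inertias` against k = 1..7 gives the elbow curve used for the conclusion below.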

6.2 Visualize different cluster numbers for different features

6.3 Hierarchical visualization of the cluster structure

6.4 Conclusion: KMeans clustering shows the best cluster behavior when the number of clusters equals 4

7. Semi-supervised learning.

Use June-September data to predict October-December revenue.

Use normalized values for numeric features so outliers do not dominate the scale.

7.1 Set training data to June-Sept., test data to Oct.-Dec., and performance-test data to Feb.-March

7.2 Use Random Forest to train the model and predict semi_test revenue for Oct.-Dec.

7.2.1 Use SVM to predict the revenue for Oct.-Dec. Concatenate the predicted revenue with the rest of the data to form a new training set. We concluded that SVM shows 100% recall for the true class, so we may use it to label our data.
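The self-labeling pipeline described above can be sketched end to end on synthetic data (the month groups are simulated by index slices; all sizes are assumptions): an SVM trained on the labeled months pseudo-labels the unlabeled months, and a Random Forest is retrained on the combined set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, weights=[0.8], random_state=0)
# Stand-ins: "June-Sept." labeled, "Oct.-Dec." unlabeled, "Feb.-March" held out.
X_lab, y_lab = X[:200], y[:200]
X_unlab = X[200:400]
X_val, y_val = X[400:], y[400:]

# Label Oct.-Dec. with an SVM trained on the labeled months.
pseudo = SVC().fit(X_lab, y_lab).predict(X_unlab)

# Retrain Random Forest on labeled + self-labeled data, validate on Feb.-March.
X_all = np.vstack([X_lab, X_unlab])
y_all = np.concatenate([y_lab, pseudo])
rf = RandomForestClassifier(random_state=0).fit(X_all, y_all)
val_acc = rf.score(X_val, y_val)
```

Comparing `val_acc` against a forest trained on the labeled months alone is exactly the contrast drawn in sections 7.2.2-7.2.4.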

7.2.2 Performance test on validation data with self-labeled data, using Random Forest

7.2.3 Performance of the original data on validation data

7.2.4 Performance using only June-September data

7.3 Conclusion: