Credit Card Fraud Exploration

This project focuses on credit card fraud. Fraud detection is one of the toughest challenges because of imbalanced data, irregular patterns that are hard to identify, missing features, and the need to score live transactions. A model that consumes live streaming data, learns from it, updates its view of transaction patterns, and identifies anomalies is pertinent in many areas.

Background: Business Objectives

The Federal Trade Commission reported receiving 2.8 million fraud reports from consumers in 2021. Consumer losses reached $5.8 billion, which is 70% higher than in 2020. Fraudsters are using more advanced techniques, such as machine learning, to target new customers, exploit online transactions, and steal identities. Many models have been proposed to improve fraud detection, including KNN, logistic regression, and SVM. For data preprocessing, under-sampling, over-sampling, and feature selection (PCA, logistic regression, SVM) have been widely used. Credit card fraud detection recall as high as 0.94 has been reported. However, the previous year's report shows that fraudulent activity keeps increasing, and fraudsters are using machine learning techniques to evade defensive machine learning algorithms. Simply labeling or defining outliers is not enough to identify attack patterns. What is needed is a platform that covers data streaming, data preprocessing (feature selection, auto-labeling, grouping), model selection, model training, model relearning based on live transaction data, and prediction. This live interaction between data and model will facilitate model selection and updating, which should in turn speed up anomaly detection.

Part I. Data Exploration

Check the basic info of the dataset.

I - A. Data wrangling

  1. Are there any null values in the dataset? If so, impute them.
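The null check and a fallback imputation might look like the sketch below. The toy frame and its column names (`V1`, `Amount`, `Class`) are assumptions standing in for the real transaction table, and median imputation is just one common choice, not necessarily the one used in this project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transactions table (hypothetical values).
df = pd.DataFrame({
    "V1": [0.1, np.nan, -1.2, 0.7],
    "Amount": [12.5, 3.0, np.nan, 80.0],
    "Class": [0, 0, 1, 0],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Impute numeric gaps with the column median, only if any are present.
if null_counts.sum() > 0:
    df_imputed = df.fillna(df.median(numeric_only=True))
else:
    df_imputed = df.copy()

print(df_imputed.isnull().sum().sum())  # total remaining nulls
```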

I - B. Imbalanced class handling

  1. How many fraud transactions are in the dataset? Use undersampling and oversampling techniques to handle the imbalanced dataset.
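As a rough illustration of both techniques, the sketch below counts the classes and then applies random undersampling and random oversampling with `sklearn.utils.resample`. The synthetic data, the `Class` label column (1 = fraud), and the choice of plain random resampling are assumptions; other methods such as SMOTE would slot in at the same step.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic, heavily imbalanced data (hypothetical): 20 fraud rows out of 1000.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1000),
    "Class": [1] * 20 + [0] * 980,
})
print(df["Class"].value_counts())

fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0]

# Undersampling: shrink the majority class to the minority size.
normal_down = resample(normal, replace=False, n_samples=len(fraud), random_state=42)
under = pd.concat([fraud, normal_down]).sample(frac=1, random_state=42)

# Oversampling: grow the minority class by sampling with replacement.
fraud_up = resample(fraud, replace=True, n_samples=len(normal), random_state=42)
over = pd.concat([normal, fraud_up]).sample(frac=1, random_state=42)

print(under["Class"].value_counts())
print(over["Class"].value_counts())
```

Resampling is normally applied to the training split only, so that the test set keeps the original class ratio.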

Good news! There are no null values in the data set.

Explore several questions:

1. What is the amount distribution of the fraud activities?

2. How does the fraud amount distribution compare with that of normal transactions?

3. Any correlations among v1 to v27?

4. What is the time distribution for fraud and normal transactions?
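A first numerical pass at these questions could look like the sketch below. It assumes columns named `Amount`, `Class`, and `Time` (with `Time` in seconds since the first transaction) plus anonymized `V` features; the data here is randomly generated for illustration only.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (hypothetical column names and values).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Time": rng.integers(0, 172800, size=500),
    "Amount": rng.exponential(scale=80.0, size=500).round(2),
    "Class": rng.choice([0, 1], size=500, p=[0.95, 0.05]),
    "V1": rng.normal(size=500),
    "V2": rng.normal(size=500),
})

# Questions 1 and 2: amount statistics for fraud (1) versus normal (0).
amount_by_class = df.groupby("Class")["Amount"].describe()
print(amount_by_class)

# Question 3: pairwise correlation among the V features.
v_corr = df[["V1", "V2"]].corr()
print(v_corr)

# Question 4: transaction counts per hour of day, split by class.
df["Hour"] = (df["Time"] // 3600) % 24
hourly_counts = df.groupby(["Class", "Hour"]).size().unstack(fill_value=0)
print(hourly_counts)
```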

Let's visualize all numerical features with both density plots and box plots. Note any observations.

Split the data into X and y.
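A minimal sketch of the split, assuming the label column is called `Class`. The stratified train/test split is an addition not stated in the notes; it keeps the rare fraud class at the same ratio in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 20 fraud rows out of 1000.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.uniform(1, 500, size=1000),
    "Class": [1] * 20 + [0] * 980,
})

# Features and target.
X = df.drop(columns="Class")
y = df["Class"]

# Stratify on y so both splits keep the same fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```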

Remove highly correlated features.
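One common way to do this is to scan the upper triangle of the absolute correlation matrix and drop one feature from each highly correlated pair. The sketch below follows that approach; the 0.9 threshold and the toy features are assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Toy features (hypothetical): V2 is a near-duplicate of V1.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = pd.DataFrame({
    "V1": base,
    "V2": base * 2.0 + rng.normal(scale=0.01, size=200),
    "V3": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated above the threshold with an earlier column.
threshold = 0.9  # assumed cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X.drop(columns=to_drop)

print(to_drop)
print(list(X_reduced.columns))
```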

Basic model selection

Make a pipeline to scale the data.
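A scaling pipeline in scikit-learn can be sketched as follows. `StandardScaler` and the logistic regression step are illustrative choices, and the data comes from `make_classification` rather than the real transactions. Wrapping the scaler in a `Pipeline` means it is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the transactions (hypothetical).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler's statistics are learned from X_train only, then reused at predict time.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(y_pred[:10])
```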

Choose different models: Logistic Regression, SVM, Random Forest, Gradient Boosting, and XGBoost classifiers. Train each model and compare their performance.
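The comparison loop might be organized like the sketch below, which trains each candidate inside the same scaling pipeline and records precision and recall. It uses synthetic data and default hyperparameters, so the numbers it prints say nothing about the real dataset. XGBoost is left out to keep the example dependent on scikit-learn only; if the `xgboost` package is installed, `xgboost.XGBClassifier` can be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (hypothetical stand-in for the transactions).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model behind the same scaler and collect its test-set scores.
results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(f"{name}: precision={scores['precision']:.2f}, recall={scores['recall']:.2f}")
```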

Evaluate performance with the confusion matrix, accuracy, precision, recall, and F1 score.
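To make these metrics concrete, the sketch below computes them on a tiny hand-made example (the labels are invented for illustration). It also scores an "always normal" baseline, which shows why accuracy alone is misleading on imbalanced data: the baseline matches the model's accuracy while catching no fraud at all.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels: 8 normal (0) and 2 fraud (1) transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)

# Baseline that predicts "normal" for everything.
baseline = [0] * len(y_true)
baseline_accuracy = accuracy_score(y_true, baseline)
baseline_recall = recall_score(y_true, baseline, zero_division=0)
print(baseline_accuracy, baseline_recall)
```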

XGBoost showed the best precision and recall.

Use AutoML to find an optimized model.