
Fraud Detection on Credit Card Transactions

 

TEAM: Michel Zou, Jiang Hao, Yung-Yu Lin, Keng-Chu Lin

GOAL:

  • Identify the potential fraud within credit card transactions and find the pattern
  • Find the best fraud detection model, one that is stable and productive

MODELS: Logistic Regression, Bagging Tree & Random Forest, Gradient Boosting Tree, Support Vector Machine, Neural Network

TOOLS: R Studio


In this project, our team worked on building a supervised learning model that predicts fraud based on a credit card payment transaction dataset. The supervised model could be used to detect lost/stolen cards or fraudulent transactions made by a merchant or cardholder.

#Summary of dataset

The credit card transaction dataset has information about the time, location, merchant, transaction type, and dollar amount of each transaction in the year 2010. The entire data set has more than ninety thousand records and contains missing values. Besides the transaction information, the data set also has a 0/1 fraud label for each transaction.

 

 

#recovering missing values

We noticed that there are missing values in the field MERCHNUM - 3,351 records are missing this value. In order to analyze fraud as precisely as possible, we believed we needed to find a way to recover those missing values.
We decided to create a new variable named MERCHUID, meaning Unique ID for Merchants, and designed an algorithm to generate it: MERCHUID equals the first three digits of MERCHZIP concatenated with the MD5 hash of MERCHDESCRIPTION. We used only the first three digits of MERCHZIP because we assumed there are stores with the same store description in different locations. We then fill in MERCHUID by the following rule: use MERCHNUM if it exists, otherwise generate MERCHUID with our algorithm.
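The rule above can be expressed as a short R snippet. This is a minimal sketch, assuming the transactions live in a data frame named `trans` and the description field is called MERCHDESCRIPTION; the `digest` package supplies the MD5 hash.

```r
library(digest)  # provides the MD5 hash used for MERCHUID

# first 3 digits of the merchant zip code + MD5 hash of the merchant description
make_merchuid <- function(zip, desc) {
  paste0(substr(as.character(zip), 1, 3),
         vapply(as.character(desc), digest, character(1), algo = "md5"))
}

# keep MERCHNUM when it is present, otherwise fall back to the generated MERCHUID
trans$MERCHUID <- ifelse(
  !is.na(trans$MERCHNUM) & trans$MERCHNUM != "",
  trans$MERCHNUM,
  make_merchuid(trans$MERCHZIP, trans$MERCHDESCRIPTION)
)
```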

 

 

#creating variables

We started by identifying entities in the credit card data, which is the starting point for creating special variables. We thought of account/card number, geography, merchant (number or name), day of week, day of month, month of year, and $ amount. Next we thought through some potential fraud behavior that we might see at some of these entity levels, which led us to think of signals where we might observe that behavior.

We considered behavior such as strange activity at some of these entity levels, in particular the card number and merchant number. At these two entity levels we could look for unusual $ transactions, unusual time bursts of activity, unusual geographic activity, etc. With these concepts considered, we can start building potential variables that might be signals of fraud. Some variables we started sketching out include:

At either/both the account # and merchant # levels,

  • #transactions in the last n days (try n = 1, 2, 3, 7, …) over a 90-day time window

  • average $ amount in the last n days over a 90-day time window

  • is this a new zip code considering the last n transactions or the time window? (0/1 binary variable)

 

In this project, we focus only on the first two special variables. We build 16 variables covering #transactions and $ amount in the last n (n = 1, 2, 3, 7) days at both the card and merchant levels.
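As a rough illustration, these rolling counts and sums could be built in R with data.table along the lines of the sketch below. The column names CARDNUM, DATE, and AMOUNT are assumptions rather than names from the original data dictionary, and only the card-level loop is shown; the merchant-level versions are identical with MERCHNUM as the grouping key.

```r
library(data.table)

dt <- as.data.table(trans)
dt[, DATE := as.Date(DATE)]
setorder(dt, CARDNUM, DATE)

# card-level features; repeat with by = MERCHNUM for the merchant-level versions
for (n in c(1, 2, 3, 7)) {
  # number of transactions by the same card in the last n days
  dt[, (paste0("card_ntrans_", n, "d")) :=
       sapply(seq_len(.N), function(i) sum(DATE > DATE[i] - n & DATE <= DATE[i])),
     by = CARDNUM]
  # total $ amount by the same card in the last n days
  dt[, (paste0("card_amount_", n, "d")) :=
       sapply(seq_len(.N), function(i) sum(AMOUNT[DATE > DATE[i] - n & DATE <= DATE[i]])),
     by = CARDNUM]
}
```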

Since those new variables have different granularities, our algorithm applies z-score scaling within each variable. To be more specific, values accumulated over the past 90 days will naturally be larger than those over the past 45 days, so we rescale all values within the 16 variables to put them on the same scale.

The following are the 16 variables we used in our model:

  • z-scored #trans in the past 1/2/3/7 days at the card level
  • z-scored $ amount in the past 1/2/3/7 days at the card level
  • z-scored #trans in the past 1/2/3/7 days at the merchant level
  • z-scored $ amount in the past 1/2/3/7 days at the merchant level
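Below is a minimal sketch of the z-score scaling step described above, applied to the engineered feature columns; the column-name pattern carries over from the previous sketch and is an assumption.

```r
# z-score each engineered feature so that windows of different lengths
# end up on the same scale
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

feature_cols <- grep("^(card|merch)_(ntrans|amount)_\\dd$", names(dt), value = TRUE)
dt[, (feature_cols) := lapply(.SD, zscore), .SDcols = feature_cols]
```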

 

#separating data set

We separate the entire credit card transaction data into three parts: training data, testing data, and out-of-time validation data. The training data is used to build the model; the testing data is used to test the accuracy of the model; and the out-of-time validation data is meant to test how the model will work in the future. There are 95,271 observations in the entire data set, covering transactions from January 2010 to December 2010. We use transactions from January 1 to August 31 as training and testing data, and transactions from September 1 to the end of December as out-of-time validation. Within the data from January 1 to August 31, we randomly select 80% as training data and the remaining 20% as testing data. The detail of the data separation is shown in the following table:
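As a rough illustration, this split could be reproduced in R as sketched below; the DATE column and the `dt` object carry over from the earlier sketches, and the seed is arbitrary.

```r
set.seed(2010)  # arbitrary seed so the 80/20 split is reproducible

in_time <- dt[DATE <= as.Date("2010-08-31")]   # January 1 - August 31
oot     <- dt[DATE >= as.Date("2010-09-01")]   # September 1 - December 31, out-of-time validation

train_idx <- sample(nrow(in_time), size = round(0.8 * nrow(in_time)))
train <- in_time[train_idx]   # 80% of the in-time data
test  <- in_time[-train_idx]  # remaining 20%
```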

 
 

#model comparison

We used 6 different modeling techniques with the previous 16 new variables. The objective of our model is to predict the probability of fraud for each transaction. The benchmark for model comparison is the fraud detection rate at a certain percentile, which means ranking the output from the highest potential fraud to the lowest and computing the fraud detection rate at that percentile (in other words, the true positive rate within the top 3%/10%/20% of potential fraud transactions). The rationale for using this benchmark, instead of a traditional statistical benchmark, is that in the real fraud detection business a company might not have the time or computing power to examine all card transactions, but it is able to investigate the few top potential frauds. The process of our model comparison is to build the models with the training data, select the best model using the testing data, and test how that model works in the future using the validation data.

*fraud detection rate = #true positives / (#true positives + #false negatives)
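For illustration, the benchmark can be computed with a small helper like the sketch below, assuming `score` holds a model's predicted fraud probabilities and `label` is the 0/1 fraud flag (the FRAUD column name used in the example call is hypothetical).

```r
# fraud detection rate at a given top percentile:
# frauds caught among the highest-scored pct of transactions, divided by all frauds
detection_rate <- function(score, label, pct = 0.03) {
  cutoff  <- quantile(score, probs = 1 - pct)
  flagged <- score >= cutoff                    # transactions ranked in the top pct
  sum(label[flagged] == 1) / sum(label == 1)
}

# e.g. detection_rate(pred_prob, test$FRAUD, pct = 0.03) for the top 3%
```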

 
 

The following chart shows the performance of the different modeling techniques - the fraud detection rate at the 3rd, 10th, and 20th percentiles - when we use the testing data to test them. From a fraud detection business standpoint, there is a conflict between marketing efforts and risk management. In other words, getting more card transactions and increasing the customer base also increases the risk of fraud activity. Where should we set the cutoff? How many potential frauds should we try to detect? In this project, we focus on the top 3 percent of potential fraud transactions. Looking at the fraud detection rate at the 3rd percentile, random forest is the best technique.

 
 

 

#modeling details & out-of-time test

random forest

Random forest has the best result, a 39% fraud detection rate, when the company only examines the top 3% of potential frauds. When we use the out-of-time validation data (September to December) to test this model, the result stays roughly consistent with the testing data. Looking at the variable importance chart, the random forest model shows that abnormal dollar amounts and numbers of transactions within 7 days at the credit card level are the top 2 factors for detecting fraud.
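A minimal sketch of fitting such a random forest with the randomForest package is shown below; the FRAUD label column and the tuning values are assumptions rather than the settings used in the original model.

```r
library(randomForest)

# train is assumed to hold the 16 z-scored features plus the 0/1 FRAUD label
rf_fit <- randomForest(as.factor(FRAUD) ~ ., data = train,
                       ntree = 500, importance = TRUE)

# predicted probability of fraud on the testing and out-of-time data
rf_prob_test <- predict(rf_fit, newdata = test, type = "prob")[, "1"]
rf_prob_oot  <- predict(rf_fit, newdata = oot,  type = "prob")[, "1"]

detection_rate(rf_prob_test, test$FRAUD, pct = 0.03)  # fraud detection rate at top 3%

varImpPlot(rf_fit)  # variable importance chart
```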

The chart below further explains the performance of this model when tested on the out-of-time validation data. We separate the records into 100 small groups, ordered by the predicted probability of fraud, so we can see how many extra frauds we would discover by examining one more group.

 
 

bagging tree

 
 

neural network

 
 
 

sample R code for all the modeling
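Below is a hedged sketch of how the six models could be fit in R; the package choices, tuning values, and the FRAUD column name are assumptions, not the team's original code.

```r
library(randomForest)  # bagging tree & random forest
library(gbm)           # gradient boosting tree
library(e1071)         # support vector machine
library(nnet)          # neural network

# train is assumed to hold the 16 z-scored features plus the 0/1 FRAUD label

# logistic regression
glm_fit <- glm(FRAUD ~ ., data = train, family = binomial)

# bagging tree: a random forest that considers every predictor at each split
bag_fit <- randomForest(as.factor(FRAUD) ~ ., data = train,
                        mtry = ncol(train) - 1, ntree = 500)

# random forest with the default mtry
rf_fit <- randomForest(as.factor(FRAUD) ~ ., data = train, ntree = 500)

# gradient boosting tree
gbm_fit <- gbm(FRAUD ~ ., data = as.data.frame(train), distribution = "bernoulli",
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)

# support vector machine with probability estimates
svm_fit <- svm(as.factor(FRAUD) ~ ., data = train, probability = TRUE)

# single-hidden-layer neural network
nn_fit <- nnet(as.factor(FRAUD) ~ ., data = train,
               size = 8, decay = 1e-3, maxit = 500)
```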