GIVE ME SOME CREDIT


| Team | Goal | Models | Tools |
|---|---|---|---|
| Wenjie Xu | Predicting loan default | QDA & LDA | R Studio |
| Michel Zou | | Logistic Regression | |
| Stephanie Wang | | KNN | |
| Keng-chu Lin | | Decision Tree & Random Forest | |
| Luou Meng | | ROC Curve | |


 

# To build a model that predicts the probability of default to help banks determine whether or not a loan should be granted

Banks play a crucial role in market economies. They decide who can get finance and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
The goal of this project is to build a model that borrowers can use to help make the best financial decisions.
This is a Kaggle competition: kaggle.com/c/GiveMeSomeCredit

 

# Analysis Methodology

# Data Set Analysis

Total Observations: 150,000

Training Data: 60,000

Testing Data: 90,000

Data Source: Kaggle.com

Defaulter instances: 6.68% of observations
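
The class imbalance matters when reading error rates later on. As a rough sketch (written in Python purely for illustration; the project itself used R Studio), the arithmetic implied by the figures above can be worked out as follows. It assumes the 6.68% rate describes all 150,000 observations, so the defaulter count is only an approximation derived from that rounded figure.

```python
# Figures taken from the data set summary above.
TOTAL_OBSERVATIONS = 150_000
TRAINING_ROWS = 60_000
TESTING_ROWS = 90_000
DEFAULT_RATE = 0.0668  # assumption: applies to all 150,000 observations

# Share of the data used for training.
training_share = TRAINING_ROWS / TOTAL_OBSERVATIONS

# Approximate number of defaulters implied by the rounded rate.
approx_defaulters = round(TOTAL_OBSERVATIONS * DEFAULT_RATE)

# A trivial classifier that never predicts default is wrong only on
# defaulters, so its misclassification rate equals the default rate.
baseline_error = DEFAULT_RATE
baseline_accuracy = 1 - DEFAULT_RATE

print(f"training share:     {training_share:.0%}")
print(f"approx. defaulters: {approx_defaulters}")
print(f"baseline accuracy:  {baseline_accuracy:.2%}")
```

Any model's misclassification rate is therefore best judged against this trivial baseline rather than against zero.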


# Dependent Variable

1. Serious Delinquency in 2 Years

# Independent Variables

  1. Debt Ratio 

  2. Monthly Income

  3. Revolving Utilization

  4. Borrower’s Age

  5. Number of Dependents

  6. Number of Open Credit Loans

  7. Number of Real Estate Loans

  8. Number of Times 30-59 Days Past Due, Not Worse, in the Past 2 Years

  9. Number of Times 60-89 Days Past Due, Not Worse, in the Past 2 Years

  10. Number of Times 90 Days Late
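
To connect the variable labels above to the competition data, the following Python sketch (for illustration only; the project itself used R Studio) maps each label to a column name and splits one record into predictors and target. The column names are an assumption about the competition's training CSV and should be checked against the file header. `split_row` and the sample row are hypothetical.

```python
# Assumed column names in the competition CSV; verify against the file header.
TARGET_COLUMN = "SeriousDlqin2yrs"

FEATURE_COLUMNS = {
    "Debt Ratio": "DebtRatio",
    "Monthly Income": "MonthlyIncome",
    "Revolving Utilization": "RevolvingUtilizationOfUnsecuredLines",
    "Borrower's Age": "age",
    "Number of Dependents": "NumberOfDependents",
    "Number of Open Credit Loans": "NumberOfOpenCreditLinesAndLoans",
    "Number of Real Estate Loans": "NumberRealEstateLoansOrLines",
    "Times 30-59 Days Past Due": "NumberOfTime30-59DaysPastDueNotWorse",
    "Times 60-89 Days Past Due": "NumberOfTime60-89DaysPastDueNotWorse",
    "Times 90 Days Late": "NumberOfTimes90DaysLate",
}


def split_row(row):
    """Split one record (a dict keyed by column name) into (features, target)."""
    features = {label: row[column] for label, column in FEATURE_COLUMNS.items()}
    return features, row[TARGET_COLUMN]


# Hypothetical record, made up for illustration (not taken from the data set).
sample_row = {column: 0 for column in FEATURE_COLUMNS.values()}
sample_row.update({"age": 45, "MonthlyIncome": 5000, TARGET_COLUMN: 0})

sample_features, sample_target = split_row(sample_row)
```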


 

# Statistical Models & Data Mining

1. KNN

2. Logistic Regression

3. Decision Tree

4. Random Forest

5. Model Comparison

6. ROC Curve

 
 

Based on the ROC curve, Random Forest is the better model.
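
For readers unfamiliar with how ROC curves are compared, here is a minimal Python sketch (for illustration only; the project itself used R Studio). It builds ROC points by sweeping a threshold over predicted scores and computes the area under the curve (AUC) with the trapezoid rule. The labels and the two score lists are made-up toy values, not results from this project, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, threshold swept high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Handle all observations tied at this score together.
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 1 = default, 0 = no default (hypothetical, for illustration).
y_true = [1, 0, 1, 0, 0, 0, 1, 0]
scores_model_a = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]
scores_model_b = [0.6, 0.7, 0.4, 0.5, 0.3, 0.2, 0.8, 0.1]

auc_a = auc(roc_points(y_true, scores_model_a))
auc_b = auc(roc_points(y_true, scores_model_b))
print(f"AUC model A: {auc_a:.3f}, AUC model B: {auc_b:.3f}")
```

The model with the larger AUC ranks defaulters above non-defaulters more consistently across thresholds.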

 

 

# Conclusion

  • By misclassification rate, Random Forest is better.

  • By number of true positives, Decision Tree is better.

  • By ROC, Random Forest is better.
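
The first two points can disagree because misclassification rate and true positive count measure different things. The Python sketch below (for illustration only; the project itself used R Studio) shows this on made-up predictions from two hypothetical models. None of the numbers are results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), where 1 = default and 0 = no default."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def misclassification_rate(y_true, y_pred):
    """Share of observations whose predicted class is wrong."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (fp + fn) / (tp + fp + tn + fn)


# Toy data: 4 defaulters followed by 16 non-defaulters (hypothetical).
y_true = [1] * 4 + [0] * 16

# Cautious model: flags only one client, who is a real defaulter.
pred_cautious = [1, 0, 0, 0] + [0] * 16

# Aggressive model: catches three defaulters but also flags four good clients.
pred_aggressive = [1, 1, 1, 0] + [1] * 4 + [0] * 12

for name, pred in [("cautious", pred_cautious), ("aggressive", pred_aggressive)]:
    tp, fp, tn, fn = confusion_counts(y_true, pred)
    rate = misclassification_rate(y_true, pred)
    print(f"{name}: true positives={tp}, misclassification rate={rate:.2f}")
```

In this toy case the cautious model has the lower misclassification rate while the aggressive model finds more true positives, which is the same kind of split the points above describe.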

 

Choosing the best model in practice depends on several considerations:

❏ business goal of the bank
❏ true positive (successfully rejecting clients that will possibly default)
❏ false positive (rejecting quality clients, resulting in loss)
❏ true negative (obtaining quality clients)
❏ false negative (obtaining clients that will possibly default, resulting in loss)
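
One way to combine the considerations above is to attach a cost to each kind of error and compare models by total cost. The Python sketch below is for illustration only (the project itself used R Studio). The error counts and cost figures are hypothetical placeholders, not values from this project or from any bank.

```python
def total_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total loss: rejected good clients plus accepted clients who default."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Hypothetical error counts for two models on the same test set.
model_errors = {
    "cautious": {"false_positives": 0, "false_negatives": 3},
    "aggressive": {"false_positives": 4, "false_negatives": 1},
}


def cheapest_model(cost_fp, cost_fn):
    """Return the name of the model with the lowest total cost."""
    return min(
        model_errors,
        key=lambda name: total_cost(
            model_errors[name]["false_positives"],
            model_errors[name]["false_negatives"],
            cost_fp,
            cost_fn,
        ),
    )


# Scenario 1 (assumed): a default costs five times as much as a lost good client.
choice_when_defaults_expensive = cheapest_model(cost_fp=1, cost_fn=5)

# Scenario 2 (assumed): both kinds of error cost the same.
choice_when_costs_equal = cheapest_model(cost_fp=1, cost_fn=1)

print(choice_when_defaults_expensive, choice_when_costs_equal)
```

The preferred model changes with the assumed costs, which is why the bank's business goal has to be fixed before a single best model can be named.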