Loan Approval Logistic Regression Analysis

Brian Cervantes Alvarez
Wylea Walker

Wednesday, December 4, 2024

Introduction & Background

Loan approval decisions are pivotal in the financial sector, affecting both lenders’ risk management and borrowers’ access to funds. This project identifies the key demographic and financial factors that influence loan approval outcomes.

  • Dataset: 32,581 simulated loan applications
  • Methods: Logistic regression & ANOVA
  • Key Variables: Income, Employment Length, Home Ownership, Loan Intent, Credit History

Research Question

Which demographic and financial factors are most strongly linked to an increased likelihood of loan approval?

Logistic Regression

Methods

Data Description

  • Demographics: Age, home ownership.
  • Financial: Income, employment length, loan amount, loan intent, interest rate, loan-to-income ratio.
  • Credit History: Default status, credit history length, credit score.
  • Outcome: Loan approval status.

Data Preparation

  • Converted categorical variables to factors.
  • Computed a correlation matrix to screen for highly correlated predictors.
  • Excluded person_age and loan_grade to avoid multicollinearity.
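The correlation screening step can be sketched as follows. This is a minimal illustration, not the analysis code used for the project: the toy column values and the 0.8 cutoff are assumptions made for the example, and only the column names come from the dataset.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values, invented for illustration (not rows from the dataset).
columns = {
    "person_age": [22, 25, 31, 40, 52],
    "cb_person_cred_hist_length": [2, 3, 8, 14, 25],
    "loan_amnt": [9000, 3000, 12000, 5000, 7000],
}

THRESHOLD = 0.8  # assumed cutoff for "highly correlated"

# Collect every pair of columns whose absolute correlation exceeds the cutoff.
flagged = []
for a, b in combinations(columns, 2):
    r = pearson(columns[a], columns[b])
    if abs(r) > THRESHOLD:
        flagged.append((a, b, round(r, 3)))

print(flagged)
```

Pairs that appear in `flagged` are candidates for dropping one of the two variables before fitting the model.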

Statistical Modeling: Logistic regression to model loan approval probability.

Multicollinearity

Model Specification

We employed logistic regression to model the probability of loan approval based on predictor variables. The logistic regression model is defined as:

\log\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i,

where Y is the loan approval status, \beta_0 is the intercept, \beta_i are the coefficients, and X_i are the predictor variables.
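To make the log-odds form concrete, the sketch below inverts the equation to recover P(Y = 1) and converts a coefficient into an odds ratio. The intercept, coefficients, and predictor values are hypothetical numbers chosen for illustration; they are not estimates from our fitted model.

```python
import math

def approval_probability(intercept, coefs, x):
    """Return P(Y = 1) given log-odds = beta_0 + sum_i beta_i * x_i."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(beta)

# Hypothetical values for illustration only; not fitted estimates.
beta_0 = -1.0
betas = [0.5, -0.25]
x = [2.0, 4.0]

# log-odds = -1.0 + 0.5 * 2.0 - 0.25 * 4.0 = -1.0
p = approval_probability(beta_0, betas, x)
print(f"P(Y = 1) = {p:.4f}")
print(f"odds ratio for first predictor = {odds_ratio(betas[0]):.4f}")
```

A positive coefficient gives an odds ratio above 1 (higher approval odds), and a negative coefficient gives an odds ratio below 1.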

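Applied to our predictors, the model is specified as follows. Categorical predictors enter as indicator variables, so each of their \beta terms stands for one coefficient per non-reference level: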

\begin{align*} \log\left(\frac{P(\text{loan\_status} = 1)}{1 - P(\text{loan\_status} = 1)}\right) = & \beta_0 + \beta_1 \cdot \text{person\_income} + \beta_2 \cdot \text{person\_home\_ownership} \\ & + \beta_3 \cdot \text{person\_emp\_length} \\ & + \beta_4 \cdot \text{loan\_intent} + \beta_5 \cdot \text{loan\_amnt} + \beta_6 \cdot \text{loan\_int\_rate} \\ & + \beta_7 \cdot \text{loan\_percent\_income} + \beta_8 \cdot \text{cb\_person\_default\_on\_file} \\ & + \beta_9 \cdot \text{cb\_person\_cred\_hist\_length} + \beta_{10} \cdot \text{credit\_score}. \end{align*}

Model Fit

Variable                    GVIF      Df  GVIF^(1/(2*Df))
person_income               1.227036   1  1.107717
person_home_ownership       1.151364   3  1.023769
person_emp_length           1.076528   1  1.037559
loan_intent                 1.058406   5  1.005693
loan_amnt                   2.112565   1  1.453466
loan_int_rate               1.351832   1  1.162683
loan_percent_income         2.115208   1  1.454375
cb_person_default_on_file   1.256489   1  1.120932
cb_person_cred_hist_length  1.027444   1  1.013629

GVIF Values: All below 5 → No multicollinearity concerns.
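The last column of the table is the GVIF raised to the power 1/(2·Df), which makes terms with different degrees of freedom comparable. The sketch below recomputes that column from the GVIF and Df values in the table. Squaring the adjusted value to compare it against the usual VIF cutoff is a common convention that we assume here, not a step taken from the original analysis.

```python
# (GVIF, Df, reported GVIF^(1/(2*Df))) copied from the table above.
gvif_table = {
    "person_income": (1.227036, 1, 1.107717),
    "person_home_ownership": (1.151364, 3, 1.023769),
    "person_emp_length": (1.076528, 1, 1.037559),
    "loan_intent": (1.058406, 5, 1.005693),
    "loan_amnt": (2.112565, 1, 1.453466),
    "loan_int_rate": (1.351832, 1, 1.162683),
    "loan_percent_income": (2.115208, 1, 1.454375),
    "cb_person_default_on_file": (1.256489, 1, 1.120932),
    "cb_person_cred_hist_length": (1.027444, 1, 1.013629),
}

def adjusted_gvif(gvif, df):
    """Degrees-of-freedom adjusted GVIF: GVIF^(1/(2*Df))."""
    return gvif ** (1.0 / (2 * df))

adjusted = {name: adjusted_gvif(g, df) for name, (g, df, _) in gvif_table.items()}

for name, value in adjusted.items():
    # The squared adjusted value is on the same scale as an ordinary VIF.
    print(f"{name:<28} adjusted={value:.6f}  squared={value ** 2:.6f}")
```

For terms with one degree of freedom, the squared adjusted value is simply the GVIF itself.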

Results

Significant Predictors:

  • Income: Higher income ↑ approval odds.
  • Home Ownership: Renting ↑, Owning ↓ approval odds.
  • Employment Length: Longer employment ↓ approval odds.
  • Loan Amount: Higher amount ↓ approval odds.
  • Loan Intent: Most intents ↓ approval odds except Home Improvement.
  • Interest Rate & Loan-to-Income: Higher values ↑ approval odds.
  • Credit Score: Higher scores ↓ approval odds (unexpected).
  • Historical Default: Marginal effect.
  • Credit History Length: No significant effect.

Residual Analysis

ANOVA

Do Loan Interest Rates Differ by Loan Intent?

  • Null Hypothesis (H_0): The mean Loan Interest Rate is the same across all Loan Intent categories.
  • Alternative Hypothesis (H_A): At least one Loan Intent category has a different mean Loan Interest Rate.
               Df Sum Sq Mean Sq F value  Pr(>F)   
loan_intent     5    178   35.69     3.4 0.00451 **
Residuals   29459 309211   10.50                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
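3116 observations deleted due to missingness

Result: p = 0.00451 < 0.05, so we reject H_0 at the conventional 5% level; at least one Loan Intent category has a different mean Loan Interest Rate.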
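For reference, the F value in a one-way ANOVA table comes from splitting the total variability into between-group and within-group sums of squares. The sketch below shows that computation on hypothetical toy interest rates; the group values are invented for illustration and are not taken from the dataset.

```python
def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a one-way ANOVA.

    `groups` maps a group label to a list of numeric observations.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Between-group: how far each group mean sits from the grand mean.
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Within-group: spread of observations around their own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f_value": f_value,
    }

# Hypothetical toy interest rates (percent) for three loan intents.
toy_rates = {
    "EDUCATION": [10.0, 11.0, 12.0],
    "MEDICAL": [11.0, 12.0, 13.0],
    "VENTURE": [12.0, 13.0, 14.0],
}

result = one_way_anova(toy_rates)
print(result)
```

The p-value is then obtained by comparing the F value with an F distribution on (df_between, df_within) degrees of freedom, which is omitted here to keep the sketch dependency-free.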

Violin|Box Plots

Does Loan Percent Income Differ by Home Ownership?

  • Null Hypothesis (H_0): The mean Loan Percent Income is the same across all Home Ownership statuses.
  • Alternative Hypothesis (H_A): At least one Home Ownership status has a different mean Loan Percent Income.
                         Df Sum Sq Mean Sq F value Pr(>F)    
person_home_ownership     3    8.2  2.7494   246.6 <2e-16 ***
Residuals             32577  363.2  0.0112                   
---
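Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result: p < 2e-16, so we reject H_0; mean Loan Percent Income differs for at least one Home Ownership status.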
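With samples this large, even small differences in means can be statistically significant, so it can help to look at effect size as well. The sketch below computes eta squared (the share of total variance explained by the grouping factor) from the sums of squares printed in the two ANOVA tables. Eta squared was not part of the original analysis and is added here as an assumption about what a reader might want to check. Because the printed sums of squares are rounded, the recomputed F values only approximately match the tables.

```python
def eta_squared(ss_effect, ss_residual):
    """Proportion of total variance attributed to the grouping factor."""
    return ss_effect / (ss_effect + ss_residual)

def f_statistic(ss_effect, df_effect, ss_residual, df_residual):
    """F = (effect mean square) / (residual mean square)."""
    return (ss_effect / df_effect) / (ss_residual / df_residual)

# Sums of squares and degrees of freedom as printed in the two ANOVA tables.
intent = dict(ss_effect=178.0, df_effect=5, ss_residual=309211.0, df_residual=29459)
ownership = dict(ss_effect=8.2, df_effect=3, ss_residual=363.2, df_residual=32577)

eta_intent = eta_squared(intent["ss_effect"], intent["ss_residual"])
eta_ownership = eta_squared(ownership["ss_effect"], ownership["ss_residual"])
f_intent = f_statistic(**intent)
f_ownership = f_statistic(**ownership)

print(f"interest rate ~ loan intent:       F = {f_intent:.2f}, eta^2 = {eta_intent:.5f}")
print(f"percent income ~ home ownership:   F = {f_ownership:.1f}, eta^2 = {eta_ownership:.4f}")
```

Under these inputs, loan intent accounts for well under 1% of the variance in interest rates, while home ownership accounts for roughly 2% of the variance in loan percent income.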

Violin|Box Plots

Conclusion

Key Findings: Higher income and renting increase loan approval odds; higher credit scores surprisingly decrease odds.

Practical Implications

  • For Lenders: Refine criteria focusing on income and loan amounts.
  • For Applicants: Highlight income and request moderate loan amounts.

Limitations: The unexpected credit score relationship suggests data or model issues; the dataset is simulated and its generation process is unknown.
