Credit Risk Modeling in Finance: A Hands-On R Case Study

Source: 数据人网, http://www.shujuren.org/article/113.html

Cynthia Li, CFA

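Bank Loans: Credit Default

1. Definition

* An agreement between a bank and a borrower: the bank extends a loan, and the borrower repays principal and interest in installments.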

2. Expected loss (EL)

  • Components:
    • Probability of default (PD)
    • Exposure at default (EAD)
    • Loss given default (LGD), expressed as a percentage of EAD
  • Formula: EL = PD × EAD × LGD
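
As a quick illustration of the formula, the sketch below computes the expected loss for a single loan. The input values are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the article's data.

```{r}
# Hypothetical inputs for one loan (illustrative values only)
pd  <- 0.02      # probability of default: 2%
ead <- 100000    # exposure at default, in currency units
lgd <- 0.40      # loss given default: 40% of EAD

# EL = PD x EAD x LGD
expected_loss <- pd * ead * lgd
expected_loss    # 800
```

Because LGD is a fraction of EAD, the product EAD × LGD is the amount lost if the borrower defaults, and PD weights that amount by how likely default is.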

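3. Information banks use to analyze credit risk

  • Application information
    • Income
    • Marital status
  • Applicant information
    • Current account balance
    • History of payment arrears

4. Introducing the case-study data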

```{r, echo=FALSE, message=FALSE}
library(foreign)   # provides read.spss()
library(gmodels)   # provides CrossTable()
loan.data = read.spss("Loan_ROC.sav", to.data.frame=TRUE)
```

```{r}
head(loan.data,10)
# One-way frequency table of education
CrossTable(loan.data$education)
# Education by loan status, showing row proportions only
CrossTable(loan.data$education, loan.data$Loan, prop.r = T, prop.c = F, prop.t = F, prop.chisq = F)
```

  • There are `r nrow(loan.data)` observations.
  • The second table gives the row-wise proportion of default within each education category (note the counterintuitive result and think about what might explain it).
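
For readers without the SPSS file, the same row-wise view can be sketched in base R. The data frame below is a made-up toy example, and `prop.table` with `margin = 1` is used here as a stand-in for the `prop.r = T` output of `CrossTable`.

```{r}
# Toy data (hypothetical, not the article's Loan_ROC.sav)
toy <- data.frame(
  education = c("High school", "High school", "High school",
                "College", "College", "Graduate"),
  default   = c("No", "No", "Yes", "No", "Yes", "Yes")
)

counts <- table(toy$education, toy$default)

# margin = 1 normalizes each row, so the proportions within
# each education level sum to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

Reading across a row answers "among applicants with this education level, what share defaulted?", which is the comparison the note above asks you to examine.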

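5. Data analysis: histograms and outliers

1. Histograms

    # Debt-to-income ratio, stored as a percentage (ratio * 100)
    hist(loan.data$debt_income, main = 'Histogram of debt-to-income ratio (*100)', xlab = 'Debt-to-income ratio')
    # Household income with the default bins
    hist(loan.data$income, main = 'Histogram of household income in thousands', xlab = 'Income')
    # Same histogram; $breaks returns the bin boundaries R chose
    hist(loan.data$income, main = 'Histogram of household income in thousands', xlab = 'Income')$breaks
    # Request roughly sqrt(n) bins to reveal more detail
    hist(loan.data$income, breaks = sqrt(nrow(loan.data)), main = 'Histogram of income with breaks argument', xlab = 'Income')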
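
One detail worth knowing about the `breaks` argument: when it is a single number, `hist` treats it as a suggestion and rounds to "pretty" boundaries, so the bin count may differ from the number requested. The sketch below checks this on simulated data (the log-normal incomes are an assumption made for illustration, not the article's data).

```{r}
set.seed(42)
# Simulated right-skewed incomes (hypothetical)
toy_income <- rlnorm(400, meanlog = 3.5, sdlog = 0.6)

# plot = FALSE returns the histogram object without drawing it
h_default <- hist(toy_income, plot = FALSE)
h_sqrt    <- hist(toy_income, breaks = sqrt(length(toy_income)), plot = FALSE)

length(h_default$breaks) - 1   # bins under the default rule
length(h_sqrt$breaks) - 1      # bins when about sqrt(400) = 20 are requested
```

To force exact boundaries, pass a vector of break points instead of a single number.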

2. Outliers

    # Highlight incomes of 150 (thousand) or more in orange
    plot(loan.data$income,col=ifelse(loan.data$income>=150,"orange","black"), pch=ifelse(loan.data$income>=150,19,1), ylab = 'Income')

  • How to identify outliers:
    • Domain knowledge
    • Rule of thumb: flag values outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR
    • A combination of both

    # Domain-knowledge cutoff: income above 150
    # (-which() assumes at least one match; with no matches it would drop every row)
    outlier1 = which(loan.data$income>150)
    data.nooutlier1 = loan.data[-outlier1,]
    # Rule-of-thumb cutoff: Q3 + 1.5 * IQR
    outlier.cutoff = quantile(loan.data$income, 0.75) + 1.5 * IQR(loan.data$income)
    outlier2 = which(loan.data$income>outlier.cutoff)
    data.nooutlier2 = loan.data[-outlier2,]
    hist(data.nooutlier1$income, breaks = sqrt(nrow(data.nooutlier1)), main = 'Histogram of income without outliers', xlab = 'Income')
    # Income against years with current employer, outliers highlighted
    plot(loan.data$year_emp, loan.data$income, col=ifelse(loan.data$income>=150,"orange","black"), pch=ifelse(loan.data$income>=150,19,1), main = 'Bivariate plot', xlab = 'Years with current employer', ylab = 'Income')

  • A bivariate scatterplot helps to check for outliers across two variables at once.
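
The rule-of-thumb cutoff can be traced by hand on a small vector. The sketch below uses ten made-up income values and also shows a pitfall of removing rows with `-which(...)`: if nothing matches, the negative index is empty and every element is dropped. Logical indexing is one way around this, assuming the column has no missing values.

```{r}
# Hypothetical incomes in thousands (not the article's data)
inc <- c(20, 25, 28, 30, 32, 35, 38, 40, 45, 300)

q     <- quantile(inc, c(0.25, 0.75))
lower <- q[[1]] - 1.5 * IQR(inc)
upper <- q[[2]] + 1.5 * IQR(inc)

is_outlier <- inc < lower | inc > upper
inc_clean  <- inc[!is_outlier]      # drops the value 300

# Pitfall: an empty which() result used as a negative index
none <- which(inc > 1000)           # integer(0): nothing exceeds 1000
length(inc[-none])                  # 0, everything was dropped
length(inc[!(inc > 1000)])          # 10, nothing was dropped
```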

6. Data analysis: missing values

    # TRUE for every column that contains no missing values
    sapply(loan.data, function(x) all(!is.na(x)))

    # Introduce 10 missing values into a copy of the data for illustration
    temp = loan.data
    set.seed(10)
    temp$year_emp[sample(1:nrow(loan.data),10)] = NA
    summary(temp$year_emp)
    na.index = which(is.na(temp$year_emp))

    # Option 1: delete the rows with missing values
    data.nona1 = temp[-na.index,]

    # Option 2: replace missing values with the median
    data.nona2 = temp
    data.nona2$year_emp[na.index] = median(temp$year_emp, na.rm=T)

    # Option 3: keep them by binning year_emp and adding a "Missing" category
    temp$year_emp_cat <- rep(NA, length(temp$year_emp))
    temp$year_emp_cat[which(temp$year_emp <= 5)] <- "0-5"
    temp$year_emp_cat[which(temp$year_emp > 5 & temp$year_emp <= 10)] <- "5-10"
    temp$year_emp_cat[which(temp$year_emp > 10 & temp$year_emp <= 15)] <- "10-15"
    temp$year_emp_cat[which(temp$year_emp > 15 & temp$year_emp <= 20)] <- "15-20"
    temp$year_emp_cat[which(temp$year_emp > 20)] <- "20+"
    temp$year_emp_cat[which(is.na(temp$year_emp))] <- "Missing"
    temp$year_emp_cat <- factor(temp$year_emp_cat, levels = c("0-5","5-10","10-15","15-20","20+","Missing"))
    plot(temp$year_emp_cat)

  • Ways to handle missing values (these also apply to outliers):
    • Delete the row or column (works for both continuous and categorical data)
    • Replace (four main approaches, e.g. the median for continuous data or the most frequent category for categorical data; to be covered in a later article)
    • Keep (bin continuous data; create an NA category for categorical data)
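
The three options can be compared side by side on a short vector. This is a sketch on made-up values, and it swaps the step-by-step binning for `cut()` plus `addNA()`, which is one compact alternative rather than the article's own code. With its default right-closed intervals, `cut()` reproduces the same bin edges (`<= 5`, `> 5 & <= 10`, and so on).

```{r}
# Hypothetical years with current employer, two values missing
yrs <- c(2, 7, NA, 12, 18, 25, NA, 4)

# Option 1: delete the missing values
yrs_deleted <- yrs[!is.na(yrs)]

# Option 2: replace them with the median of the observed values
yrs_imputed <- ifelse(is.na(yrs), median(yrs, na.rm = TRUE), yrs)

# Option 3: keep them by binning and adding an explicit "Missing" level
yrs_cat <- cut(yrs, breaks = c(-Inf, 5, 10, 15, 20, Inf),
               labels = c("0-5", "5-10", "10-15", "15-20", "20+"))
yrs_cat <- addNA(yrs_cat)                                  # NA becomes a level
levels(yrs_cat)[is.na(levels(yrs_cat))] <- "Missing"       # rename that level
table(yrs_cat)
```

Keeping a "Missing" category lets a later model learn whether missingness itself is related to default, which deletion and imputation both discard.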

7. Model evaluation: training/test sets and the confusion matrix

  • Accuracy = (TN+TP)/(TN+TP+FP+FN)
  • Sensitivity = TP/(TP+FN)
  • Specificity = TN/(TN+FP)
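    # Randomly assign two thirds of the rows to the training set
    set.seed(1)
    train.index <- sample(1:nrow(loan.data), 2/3 * nrow(loan.data))
    training <- loan.data[train.index, ]
    test <- loan.data[-train.index, ]
    # Once a model has produced predictions (model_pred), build the confusion matrix:
    #table(test$Loan, model_pred)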
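
Since the article leaves `model_pred` undefined, the sketch below uses hand-written labels to show how the three metrics fall out of a confusion matrix. The `actual` and `predicted` vectors are hypothetical and are not the output of any fitted model.

```{r}
# Hypothetical labels: 1 = default, 0 = no default
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

# Fixing the factor levels keeps the table 2 x 2 even if a class
# is never predicted
conf_mat <- table(actual    = factor(actual,    levels = c(0, 1)),
                  predicted = factor(predicted, levels = c(0, 1)))

TN <- conf_mat["0", "0"]; FP <- conf_mat["0", "1"]
FN <- conf_mat["1", "0"]; TP <- conf_mat["1", "1"]

accuracy    <- (TN + TP) / sum(conf_mat)   # 0.8
sensitivity <- TP / (TP + FN)              # 0.75
specificity <- TN / (TN + FP)              # 5/6
```

In a credit setting sensitivity is the share of actual defaulters the model catches, so a model with high accuracy but low sensitivity can still be costly.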

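Author and instructor at 数据人网: 李悦
Holds a master's degree from New York University (major: financial media) and works as a data analyst at a sell-side investment research firm in New York. Chartered Financial Analyst (CFA).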
