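Bank Loans: Credit Default
1. Definition
* An agreement between a bank and a borrower: the bank extends a loan or mortgage, and the borrower repays the principal plus interest.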
2. Expected loss (EL)
- Components:
  - Probability of default (PD)
  - Exposure at default (EAD)
  - Loss given default (LGD), expressed as a percentage of EAD
- Formula: EL = PD × EAD × LGD
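As a quick worked example of the formula above, the sketch below plugs in hypothetical values (a 2% PD, an exposure of 100,000, and a 40% LGD). None of these numbers come from the loan dataset; they are assumptions chosen only to make the arithmetic easy to follow.

```r
# Hypothetical inputs, chosen only for illustration
pd  <- 0.02     # probability of default (2%)
ead <- 100000   # exposure at default, in currency units
lgd <- 0.40     # loss given default, as a fraction of EAD

# Expected loss is the product of the three components
expected_loss <- pd * ead * lgd
print(expected_loss)   # 800
```

In words: on an exposure of 100,000, the bank would expect to lose 800 on average under these assumed inputs.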
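3. Information banks use to analyze credit risk
- Application form information
  - Income
  - Marital status
- Applicant information
  - Current account balance
  - History of payment arrears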
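4. The working dataset
```{r, echo=FALSE, message=FALSE}
library(foreign)   # read.spss() for SPSS .sav files
library(gmodels)   # CrossTable()
loan.data = read.spss("Loan_ROC.sav", to.data.frame = TRUE)
```
```{r}
head(loan.data, 10)
# One-way frequency table of education level
CrossTable(loan.data$education)
# Education by loan status, showing row proportions only
CrossTable(loan.data$education, loan.data$Loan, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
```
- There are `r nrow(loan.data)` observations.
- Row-wise proportion of default by education category (note the counterintuitive result and consider what might explain it).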
5. Data analysis: histograms and outliers
   1. Histograms
```{r}
hist(loan.data$debt_income, main = 'Histogram of debt-to-income ratio (*100)', xlab = 'Debt-to-income ratio')
hist(loan.data$income, main = 'Histogram of household income in thousands', xlab = 'Income')
# Inspect the break points that hist() chose by default
hist(loan.data$income, main = 'Histogram of household income in thousands', xlab = 'Income')$breaks
# A single number for breaks is treated as a suggested bin count; here roughly sqrt(n)
hist(loan.data$income, breaks = sqrt(nrow(loan.data)), main = 'Histogram of income with breaks argument', xlab = 'Income')
```
   2. Outliers
```{r}
# Highlight incomes of 150 (thousand) or more in orange
plot(loan.data$income, col = ifelse(loan.data$income >= 150, "orange", "black"), pch = ifelse(loan.data$income >= 150, 19, 1), ylab = 'Income')
```
- Identifying outliers:
  - Domain expertise
  - Rule of thumb: values outside the range from `Q1 - 1.5*IQR` to `Q3 + 1.5*IQR`
  - A combination of both
```{r}
# Approach 1: domain-based cutoff (income above 150 thousand)
# Note: negative indexing with which() drops every row if no outliers are found
outlier1 = which(loan.data$income > 150)
data.nooutlier1 = loan.data[-outlier1, ]
# Approach 2: rule-of-thumb upper cutoff, Q3 + 1.5 * IQR
outlier.cutoff = quantile(loan.data$income, 0.75) + 1.5 * IQR(loan.data$income)
outlier2 = which(loan.data$income > outlier.cutoff)
data.nooutlier2 = loan.data[-outlier2, ]
hist(data.nooutlier1$income, breaks = sqrt(nrow(data.nooutlier1)), main = 'Histogram of income without outliers', xlab = 'Income')
plot(loan.data$year_emp, loan.data$income, col = ifelse(loan.data$income >= 150, "orange", "black"), pch = ifelse(loan.data$income >= 150, 19, 1), main = 'Bivariate plot', xlab = 'Years with current employer', ylab = 'Income')
```
- A bivariate scatterplot helps reveal outliers across two variables at once.
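The rule-of-thumb cutoff in this section checks only the upper bound, which is usually what matters for income. For completeness, here is a minimal sketch of the two-sided version of the same 1.5 × IQR rule. The vector `x` is made-up toy data, not taken from `loan.data`, and the variable names are hypothetical.

```r
# Toy data (hypothetical): nine ordinary values and one extreme value
x <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 400)

# Quartiles and the interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Lower and upper fences of the 1.5 * IQR rule
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Logical indexing avoids the empty-index pitfall of x[-which(...)]
is_outlier <- x < lower | x > upper
x_clean    <- x[!is_outlier]

print(x[is_outlier])   # 400
```

Filtering with a logical vector keeps all rows when nothing is flagged, whereas `x[-which(...)]` returns an empty result in that case.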
6. Data analysis: missing values
```{r}
# TRUE for every column that contains no missing values
sapply(loan.data, function(x) all(!is.na(x)))
# Work on a copy and inject 10 missing values for demonstration
temp = loan.data
set.seed(10)
temp$year_emp[sample(1:nrow(loan.data), 10)] = NA
summary(temp$year_emp)
na.index = which(is.na(temp$year_emp))
# Option 1: delete the rows with missing values
data.nona1 = temp[-na.index, ]
# Option 2: replace missing values with the median
data.nona2 = temp
data.nona2$year_emp[na.index] = median(temp$year_emp, na.rm = TRUE)
# Option 3: keep them, by binning into categories with an explicit "Missing" level
temp$year_emp_cat <- rep(NA, length(temp$year_emp))
temp$year_emp_cat[which(temp$year_emp <= 5)] <- "0-5"
temp$year_emp_cat[which(temp$year_emp > 5 & temp$year_emp <= 10)] <- "5-10"
temp$year_emp_cat[which(temp$year_emp > 10 & temp$year_emp <= 15)] <- "10-15"
temp$year_emp_cat[which(temp$year_emp > 15 & temp$year_emp <= 20)] <- "15-20"
temp$year_emp_cat[which(temp$year_emp > 20)] <- "20+"
temp$year_emp_cat[which(is.na(temp$year_emp))] <- "Missing"
temp$year_emp_cat <- factor(temp$year_emp_cat, levels = c("0-5","5-10","10-15","15-20","20+","Missing"))
plot(temp$year_emp_cat)
```
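The step-by-step binning of `year_emp` can also be written more compactly with base R's `cut()`. The sketch below is one possible alternative, shown on a small made-up vector (`years` is hypothetical, not a column of `loan.data`). It relies on `cut()`'s default right-closed intervals, which match the `<=` comparisons used for the bins here.

```r
# Toy data (hypothetical): years with current employer, one value missing
years <- c(3, 5, 7, NA, 25)

# Right-closed bins: (-Inf, 5], (5, 10], (10, 15], (15, 20], (20, Inf)
years_cat <- cut(years,
                 breaks = c(-Inf, 5, 10, 15, 20, Inf),
                 labels = c("0-5", "5-10", "10-15", "15-20", "20+"))

# Add an explicit "Missing" level and assign it to the NA entries
levels(years_cat) <- c(levels(years_cat), "Missing")
years_cat[is.na(years_cat)] <- "Missing"

print(as.character(years_cat))
# "0-5" "0-5" "5-10" "Missing" "20+"
```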
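- Handling methods (these also apply to outliers):
  - Delete rows or columns (applies to both continuous and categorical data)
  - Replace (there are four main approaches, e.g. `median` for continuous data and the most frequent category for categorical data; a later article will cover these in detail)
  - Keep (bin continuous data into categories; for categorical data, create an `NA` category)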
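The code in this section shows median replacement for a continuous variable. For the categorical case, replacing with the most frequent category might look like the sketch below. The factor `marital` is made-up toy data; the actual column names in `loan.data` are not shown in this article, so treat the names here as hypothetical.

```r
# Toy data (hypothetical): marital status with two missing entries
marital <- factor(c("married", "single", NA, "married", "divorced", NA))

# table() ignores NA by default; which.max() picks the most frequent level
mode_level <- names(which.max(table(marital)))

# Replace the missing entries with that level
marital_imputed <- marital
marital_imputed[is.na(marital_imputed)] <- mode_level

print(table(marital_imputed))
```

If two levels are tied, `which.max()` returns the first one, so ties may need separate handling.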
7. Model evaluation: training/test sets and the confusion matrix
- Accuracy = (TN+TP)/(TN+TP+FP+FN)
- Sensitivity = TP/(TP+FN)
- Specificity = TN/(TN+FP)
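```{r}
set.seed(1)
# Randomly assign two-thirds of the rows to the training set; the rest form the test set
train.index <- sample(1:nrow(loan.data), 2/3 * nrow(loan.data))
training <- loan.data[train.index, ]
test <- loan.data[-train.index, ]
# Confusion matrix, once a model has produced predictions (model_pred is not defined in this section)
#table(test$Loan, model_pred)
```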
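Since no model is fitted in this section, the sketch below uses two made-up vectors to show how accuracy, sensitivity, and specificity follow from the confusion matrix counts. `actual` and `predicted` are hypothetical labels (1 = default, 0 = no default), not output from `loan.data`.

```r
# Toy labels (hypothetical): 1 = default, 0 = no default
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

# Confusion matrix: rows are actual classes, columns are predicted classes
print(table(actual, predicted))

# Cell counts
tp <- sum(actual == 1 & predicted == 1)   # true positives
tn <- sum(actual == 0 & predicted == 0)   # true negatives
fp <- sum(actual == 0 & predicted == 1)   # false positives
fn <- sum(actual == 1 & predicted == 0)   # false negatives

accuracy    <- (tn + tp) / (tn + tp + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

With these toy labels the counts are 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, giving an accuracy of 0.8 and a sensitivity of 0.75.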
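Author and instructor at 数据人网: Li Yue (李悦)
Master's graduate of New York University, majoring in financial media; data analyst at a sell-side investment research firm in New York; Chartered Financial Analyst (CFA).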