A Summary of Common Methods for Handling Missing Values

Source: 数据人网

Address: http://www.shujuren.org/article/117.html

title: "A Summary of Common Methods for Handling Missing Values"

author: "Cynthia Li, CFA"

date: "May 6, 2016"

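Preface

Real-world data is messy: missing values and data-entry errors are commonplace in collected data, so learning how to handle these problems is a skill every data practitioner must master. As the saying goes, even the cleverest cook cannot prepare a meal without rice. Handling raw data poorly causes trouble for later modeling and can even introduce unnecessary bias and errors; every data scientist knows the phrase "garbage in, garbage out". Today we cover a required course for any competent data practitioner: the common ways of handling missing values.

Data

We continue with the loan dataset used in the previous installment (the contents of the dataset are described in the next article).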

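library(foreign)  # read.spss()
library(gmodels)
library(mice)     # md.pattern(), mice(), complete(), pool()
library(VIM)      # aggr(), marginplot()
library(Hmisc)    # impute()
library(DMwR)     # knnImputation(), regr.eval(); DMwR has since been archived on CRAN, and DMwR2 provides both functions
# Read the SPSS file into a data frame
loan.data = read.spss("Loan_ROC.sav", to.data.frame=TRUE)
levels(loan.data$education)
# Re-create the factor to drop unused levels
loan.data$education=factor(loan.data$education)
levels(loan.data$education)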

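Introducing missing values

set.seed(1)  # make the random choice of rows reproducible
data.copy = loan.data  # keep loan.data intact as the ground truth
# Set 10 randomly chosen rows to NA in a numeric and in a factor variable
data.copy$debt_income[sample(1:nrow(data.copy), 10)] <- NA
data.copy$education[sample(1:nrow(data.copy), 10)] <- NA
levels(data.copy$education)
# Drop any level that no longer occurs after introducing the NAs
data.copy$education=factor(data.copy$education)
levels(data.copy$education)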

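The mice package: inspecting the distribution of missing values

Note the difference between continuous and categorical data
Two types of missing values
- MCAR: missing completely at random (the desirable scenario; as a rule of thumb, up to about 5% missing is considered safe for a large dataset)
- MNAR: missing not at random (it is wise and worthwhile to check the data gathering process)
Checking rows and columns (with a custom function)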

summary(data.copy)
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(data.copy, 2, pMiss)
apply(data.copy, 1, pMiss)
md.pattern(data.copy)

r names(md.pattern(data.copy)[,1][1]) complete observations
r names(md.pattern(data.copy)[,1][2]) observations are missing only r names(md.pattern(data.copy)[1,][12])
r names(md.pattern(data.copy)[,1][3]) observations are missing only r names(md.pattern(data.copy)[1,][13])
r names(md.pattern(data.copy)[,1][4]) observations are missing both of these variables
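The row and column check above can also be written without apply(), which coerces a data frame to a matrix before looping. The following is a minimal base-R sketch on a made-up data frame (`toy` and its columns are hypothetical, not part of the loan data):

```r
# Hypothetical toy data with a numeric, a character and a logical column
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, "x", "y"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() on a data frame returns a logical matrix, so the share of
# missing values per column or per row is simply a mean of TRUE/FALSE
col_pct <- colMeans(is.na(toy)) * 100
row_pct <- rowMeans(is.na(toy)) * 100

# Number of rows with no missing value at all
n_complete <- sum(complete.cases(toy))

print(col_pct)
print(row_pct)
print(n_complete)
```

On the real data, the same calls would take data.copy in place of toy.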

The VIM package plus box plots: visualizing missing values

aggr_plot <- aggr(data.copy, col=c('blue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data.copy), cex.axis=.7, gap=3, ylab=c("Missing data histogram","Missing value pattern"))
marginplot(data.copy[,c(5,2)], main='Missing value box plot\n Constrain: 2 variables')

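Red box plot for debt_income: the distribution of debt_income among rows where education is missing
Blue box plot for debt_income: the distribution of debt_income among the remaining rows, where education is observed
Ideally, for each of the two variables the red and blue box plots show similar distributions (keep the different data types in mind)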
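The comparison that the margin plot draws can also be approximated numerically. The sketch below is only an illustration on simulated data (the `toy` data frame, its values and the choice of the median are assumptions, not the loan data): it splits one variable by whether the other is missing and compares the two groups.

```r
set.seed(42)

# Hypothetical toy data standing in for the two plotted variables
toy <- data.frame(
  debt_income = rnorm(200, mean = 10, sd = 3),
  education   = factor(sample(c("a", "b", "c"), 200, replace = TRUE))
)
toy$education[sample(nrow(toy), 20)] <- NA  # 20 values missing at random

# Split debt_income by whether education is missing (TRUE) or observed (FALSE)
edu_missing   <- is.na(toy$education)
group_sizes   <- table(edu_missing)
group_medians <- tapply(toy$debt_income, edu_missing, median)

print(group_sizes)
print(group_medians)
```

If the two medians (or full summaries) differ markedly, the missingness in education may be related to debt_income, which argues against treating it as MCAR.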

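1. Handling method: deleting rows/columns (na.action=na.omit)

When it applies:

1. The dataset is large enough (the model doesn't lose power)
1. Deletion does not introduce bias (no class ends up disproportionately represented or unrepresented)
1. Deleting columns: only for variables that carry little predictive value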
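As a minimal sketch of both kinds of deletion in base R, assuming a made-up data frame and an arbitrary 50% cut-off for dropping a column (both are illustrative choices, not recommendations from the article):

```r
# Hypothetical toy data: one column is missing in most rows
toy <- data.frame(
  age       = c(25, 40, NA, 31, 52),
  income    = c(30, NA, 45, 28, 80),
  mostly_na = c(NA, NA, NA, 1, NA)
)

# Column deletion: drop columns whose share of missing values exceeds the cut-off
col_missing <- colMeans(is.na(toy))
toy_cols <- toy[, col_missing <= 0.5, drop = FALSE]

# Row deletion: keep only rows that are complete in the remaining columns
toy_rows <- na.omit(toy_cols)

print(toy_rows)
```

Dropping the sparse column first keeps three of the five rows; calling na.omit() on the original toy data would have removed every row.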

2. Handling method: replacement (mean / median / mode)

When it applies:

1. A very precise estimate is not required
1. The variable has low variation, or low leverage over the response variable
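Median and mode replacement can be sketched in base R as follows. The helper names impute_median and impute_mode are hypothetical, and the toy values are made up to show why the median can be preferable to the mean when an outlier is present:

```r
# Hypothetical toy data: a numeric column with an outlier and a factor column
toy <- data.frame(
  debt_income = c(5, 7, NA, 9, 100),
  education   = factor(c("college", NA, "college", "high school", "college"))
)

# Replace NAs in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Replace NAs in a factor with its most frequent observed level
impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

toy_imputed <- data.frame(
  debt_income = impute_median(toy$debt_income),
  education   = impute_mode(toy$education)
)
print(toy_imputed)
```

Here the median fills in 8, whereas the mean of the observed values would be 30.25 because of the single value of 100.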

impute(data.copy$debt_income, mean) # replace with mean, indicated by star suffix
impute(data.copy$debt_income, 20) # replace with specific number
# data.copy$debt_income[is.na(data.copy$debt_income)] <- mean(data.copy$debt_income, na.rm = TRUE) # impute manually

Computing accuracy: mean replacement

# True values (from the untouched loan.data) at the positions that were set to NA
actuals <- loan.data$debt_income[is.na(data.copy$debt_income)]
# The mean of the observed values, repeated once per missing position
predicteds <- rep(mean(data.copy$debt_income, na.rm=TRUE), length(actuals))
regr.eval(actuals, predicteds)
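For readers without DMwR installed, the error measures that regr.eval() reports by default (mae, mse, rmse, mape) can be sketched in base R. The function name regression_errors and the numbers below are hypothetical; this is an illustration of the usual definitions, not the package's source code:

```r
# Hypothetical helper computing four common regression error measures
regression_errors <- function(actual, predicted) {
  err <- actual - predicted
  c(
    mae  = mean(abs(err)),          # mean absolute error
    mse  = mean(err^2),             # mean squared error
    rmse = sqrt(mean(err^2)),       # root mean squared error
    mape = mean(abs(err / actual))  # mean absolute percentage error (as a fraction)
  )
}

# Made-up true and imputed values
errs <- regression_errors(actual = c(10, 20, 40), predicted = c(12, 18, 40))
print(errs)
```

Note that mape divides by the true values, so it is undefined when any of them is zero.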

3. Handling method: prediction (the fancier methods: kNN, rpart, and mice)

3.1. kNN

# Method: identifies the k closest observations based on Euclidean distance and computes the weighted average (weighted by distance) of these k observations.
# Advantage: imputes all missing values in all variables with a single function call. It takes the whole data frame as its argument, so there is no need to specify which variable to impute.
# Caution: do not include the response variable.

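# Note: data.copy still contains the Loan column, which the mice() calls in section 3.3 leave out;
# to exclude it here as well, pass data.copy[, !names(data.copy) %in% "Loan"] instead.
knnOutput <- knnImputation(data.copy)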
anyNA(knnOutput)
actuals <- loan.data$debt_income[is.na(data.copy$debt_income)]
predicteds <- knnOutput[is.na(data.copy$debt_income), "debt_income"]
regr.eval(actuals, predicteds)

The mean absolute percentage error (mape) has improved by ~ 65% compared to the imputation by mean.
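The idea behind kNN imputation can be illustrated by hand for a single missing value. This is a minimal sketch under stated assumptions, not what knnImputation() does internally: the toy data are made up, the predictors are not standardized, and inverse-distance weights are an arbitrary illustrative choice.

```r
# Hypothetical toy data: y is missing in the last row
toy <- data.frame(
  x1 = c(1, 2, 3, 10, 2.5),
  x2 = c(1, 2, 3, 10, 2.5),
  y  = c(10, 20, 30, 100, NA)
)
k <- 2

target <- toy[is.na(toy$y), ]   # the row to impute
donors <- toy[!is.na(toy$y), ]  # rows with y observed

# Euclidean distance from the target row to every donor row
d <- sqrt((donors$x1 - target$x1)^2 + (donors$x2 - target$x2)^2)

# Pick the k nearest donors and average their y values,
# weighting closer donors more heavily (inverse-distance weights)
nearest <- order(d)[1:k]
w <- 1 / d[nearest]
imputed <- sum(w * donors$y[nearest]) / sum(w)

print(imputed)
```

The two nearest donors are the rows with y = 20 and y = 30, both at the same distance, so the imputed value is their plain average, 25.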

3.2. rpart

# Advantage: compared with kNN, rpart and mice can be used for categorical data; in addition, rpart only needs one predictor variable to be non-NA.
# Caution: do not include the response variable.
# `method=class` for factor variable
# `method=anova` for numeric variable

library(rpart)
class_mod <- rpart(education ~ ., data=data.copy[!is.na(data.copy$education), ], method="class", na.action=na.omit)
anova_mod <- rpart(debt_income ~ ., data=data.copy[!is.na(data.copy$debt_income), ], method="anova", na.action=na.omit)
education_pred <- predict(class_mod, data.copy[is.na(data.copy$education), ],type='class')
debt_income_pred <- predict(anova_mod, data.copy[is.na(data.copy$debt_income), ])
actuals <- loan.data$debt_income[is.na(data.copy$debt_income)] # debt_income accuracy
predicteds <- debt_income_pred
regr.eval(actuals, predicteds)
actuals <- loan.data$education[is.na(data.copy$education)]
mean(actuals != education_pred) # misclass error for education accuracy

The accuracy for debt_income is slightly worse than with kNN but still much better than with imputation by mean.

3.3. mice

# Method: two steps. mice() builds multiple models; complete() generates one or several completed datasets (the default is the first).

miceMod <- mice(data.copy[, !names(data.copy) %in% "Loan"], method="rf") # based on random forests
miceOutput <- complete(miceMod) # generate completed data.
anyNA(miceOutput)
actuals <- loan.data$debt_income[is.na(data.copy$debt_income)]
predicteds <- miceOutput[is.na(data.copy$debt_income), "debt_income"]
regr.eval(actuals, predicteds) # debt_income accuracy
actuals <- loan.data$education[is.na(data.copy$education)]
predicteds <- miceOutput[is.na(data.copy$education), "education"]
mean(actuals != predicteds) # misclass error education accuracy

The mean absolute percentage error (mape) has improved by ~ 62% compared to the imputation by mean.
The misclassification error is 60%, the same as with rpart, which may be attributed to imbalanced classes and the small sample.

miceMod <- mice(data.copy[, !names(data.copy) %in% "Loan"], method="pmm") # based on predictive mean matching
miceOutput <- complete(miceMod) # generate completed data.
anyNA(miceOutput)
actuals <- loan.data$debt_income[is.na(data.copy$debt_income)]
predicteds <- miceOutput[is.na(data.copy$debt_income), "debt_income"]
regr.eval(actuals, predicteds) # debt_income accuracy
actuals <- loan.data$education[is.na(data.copy$education)]
predicteds <- miceOutput[is.na(data.copy$education), "education"]
mean(actuals != predicteds) # misclass error education accuracy

The mean absolute percentage error (mape) has improved by ~ 61% compared to the imputation by mean.
However, the misclassification error is 50%, better than with the previous two methods.

4. Visual diagnostics: comparing the original data with the imputed values

4.1. scatterplot

xyplot(miceMod,debt_income ~ age + year_emp + income, pch=18,cex=0.8)

Desirable scenario: the magenta points (imputed) match the shape of the blue points (observed). A matching shape suggests that the imputed values are indeed "plausible values".

4.2. density plot

densityplot(miceMod)

Magenta - density of imputed data for each imputed dataset
Blue - density of observed data
Desirable: similar distributions

4.3. strip plot

stripplot(miceMod, pch = 20, cex = 1.2)

Strip plot (dot plot)

5. Modeling on the imputed data: pooling

modelFit1 <- with(miceMod,lm(debt_income ~ age + year_emp + income))
summary(pool(modelFit1))

fmi: fraction of missing information
lambda: proportion of total variance that is attributable to missing data
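To make lambda concrete, here is a minimal sketch of the standard pooling rules (Rubin's rules) for a single coefficient. The estimates and squared standard errors are made-up numbers, and the sketch does not reproduce the degrees-of-freedom or fmi calculations that pool() also reports:

```r
# Hypothetical results for one coefficient from m = 3 imputed datasets
est <- c(1.0, 1.2, 0.8)     # point estimates
se2 <- c(0.04, 0.05, 0.03)  # squared standard errors
m   <- length(est)

pooled_est  <- mean(est)  # pooled point estimate
within_var  <- mean(se2)  # average within-imputation variance
between_var <- var(est)   # variance of the estimates across imputations

# Total variance: within + between, with a correction for finite m
total_var <- within_var + (1 + 1 / m) * between_var

# lambda: share of the total variance that is due to the missing data
lambda <- (1 + 1 / m) * between_var / total_var

print(c(pooled_est = pooled_est, total_var = total_var, lambda = lambda))
```

With these numbers a little over half of the total variance (4/7) comes from disagreement between the imputed datasets, which would signal that the missing data matter a great deal for this coefficient.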

6. Remarks (additional notes): handling random seed initialization

The mice() function is initialized with a specific seed, so the results depend on that initial choice. To reduce this effect, we impute a larger number of datasets by changing the default m=5 parameter of mice().
tempData2 <- mice(data.copy[, !names(data.copy) %in% "Loan"], method='pmm',m=50, maxit=10, seed=0)
modelFit2 <- with(tempData2,lm(debt_income ~ age + year_emp + income))
summary(pool(modelFit2))