A collection of clustering methods in R

Distances and similarity coefficients

In R, distances are computed with dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2), where x is a sample matrix or data frame and method selects which distance to compute. The possible values of method are:

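euclidean: Euclidean distance, the square root of the sum of squared differences.
maximum: Chebyshev distance, the largest absolute coordinate difference.
manhattan: absolute-value (city-block) distance.
canberra: Lance (Canberra) distance.
minkowski: Minkowski distance; the power p must be specified.
binary: distance for qualitative (0/1) variables.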

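Distance for qualitative variables: among the m items, let m0 be the number of 0:0 pairs, m1 the number of 1:1 pairs, and m2 the number of mismatched pairs (one value is 0, the other is 1). The distance is m2 / (m1 + m2), the share of mismatches among the items where at least one value is 1.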

When diag is TRUE the diagonal of the distance matrix is printed. When upper is TRUE the upper triangle is printed too.
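As a small illustration of the method argument, the sketch below computes several distances between two made-up 0/1 vectors and checks the binary distance against a hand count. The data and the variable names (m1, m2, manual) are assumptions chosen for this example only.

```r
# Two small 0/1 vectors, one per row (made-up data)
x <- rbind(a = c(1, 1, 0, 0, 1),
           b = c(1, 0, 0, 1, 1))

m1 <- sum(x["a", ] == 1 & x["b", ] == 1)  # positions where both are 1
m2 <- sum(x["a", ] != x["b", ])           # positions where the values differ
manual <- m2 / (m1 + m2)                  # mismatches among "at least one is 1"

d_binary <- dist(x, method = "binary")
d_euclid <- dist(x, method = "euclidean")
d_manh   <- dist(x, method = "manhattan")
d_max    <- dist(x, method = "maximum")
d_mink   <- dist(x, method = "minkowski", p = 3)

print(c(binary = as.numeric(d_binary), manual = manual))
print(dist(x, method = "manhattan", diag = TRUE, upper = TRUE))
```

The last line shows the effect of diag and upper: they change how the result is printed, not the distances themselves.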

In R, scale(x, center = TRUE, scale = TRUE) centers and standardizes the columns of a data matrix.

To center only, use scale(x, scale = FALSE).

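In R, sweep(x, MARGIN, STATS, FUN = "-", ...) applies an operation across a matrix. MARGIN = 1 operates along rows and MARGIN = 2 along columns. STATS holds the values to sweep out, and FUN is the operation, subtraction by default. Below, sweep is used to apply first the range standardization and then the range normalization to a matrix x: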

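> center <- sweep(x, 2, apply(x, 2, mean)) # range standardization: subtract each column's mean
> R <- apply(x, 2, max) - apply(x, 2, min) # range of each column: maximum minus minimum
> x_star <- sweep(center, 2, R, "/") # divide the centered columns by the vector of ranges
> center <- sweep(x, 2, apply(x, 2, min)) # range normalization: subtract each column's minimum
> R <- apply(x, 2, max) - apply(x, 2, min)
> x_star <- sweep(center, 2, R, "/") # every column now lies in [0, 1]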
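A quick way to convince yourself that these transforms do what is claimed is to run them on a real matrix and check the results. The sketch below uses the four numeric columns of the built-in iris data, which is an arbitrary choice for illustration.

```r
x <- as.matrix(iris[, 1:4])

# Standardize: every column ends up with mean 0 and standard deviation 1
z <- scale(x)
print(round(colMeans(z), 10))
print(apply(z, 2, sd))

# Centering with sweep gives the same numbers as scale(x, scale = FALSE)
centered <- sweep(x, 2, colMeans(x))
same <- all.equal(centered, scale(x, scale = FALSE), check.attributes = FALSE)

# Range normalization maps every column onto [0, 1]
rng <- apply(x, 2, max) - apply(x, 2, min)
x01 <- sweep(sweep(x, 2, apply(x, 2, min)), 2, rng, "/")
print(apply(x01, 2, range))
```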

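Sometimes the goal is to classify variables rather than samples. In that case we compute similarity coefficients between variables instead of distances. The usual choices are the cosine of the angle between two variables and the correlation coefficient.

The cosines of the angles between the columns of x can be computed in R as: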

y <- scale(x, center = FALSE, scale = TRUE)/sqrt(nrow(x)-1)
C <- t(y) %*% y

Correlation coefficients are computed with the cor function.
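The cosine recipe relies on scale(x, center = FALSE, scale = TRUE) dividing each column by its root mean square, sqrt(sum(x^2)/(n-1)). The sketch below checks the result against the textbook cosine formula and shows that the cosine of centered columns is the correlation coefficient. The iris columns are used only as sample data.

```r
x <- as.matrix(iris[, 1:4])

# Cosine matrix via scale(), as in the text
y <- scale(x, center = FALSE, scale = TRUE) / sqrt(nrow(x) - 1)
C <- t(y) %*% y

# Direct cosine between the first two columns
a <- x[, 1]
b <- x[, 2]
direct <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Cosine of the centered columns reproduces cor(x)
xc <- scale(x, scale = FALSE)
yc <- sweep(xc, 2, sqrt(colSums(xc^2)), "/")
R  <- t(yc) %*% yc

print(round(C, 3))
print(round(R - cor(x), 10))
```

To cluster variables with hclust, a similarity matrix such as C or cor(x) is usually turned into a dissimilarity first, for example with as.dist(1 - C).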

Hierarchical clustering

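Hierarchical clustering first computes the distances between samples and merges the closest points into one cluster. It then computes the distances between clusters and merges the two closest clusters into a larger one, and it keeps merging until everything belongs to a single cluster. The distance between clusters can be defined in several ways: shortest distance, longest distance, median distance, group average, and others. The shortest-distance method, for example, defines the distance between two clusters as the smallest distance between a sample in one and a sample in the other. In R, hierarchical clustering is done with hclust(d, method = "complete", members = NULL).

Here d is the distance matrix.

method is the rule used to merge clusters, one of: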

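single: shortest-distance method
complete: longest-distance method
median: median-distance method
mcquitty: McQuitty's similarity method
average: group-average method
centroid: centroid method
ward.D, ward.D2: Ward's sum-of-squared-deviations method (written "ward" in R versions before 3.1.0)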

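> x <- c(1,2,6,8,11) # a quick trial
> dim(x) <- c(5,1)
> d <- dist(x)
> hc1 <- hclust(d,"single")
> plot(hc1)
> plot(hc1,hang=-1) # with hang < 0 the labels are aligned at the bottom of the plot
> plot(as.dendrogram(hc1),type="triangle")
# type = c("rectangle", "triangle") belongs to the dendrogram plot method; the default tree is rectangular, the alternative is triangular.
# horiz = TRUE draws the dendrogram horizontally, FALSE (the default) draws it upright.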
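To see how the choice of method changes the tree, the merge heights for the same five points can be compared side by side. This is a small sketch; only three of the methods are shown, and the variable names are chosen for this example.

```r
x <- matrix(c(1, 2, 6, 8, 11), ncol = 1)
d <- dist(x)

methods <- c("single", "complete", "average")
# One column of merge heights per linkage method
heights <- sapply(methods, function(m) hclust(d, method = m)$height)
print(heights)
```

All three methods merge the same pairs first (1 with 2, then 6 with 8), but the later merges happen at larger heights under complete and average linkage, because those methods measure cluster distance by the farthest pair or the mean over all pairs rather than the closest pair.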

> z <- scan()
1: 1.000 0.846 0.805 0.859 0.473 0.398 0.301 0.382
9: 0.846 1.000 0.881 0.826 0.376 0.326 0.277 0.415
17: 0.805 0.881 1.000 0.801 0.380 0.319 0.237 0.345
25: 0.859 0.826 0.801 1.000 0.436 0.329 0.327 0.365
33: 0.473 0.376 0.380 0.436 1.000 0.762 0.730 0.629
41: 0.398 0.326 0.319 0.329 0.762 1.000 0.583 0.577
49: 0.301 0.277 0.237 0.327 0.730 0.583 1.000 0.539
57: 0.382 0.415 0.345 0.365 0.629 0.577 0.539 1.000
65:
Read 64 items
> names <- c("shengao", "shoubi", "shangzhi", "xiazhi", "tizhong",
+ "jingwei", "xiongwei", "xiongkuang")
> r <- matrix(z, nrow=8, dimnames=list(names, names))
> d <- as.dist(1-r)
> hc <- hclust(d)
> plot(hc)

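rect.hclust(tree, k = NULL, which = NULL, x = NULL, h = NULL, border = 2, cluster = NULL) can then be used to mark the clusters on the plot. tree is the object returned by hclust, k is the number of clusters, and h is the height (the between-cluster distance) at which the tree is cut. border is the color of the rectangles drawn around the clusters.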

> plot(hc)
> rect.hclust(hc, k=2)
> rect.hclust(hc, h=0.5)

result <- cutree(hc, k=3) extracts the cluster that each sample belongs to.
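cutree accepts either a number of clusters k or a cut height h. A minimal sketch on the five-point example from earlier, with variable names chosen for this illustration:

```r
x <- matrix(c(1, 2, 6, 8, 11), ncol = 1)
hc <- hclust(dist(x), method = "single")

by_k <- cutree(hc, k = 2)    # ask for exactly two clusters
by_h <- cutree(hc, h = 2.5)  # keep only merges made at height <= 2.5

print(by_k)  # 1 1 2 2 2
print(by_h)  # 1 1 2 2 3
print(split(x[, 1], by_k))   # the original values grouped by cluster
```

Cutting by height leaves 11 on its own, because under single linkage it joins the pair (6, 8) at height 3, which is above the cut.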

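Dynamic clustering: kmeans

In hierarchical clustering a cluster never changes once it has been formed, and the method needs more memory when the data set is large.

Dynamic clustering starts by picking a few points and gathering the surrounding points around them. It then computes the centroid (mean) of each cluster, uses the results as the new classification points, and repeats until the assignment converges. In R this is mainly done with kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")). centers is either the number of clusters or a set of initial cluster centers. iter.max is the maximum number of iterations. nstart is the number of random starting sets to try when centers is a number. algorithm selects the algorithm; the default is the first one.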
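The loop described above (assign each point to the nearest center, recompute the centers, repeat until nothing changes) can be written out by hand. The sketch below is a plain Lloyd-style iteration for illustration, not the code behind kmeans. The function name lloyd and its arguments are made up for this example, and it assumes a numeric matrix and k >= 2.

```r
# Hypothetical helper: a bare-bones Lloyd iteration (assumes k >= 2)
lloyd <- function(x, k, iter.max = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # k random rows as initial centers
  cluster <- integer(nrow(x))
  for (i in seq_len(iter.max)) {
    # n x k matrix of squared distances from every point to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")  # index of the nearest center
    if (all(new_cluster == cluster)) break              # assignments stopped changing
    cluster <- new_cluster
    for (j in seq_len(k)) {
      members <- cluster == j
      # recompute the center as the mean of its members (left alone if empty)
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers, iter = i)
}

set.seed(1)
x <- scale(iris[, 1:4])
fit <- lloyd(x, k = 3)
print(table(fit$cluster))
print(fit$iter)
```

When the loop stops, every center is the mean of the points assigned to it, which is the fixed point the text calls convergence. The result still depends on the random initial centers.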

> newiris <- iris
> model <- kmeans(scale(newiris[1:4]), 3)
> model
K-means clustering with 3 clusters of sizes 50, 47, 53
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1  -1.01119138  0.85041372   -1.3006301  -1.2507035
2   1.13217737  0.08812645    0.9928284   1.0141287
3  -0.05005221 -0.88042696    0.3465767   0.2805873
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3
[75] 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
[112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
[149] 2 3
Within cluster sum of squares by cluster:
[1] 47.35062 47.45019 44.08754
(between_SS / total_SS = 76.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
> table(iris$Species, model$cluster) # compare with the true species
              1  2  3
  setosa     50  0  0
  versicolor  0 11 39
  virginica   0 36 14
> plot(newiris[c("Sepal.Length", "Sepal.Width")], col=model$cluster) # plot the clusters
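Because kmeans starts from random centers, a session like the one above can give different cluster numbers, and sometimes a different partition, on each run. A sketch that makes the run repeatable and less sensitive to a bad start; the seed value and nstart = 25 are arbitrary choices for this example.

```r
set.seed(42)                        # fix the random starts so the run is repeatable
x <- scale(iris[, 1:4])

# Try 25 random starts and keep the one with the smallest total within-cluster SS
fit <- kmeans(x, centers = 3, iter.max = 50, nstart = 25)

print(fit$size)
print(fit$tot.withinss + fit$betweenss)  # equals fit$totss
print(table(iris$Species, fit$cluster))
```

The identity tot.withinss + betweenss = totss is what the printed ratio between_SS / total_SS is based on.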

DBSCAN

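Dynamic clustering tends to produce clusters that are roughly round or elliptical. Density-based scanning algorithms address this problem. The idea is to fix a distance radius and a minimum number of points, then link together all the points that can reach one another and treat them as one cluster. The implementation in R (in the fpc package) is

dbscan(data, eps, MinPts, scale, method, seeds, showplot, countmode)

eps is the distance radius and MinPts is the minimum number of points. scale controls whether the data are standardized. method takes one of three values, raw, dist, or hybrid, meaning respectively: the data are raw data and no distance matrix is computed; the data are already a distance matrix; the data are raw data but partial distance matrices are computed. showplot controls plotting: 0 draws nothing, 1 and 2 both draw plots. countmode can be given a vector and is used to report the progress of the computation. Let's try it on the iris data: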

> install.packages("fpc", dependencies=TRUE)
> library(fpc)
> newiris <- iris[1:4]
> model <- dbscan(newiris, 1.5, 5, scale=TRUE, showplot=TRUE, method="raw") # the plot is clearly wrong, so make the radius smaller
> model <- dbscan(newiris, 0.5, 5, scale=TRUE, showplot=TRUE, method="raw")
> model # still not ideal...
dbscan Pts=150 MinPts=5 eps=0.5
        0  1  2
border 34  5 18
seed    0 40 53
total  34 45 71
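The density idea itself (a radius, a minimum count, and chains of reachable points) can be sketched in base R without fpc. This is an illustration only, not how fpc implements dbscan. The function name dbscan_sketch is made up, and the sketch assumes that a point counts itself among its neighbors and that a border point joins the cluster of its nearest core point. It reuses the fact that core points connected by steps of at most eps are exactly the single-linkage groups cut at height eps.

```r
# Hypothetical helper: a base-R sketch of density-based clustering
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # core point: at least min_pts points (itself included) within radius eps
  is_core <- rowSums(d <= eps) >= min_pts
  core_idx <- which(is_core)
  cluster <- integer(n)  # 0 means noise
  if (length(core_idx) == 1) {
    cluster[core_idx] <- 1L
  } else if (length(core_idx) > 1) {
    # chains of core points with steps <= eps = single linkage cut at eps
    hc <- hclust(as.dist(d[core_idx, core_idx]), method = "single")
    cluster[core_idx] <- cutree(hc, h = eps)
  }
  # border point: not core, but within eps of some core point
  for (i in which(!is_core)) {
    if (length(core_idx) == 0) break
    j <- core_idx[which.min(d[i, core_idx])]
    if (d[i, j] <= eps) cluster[i] <- cluster[j]
  }
  cluster
}

# Two tight groups and one far-away outlier (made-up data)
x <- rbind(cbind(c(0, 0.1, 0.2, 0.3), 0),
           cbind(c(5, 5.1, 5.2, 5.3), 0),
           c(20, 0))
res <- dbscan_sketch(x, eps = 0.15, min_pts = 3)
print(res)  # 1 1 1 1 2 2 2 2 0
```

In this toy data the two end points of each group are border points: they have only two neighbors within 0.15, but they sit next to a core point and so inherit its cluster. The outlier stays at 0, which corresponds to the 0 column in the dbscan output above.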