Learning the R tidyverse Packages II: dplyr

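dplyr is the data-manipulation package of the tidyverse; its main job is cleaning and reshaping data.
Its main features include row selection, column selection, summary statistics, window functions, joins and set operations on data frames, and more. It is an essential tool for data analysis in R.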

The pipe %>%, columns, and rows

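The first thing to learn in dplyr is the pipe operator %>% (it comes from the magrittr package and is re-exported by dplyr). It works like the pipe in a Linux shell: the output of the previous function call becomes the input of the next one.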

x %>% f(y) is equivalent to x %>% f(., y), which is equivalent to f(x, y)
For example:
head(iris,10)
iris %>% head(.,10)
iris %>% head(10)
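The benefit of the pipe shows most clearly when several steps are chained. The following is a minimal sketch on the built-in iris data; the variable names nested and piped are only illustrative, and filter() and arrange() are covered later in this post.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- head(arrange(filter(iris, Species == "setosa"), desc(Sepal.Length)), 3)

# Piped form: reads top to bottom, in the order the steps run
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# Both forms run the same calls, so the results are the same
identical(nested, piped)
```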

Rows and columns, the objects being operated on:
Each variable is in its own column
Each observation, or case, is in its own row

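tibble data frames:
A tibble is the data frame type used throughout the tidyverse. It supports tidy formats and is "lazy" in the sense that it does less than a base data frame: it does not silently change column types or names. Its biggest practical advantage is that a column of a tibble can be a list.
The tidyverse includes a dedicated tibble package, which will be covered later.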
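As a small illustration of list columns, the sketch below builds a tibble whose second column holds a different vector in each cell. The names tb, id and values are made up for this example.

```r
library(tibble)

# A list column: each cell holds a whole vector (or any other R object)
tb <- tibble(
  id = 1:2,
  values = list(1:3, c("a", "b"))
)

tb
class(tb)            # includes "tbl_df" and "data.frame"
lengths(tb$values)   # length of each list element
```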

Summarise&count

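summarise() applies summary functions to columns to create a new table of summary statistics. A summary function takes a vector as input and returns a single value.
It is most often used together with group_by().

count() counts rows: it returns the number of rows for each unique value of the specified variable(s).

summarise(mtcars, avg = mean(mpg)) # summarise: summary statistics
count(iris, Species)               # count: number of rows per Species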

mtcars %>% group_by(cyl) %>% summarise(mean = mean(disp), n = n())
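count() can be thought of as a shortcut for group_by() followed by summarise(n = n()). A minimal sketch on iris (the variable names are illustrative):

```r
library(dplyr)

by_count <- count(iris, Species)
by_summarise <- iris %>% group_by(Species) %>% summarise(n = n())

by_count
by_summarise

# The per-group counts agree
identical(by_count$n, by_summarise$n)
```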

Group Cases

group_by()

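Use group_by() to create a "grouped" copy of a table.
dplyr functions then operate on each "group" separately and combine the results.
group_by() does not change the data itself; it only adds a grouping attribute that affects the verbs applied afterwards. It is one of the most important functions in dplyr.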

by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)

# Add new groups: to add grouping variables to an already grouped data frame, use the .add argument
starwars %>% group_by(species) %>% group_by(sex, gender, .add = TRUE)
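To see that group_by() only attaches grouping metadata, one can inspect the grouped object and compare how the same summarise() call behaves with and without groups. This is a sketch on mtcars; the variable names are illustrative.

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

group_vars(grouped)                  # the grouping variable(s)
n_groups(grouped)                    # number of groups
identical(grouped$mpg, mtcars$mpg)   # the column values are untouched

# The same summarise() call: one row overall vs. one row per group
ungrouped_result <- mtcars %>% summarise(mean_mpg = mean(mpg))
grouped_result <- grouped %>% summarise(mean_mpg = mean(mpg))
```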

tally()

After grouping with group_by(), use tally() to count the rows in each group. To see the largest groups first, set sort = TRUE.

by_species %>% tally()

by_sex_gender %>% tally(sort = TRUE)

group_keys()

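Use group_keys() to inspect the groups: one row per group and one column per grouping variable.

mtcars %>% group_by(cyl,vs) %>% group_keys()
# Shows only the group keys and nothing else. tally() and count() already display the same keys, so this is rarely needed.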

ungroup()

ungroup() # removes the grouping attribute
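Forgetting to ungroup is a common source of surprises, because later verbs keep working per group. The sketch below (illustrative names, mtcars data) contrasts a mutate() on grouped data with the same call after ungroup().

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# While grouped, mean(mpg) is computed within each cyl group
within_group <- grouped %>% mutate(mean_mpg = mean(mpg))

# After ungroup(), mean(mpg) is computed over the whole table
overall <- grouped %>% ungroup() %>% mutate(mean_mpg = mean(mpg))

n_distinct(within_group$mean_mpg)  # one value per group
n_distinct(overall$mean_mpg)       # a single value
```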

rowwise()

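rowwise() is a special kind of grouping: it groups row by row, so each row of the data frame forms its own group.
The following example shows why this is useful:

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()

df %>% mutate(m = mean(c(x, y, z)))               # not what we want: one mean over all values of x, y and z
df %>% rowwise() %>% mutate(m = mean(c(x, y, z))) # computed separately for each row, as intended

The grouping added by rowwise() is also removed with ungroup().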
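The difference between the two mutate() calls becomes concrete when the resulting m column is printed. A minimal sketch using the same df (the names plain and by_row are illustrative):

```r
library(dplyr)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)

# Without rowwise(): mean() sees all six numbers at once
plain <- df %>% mutate(m = mean(c(x, y, z)))
plain$m
#> [1] 3.5 3.5

# With rowwise(): mean() sees the three numbers of one row at a time
by_row <- df %>% rowwise() %>% mutate(m = mean(c(x, y, z))) %>% ungroup()
by_row$m
#> [1] 3 4
```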

Column-wise operations

df %>% 
  group_by(g1, g2) %>% 
  summarise(a = mean(a), b = mean(b), c = mean(c), d = mean(d))
# group_by: group the data frame by g1 and g2
# summarise: compute summaries; usually used together with group_by()

df %>% 
  group_by(g1, g2) %>% 
  summarise(across(a:d, mean))
# across() applies the same function to several columns at once

across()

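across() has two main arguments:
The first argument, .cols, selects the columns to operate on. It uses tidy selection (like select()), so you can pick variables by position, name, or type.
The second argument, .fns, is a function or a list of functions to apply to each column. It can also be a purrr-style formula (or a list of formulas), such as ~ .x / 2.
(This argument is optional: omit it if you only want the underlying data; vignette("rowwise") uses this technique. In dplyr 1.1.0 and later, pick() is provided for that purpose.)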
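As a sketch of the two arguments working together, the example below halves every numeric column with a purrr-style formula inside mutate(). The small tibble and its column names are invented for this illustration.

```r
library(dplyr)

df <- tibble(g = c("a", "b"), x = c(2, 4), y = c(10, 20))

# .cols = where(is.numeric) picks x and y; .fns is the formula ~ .x / 2
halved <- df %>% mutate(across(where(is.numeric), ~ .x / 2))
halved
```

The character column g is left alone because it is not matched by where(is.numeric).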

starwars %>% 
  summarise(across(where(is.character), n_distinct))
# select columns with where(is.character)
starwars %>% 
  group_by(species) %>% 
  filter(n() > 1) %>% 
  summarise(across(c(sex, gender, homeworld), n_distinct))
# name the columns sex, gender and homeworld with c()
starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
# select columns with where(is.numeric)

Because across() is usually used together with summarise() and mutate(), it does not select grouping variables, to avoid modifying them by accident:

df <- data.frame(g = c(1, 1, 2), x = c(-1, 1, 3), y = c(-1, -4, -9))
df %>% group_by(g) %>% summarise(across(where(is.numeric), sum))
#> # A tibble: 2 × 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     1     0    -5
#> 2     2     3    -9

Using across() with custom functions

min_max <- list(
  min = ~min(.x, na.rm = TRUE), 
  max = ~max(.x, na.rm = TRUE)
)

starwars %>% summarise(across(where(is.numeric), min_max))
#> # A tibble: 1 × 6
#>   height_min height_max mass_min mass_max birth_year_min birth_year_max
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896

starwars %>% summarise(across(c(height, mass, birth_year), min_max))
#> # A tibble: 1 × 6
#>   height_min height_max mass_min mass_max birth_year_min birth_year_max
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896

Controlling the names of the new columns with the .names argument of across()

starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
#> # A tibble: 1 × 6
#>   min.height max.height min.mass max.mass min.birth_year max.birth_year
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896
starwars %>% summarise(across(c(height, mass, birth_year), min_max, .names = "{.fn}.{.col}"))
#> # A tibble: 1 × 6
#>   min.height max.height min.mass max.mass min.birth_year max.birth_year
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896
# Alternatively, call across() once per function; the output columns are then grouped by function:
starwars %>% summarise(
  across(c(height, mass, birth_year), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
  across(c(height, mass, birth_year), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
#> # A tibble: 1 × 6
#>   min_height min_mass min_birth_year max_height max_mass max_birth_year
#>        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
#> 1         66       15              8        264     1358            896

relocate()

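Change the order of columns: relocate() makes it easy to move a set of columns to a new position (by default, to the front)

mtcars %>% relocate(gear, carb) # move gear and carb to the front

mtcars %>% relocate(mpg, cyl, .after = last_col()) # move mpg and cyl to the end

Equivalent in base R: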
mtcars[union(c("gear", "carb"), names(mtcars))]

to_back <- c("mpg", "cyl")
mtcars[c(setdiff(names(mtcars), to_back), to_back)]
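Besides the front and the end, relocate() can place columns relative to a named column through .before and .after. A minimal sketch on mtcars (the variable name moved is illustrative):

```r
library(dplyr)

# Move hp so that it sits directly after cyl
moved <- mtcars %>% relocate(hp, .after = cyl)

names(moved)[1:4]
```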

pull

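Extract a single column from a data frame as a vector

mtcars %>% pull(1)
mtcars %>% pull(cyl)
mtcars %>% pull(var = -1)
# Note: the extracted data is no longer a data frame but a plain vector. A negative position counts from the right, so -1 is the last column.

Equivalent in base R: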
mtcars[[1]]
mtcars$cyl

select

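Subset columns by position, by name, by a function of the name, or by other properties

iris %>% select(1:3)                    # columns 1 to 3
iris %>% select(Species, Sepal.Length)  # the Species and Sepal.Length columns
iris %>% select(starts_with("Petal"))   # columns whose names start with "Petal"
iris %>% select(ends_with("Length"))    # columns whose names end with "Length"
iris %>% select(contains("Petal"))      # columns whose names contain "Petal"
iris %>% select(where(is.factor))       # factor columns

iris %>% select(where(is.factor),everything())  # everything() stands for all remaining columns, so the factor columns come first

Equivalent in base R: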
iris[1:3]  # single argument selects columns; never drops
iris[,1:3, drop = FALSE]

iris[c("Species", "Sepal.Length")]
subset(iris, select = c(Species, Sepal.Length))

iris[grep("^Petal", names(iris))]

mutate()

Create new columns and add them to the data frame

# Signature: mutate(.data, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL)
df <- tibble(x = sample(10, 100, replace = TRUE), y = sample(10, 100, replace = TRUE))
df %>% mutate(z = x + y, z2 = z ^ 2) # advantage: the newly created variable z can be used right away

# Equivalent in base R:
mtcars$cyl2 <- mtcars$cyl * 2
mtcars$cyl4 <- mtcars$cyl2 * 2

# Combined with grouping, the computation is done within each group
gf <- tibble(g = c(1, 1, 2, 2), x = c(0.5, 1.5, 2.5, 3.5))
gf %>% group_by(g) %>% mutate(x_mean = mean(x), x_rank = rank(x))

transmute()

Compute new columns and drop all the others

mtcars %>% transmute(gpm = 1/mpg) # the result contains only the gpm column
# Equivalent to mutate() with .keep = "none"
mtcars %>% mutate(gpm = 1/mpg, .keep = "none")
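The four values of .keep control which existing columns survive a mutate() call. The sketch below uses a small made-up tibble and compares the resulting column names.

```r
library(dplyr)

df <- tibble(x = 1:3, y = 4:6, z = 7:9)

keep_all    <- df %>% mutate(s = x + y, .keep = "all")     # x y z s
keep_used   <- df %>% mutate(s = x + y, .keep = "used")    # x y s
keep_unused <- df %>% mutate(s = x + y, .keep = "unused")  # z s
keep_none   <- df %>% mutate(s = x + y, .keep = "none")    # s

names(keep_used)
names(keep_unused)
```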

add_column()

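Add a new column (add_column() is provided by the tibble package, so load tibble or the whole tidyverse first)

mtcars %>% add_column(new = 1:32) # add a column named new with the values 1 to 32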

rename()

Rename variables: columns can be referred to by name or by position:

iris %>% rename(sepal_length = Sepal.Length, sepal_width = 2)
# Sepal.Length becomes sepal_length, and the second column becomes sepal_width

Equivalent in base R:
iris2 <- iris
names(iris2)[2] <- "sepal_width"
names(iris2)[names(iris2) == "Sepal.Length"] <- "sepal_length"

rename_with()

Rename columns with a function

iris %>% rename_with(toupper)
# all column names are converted to upper case
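rename_with() also takes a second argument, .cols, that restricts which columns are renamed, and the function can be a formula. A minimal sketch on iris (the variable names are illustrative):

```r
library(dplyr)

# Upper-case only the columns whose names start with "Petal"
petal_upper <- iris %>% rename_with(toupper, starts_with("Petal"))
names(petal_upper)

# Replace the dots in every column name with underscores
snake <- iris %>% rename_with(~ gsub(".", "_", .x, fixed = TRUE))
names(snake)
```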

Row-wise operations

filter()

Return the rows that match the given conditions

starwars %>% filter(species == "Human") # rows where species equals "Human"
starwars %>% filter(mass > 1000) # rows where mass is greater than 1000
starwars %>% filter(hair_color == "none" & eye_color == "black") # rows where hair_color equals "none" AND eye_color equals "black"
starwars %>% filter(hair_color == "none" , eye_color == "black") # same as above: conditions separated by commas are combined with AND
starwars %>% filter(hair_color == "none" | eye_color == "black") # rows where hair_color equals "none" OR eye_color equals "black"

# In base R, subset() does the same job:
subset(starwars, species == "Human")
subset(starwars, mass > 1000)
subset(starwars, hair_color == "none" & eye_color == "black")

distinct()

Remove rows with duplicated values.

# Signature: distinct(.data, ..., .keep_all = FALSE)
# .keep_all controls whether the other columns are kept.

df <- tibble(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)
df %>% distinct(x)
df %>% distinct(x, .keep_all = TRUE)  # .keep_all = TRUE keeps the other columns

# In base R, unique() and duplicated() do the same job:
unique(df["x"])
df[!duplicated(df$x), , drop = FALSE]
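Because the example above relies on random data, the effect is easier to follow on a tiny fixed table. The sketch below uses made-up values to show distinct() on one column, on two columns, and with .keep_all = TRUE.

```r
library(dplyr)

df <- tibble(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "c"),
  z = 1:4
)

one_col  <- df %>% distinct(x)                    # unique values of x
two_cols <- df %>% distinct(x, y)                 # unique combinations of x and y
first_x  <- df %>% distinct(x, .keep_all = TRUE)  # first row for each x, all columns kept

first_x
```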

sample_frac()

Sample a fraction of the rows

iris %>% sample_frac(0.5, replace = TRUE)

sample_n()

Sample a fixed number of rows

iris %>% sample_n( 10, replace = TRUE)
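Since dplyr 1.0.0, sample_n() and sample_frac() are marked as superseded in favour of slice_sample(), which covers both cases through its n and prop arguments. A minimal sketch (set.seed() is only there to make the run reproducible; the variable names are illustrative):

```r
library(dplyr)

set.seed(1)

ten_rows  <- iris %>% slice_sample(n = 10)                      # like sample_n(10)
half_rows <- iris %>% slice_sample(prop = 0.5)                  # like sample_frac(0.5)
with_repl <- iris %>% slice_sample(prop = 0.5, replace = TRUE)  # sampling with replacement

nrow(ten_rows)
nrow(half_rows)
```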

slice()

Select rows by position (take a slice of rows)

slice(mtcars, 25:n())  # rows 25 through the last row of mtcars

top_n

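Select the top n rows according to the values of a variable (the result is not sorted, and ties are kept)

mtcars %>% top_n(5, hp)        # the 5 rows with the largest hp
mtcars %>% top_n(5, desc(hp))  # the 5 rows with the smallest hp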
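top_n() is marked as superseded in current dplyr; slice_max() and slice_min() are the suggested replacements and also return the rows ordered by the chosen variable. A minimal sketch on mtcars (the variable names are illustrative):

```r
library(dplyr)

largest  <- mtcars %>% slice_max(hp, n = 5)  # rows with the highest hp, ordered descending
smallest <- mtcars %>% slice_min(hp, n = 5)  # rows with the lowest hp, ordered ascending

# The same rows as top_n(), which keeps the original row order instead
old_style <- mtcars %>% top_n(5, hp)
setequal(largest$hp, old_style$hp)
```

By default both functions keep ties, so more than n rows can come back; with_ties = FALSE returns exactly n.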

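arrange() & desc()

Sort the rows of a data frame by the values of one or more columns:

mtcars %>% arrange(cyl)                    # sort the rows of mtcars by cyl (ascending by default)
mtcars %>% arrange(cyl, disp)              # sort by cyl, then by disp (ascending by default)
mtcars %>% arrange(desc(cyl), desc(disp))  # sort by cyl and disp, both descending via desc()
mtcars %>% arrange(desc(cyl), disp)        # one variable descending, the other ascending

# In base R, order() does the same job:
mtcars[order(mtcars$cyl, mtcars$disp), , drop = FALSE]
# Note the use of drop = FALSE. If you forget it and the input is a data frame with a single column, the output will be a vector rather than a data frame, which is a source of subtle bugs.
# Base R has no convenient way to sort individual variables in descending order. There are two options:
# 1: for numeric variables, use -x.
# 2: ask order() to sort all variables in descending order.
mtcars[order(mtcars$cyl, mtcars$disp, decreasing = TRUE), , drop = FALSE]
mtcars[order(-mtcars$cyl, -mtcars$disp), , drop = FALSE]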

add_row

Add one or more rows to a table

add_row(faithful, eruptions = 1, waiting = 1)

Combine Tables（Two-table verbs）

Two data frames are usually combined with merge() in base R; dplyr uses the *_join() family of functions.

dplyr	base
inner_join(df1, df2)	merge(df1, df2)
left_join(df1, df2)	merge(df1, df2, all.x = TRUE)
right_join(df1, df2)	merge(df1, df2, all.y = TRUE)
full_join(df1, df2)	merge(df1, df2, all = TRUE)
semi_join(df1, df2)	df1[df1$x %in% df2$x, , drop = FALSE]
anti_join(df1, df2)	df1[!df1$x %in% df2$x, , drop = FALSE]

For more information about two-table verbs, see vignette("two-table")

Mutating joins

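Use mutating joins to add columns from one table to another, matching rows by their key values. Each join keeps a different combination of rows from the two tables.
There are four kinds: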

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

left_join()

df1 %>% left_join(df2)
# Left join: the left data frame is the base; matching data from the right data frame is attached to it.

right_join()

df1 %>% right_join(df2)
# Right join: the right data frame is the base; matching data from the left data frame is attached to it.

inner_join()

df1 %>% inner_join(df2)
df2 %>% inner_join(df1)
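# Both results contain the same data; only the order of rows and columns differs.
# Inner join: keeps only the observations present in both data frames, combined into one data frame.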

full_join()

df1 %>% full_join(df2)
df2 %>% full_join(df1) 
# Both results contain the same data; only the order of rows and columns differs.
# Full join: keeps all observations from both data frames, combined into one data frame.
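Counting the rows each join returns makes the differences easy to compare. The sketch below reuses df1 and df2 and spells out by = "x"; without it dplyr picks the shared column itself and prints a message. The result names are illustrative.

```r
library(dplyr)

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

left  <- left_join(df1, df2, by = "x")   # every row of df1: x = 1 and 2
right <- right_join(df1, df2, by = "x")  # every row of df2: x = 1 and 3
inner <- inner_join(df1, df2, by = "x")  # only x = 1, present in both
full  <- full_join(df1, df2, by = "x")   # x = 1, 2 and 3

sapply(list(left = left, right = right, inner = inner, full = full), nrow)
```

Rows without a match get NA in the columns that come from the other table, for example a and b for x = 2 in the left join.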

Filtering joins

Filtering joins filter the rows of one table based on the rows of another table.

x <- tibble(A=c('a','b','c'),B=c('t','u','v'),C=c('1','2','3'))
y <- tibble(A=c('a','b','c'),B=c('t','u','w'),D=c('3','2','1'))

semi_join()

x %>% semi_join(y)
# Keeps all observations in x that have a match in y.

anti_join()

x %>% anti_join(y)
# Drops all observations in x that have a match in y.
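With the x and y defined above, the shared columns are A and B, so a row of x only counts as matched when both values agree. The sketch below makes that explicit with by, and then repeats the joins on A alone for comparison (the result names are illustrative).

```r
library(dplyr)

x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c("1", "2", "3"))
y <- tibble(A = c("a", "b", "c"), B = c("t", "u", "w"), D = c("3", "2", "1"))

# Matching on A and B: the row (c, v) has no partner in y
semi_ab <- semi_join(x, y, by = c("A", "B"))   # rows a and b
anti_ab <- anti_join(x, y, by = c("A", "B"))   # row c

# Matching on A only: every row of x has a partner
semi_a <- semi_join(x, y, by = "A")            # all three rows
anti_a <- anti_join(x, y, by = "A")            # no rows

names(semi_ab)  # only the columns of x; nothing from y is added
```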

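Note: in all six *_join functions above, the by = c("col1", "col2", …) argument specifies the columns used for matching (several columns are allowed). Columns with different names in the two data frames can be matched with by = c("col1" = "col2") or, in dplyr 1.1.0 and later, by = join_by(col1 == col2).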
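A short sketch of joining on differently named key columns follows. The tables orders and customers and all their columns are invented for this example; the join_by() form assumes dplyr 1.1.0 or later.

```r
library(dplyr)

orders    <- tibble(customer_id = c(1, 2, 4), amount = c(10, 20, 30))
customers <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))

# Named character vector: left-hand column name = right-hand column name
by_vector <- orders %>% left_join(customers, by = c("customer_id" = "id"))

# join_by() expression (dplyr >= 1.1.0): note the == and the unquoted names
by_expr <- orders %>% left_join(customers, by = join_by(customer_id == id))

by_vector
```

The key keeps the name from the left table (customer_id), and the order with customer_id 4 gets NA for name because no customer matches.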

Set operations

x <- tibble(A=c("a","b","c"),B=c("t","u","v"),C=c(1,2,3))
y <- tibble(A=c("c","d"),B=c("v","w"),C=c(3,4))

intersect()

intersect(x, y) # returns only the observations found in both x and y (intersection)

union()

union(x, y)    # returns all observations from x and y, without duplicates (union)

setdiff()

setdiff(x, y) # returns the observations that are in x but not in y (set difference)
setdiff(y, x) # returns the observations that are in y but not in x (set difference)
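Putting the set operations side by side on the x and y defined above shows how many rows each one returns. union_all(), which keeps duplicates, is included for contrast; the result names are illustrative.

```r
library(dplyr)

x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c(1, 2, 3))
y <- tibble(A = c("c", "d"), B = c("v", "w"), C = c(3, 4))

common  <- intersect(x, y)   # the row (c, v, 3)
either  <- union(x, y)       # rows a, b, c, d; the shared row appears once
stacked <- union_all(x, y)   # all rows of x followed by all rows of y, duplicates kept
only_x  <- setdiff(x, y)     # rows a and b
only_y  <- setdiff(y, x)     # row d

sapply(list(common = common, either = either, stacked = stacked,
            only_x = only_x, only_y = only_y), nrow)
```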

Appendix: a dplyr quick-reference sheet

blog.biodatas.cn/wp-content/uploads/2024/07/dplyr.pdf

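This post covers only the basics. More advanced topics, such as using dplyr inside custom functions, caveats when working with bare (unquoted) column names, passing variables as arguments, and whether data moves through a function as a vector or as a data frame, are covered in the official documentation.
The official dplyr vignettes can be opened from RStudio with: browseVignettes(package = "dplyr")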