缘起
缘起还是这个长期存在的争议:R or python?昨天在cos八卦群里看到大佬们关于 pandas 和 tidy r 在数据清洗/预处理方面的讨论(这可太真实了哈哈哈!),突然想了解一下 Hadley Wickham 大佬关于 tidy data 的思想。正巧最近想要寻找实习,R 的数据结构和语句在课上基本了解了,不过数据处理这个关键部分还未涉猎,so~

这篇文章地址:
Hadley Wickham's 《Tidy data》
定义tidy data
Tidy data is a standard way of mapping the meaning of a dataset to its structure. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
在线尴尬,第一次更新R版本,没想到好多包也需要重装,磨磨唧唧一早上,哭唧唧
> library(tidyr)
Error: package or namespace load failed for ‘tidyr’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
namespace ‘rlang’ 0.3.1 is already loaded, but >= 0.4.0 is required
> R.version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On
> R.version
_
platform x86_64-apple-darwin17.0
arch x86_64
os darwin17.0
system x86_64, darwin17.0
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
常见处理操作
- dplyr包是Hadley Wickham的新作,主要用于数据清洗和整理
- tidyr包用于“tidy”数据,常跟dplyr结合使用
- 注意链式操作(管道) %>% 或 %.%,读作then,按数据处理的思路写代码,易写且易读(简直神仙R包,copss奖实至名归)
提纲:
1.Pivot data from wide to long
2.dplyr
3.Multiple variables stored in one column
2.Variables are stored in both rows and columns
1. Pivot data from wide to long
tidyr 包提供了 pivot_longer() 函数来实现。
例子1 income_religion
- 原始数据:每个变量都是收入区间取值
- 融合后:变量一列被重命名为income,值变成一列重命名为freq
raw %>%
pivot_longer(- religion, names_to = "income", values_to = "frequency")
含义:除了religion列,其余都变成income列,取值变为freq列
> library(tidyr)
> library(dplyr)
> head(raw)
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k
1 Agnostic 27 34 60 81 76 137 122 109
2 Atheist 12 27 37 52 35 70 73 59
3 Buddhist 27 21 30 34 33 58 62 39
4 Catholic 418 617 732 670 638 1116 949 792
5 Don’t know/refused 15 14 15 11 10 35 21 17
6 Evangelical Prot 575 869 1064 982 881 1486 949 723
>150k Don't know/refused
1 84 96
2 74 76
3 53 54
4 633 1489
5 18 116
6 414 1529
> raw %>%
+ pivot_longer(- religion, names_to = "income", values_to = "frequency")
# A tibble: 180 x 3
religion income frequency
<chr> <chr> <int>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# … with 170 more rows
例子2 billboard
- 原始数据2:歌曲上榜前100后的各周的排名
- 融合后:列被转换为week
> head(billboard[, 1:10])
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7
1 Destiny's Child Independent Women Part I 2000-09-23 78 63 49 33 23 15 7
2 Santana Maria, Maria 2000-02-12 15 8 6 5 2 3 2
3 Savage Garden I Knew I Loved You 1999-10-23 71 48 43 31 20 13 7
4 Madonna Music 2000-08-12 41 23 18 14 2 1 1
5 Aguilera, Christina Come On Over Baby 2000-08-05 57 47 45 29 23 18 11
6 Janet Doesn't Really Matter 2000-06-17 59 52 43 30 29 22 15
> billboard2 <- billboard %>%
+ pivot_longer(
+ wk1:wk76, # 需要变成一列的那些变量
+ names_to = "week", # 变成的列名 week
+ values_to = "rank", # 值 变成另一列 rank
+ values_drop_na = TRUE
+ )
> billboard2
# A tibble: 5,307 x 5
artist track date.entered week rank
<chr> <chr> <chr> <chr> <int>
1 Destiny's Child Independent Women Part I 2000-09-23 wk1 78
2 Destiny's Child Independent Women Part I 2000-09-23 wk2 63
3 Destiny's Child Independent Women Part I 2000-09-23 wk3 49
4 Destiny's Child Independent Women Part I 2000-09-23 wk4 33
5 Destiny's Child Independent Women Part I 2000-09-23 wk5 23
6 Destiny's Child Independent Women Part I 2000-09-23 wk6 15
7 Destiny's Child Independent Women Part I 2000-09-23 wk7 7
8 Destiny's Child Independent Women Part I 2000-09-23 wk8 5
9 Destiny's Child Independent Women Part I 2000-09-23 wk9 1
10 Destiny's Child Independent Women Part I 2000-09-23 wk10 1
# … with 5,297 more rows
2. dplyr
2.1 mutate: Create, modify, and delete columns
dplyr 包括函数 mutate {dplyr} 整理每一列数据格式,常用:sqrt/log/ifelse/as.Date/....
比如:
> data(iris)
> iris2 <- iris %>%
+ group_by(Species) %>%
+ mutate( # 会自动按照分组后的计算组均值
+ mean_slength = mean(Sepal.Length)
+ )
> table(iris2$mean_slength)
5.006 5.936 6.588
50 50 50
继续整理billboard 数据集,提取周的数字和具体日期
> billboard3 <- billboard2 %>%
+ mutate(
+ week = as.integer(gsub("wk", "", week)), # 把第xx周的数字提取出来
+ date = as.Date(date.entered) + 7 * (week - 1), # 自动计算日期
+ date.entered = NULL # 把原本 date.entered 列删掉
+ )
> billboard3
# A tibble: 5,307 x 5
artist track week rank date
<chr> <chr> <int> <int> <date>
1 Destiny's Child Independent Women Part I 1 78 2000-09-23
2 Destiny's Child Independent Women Part I 2 63 2000-09-30
3 Destiny's Child Independent Women Part I 3 49 2000-10-07
4 Destiny's Child Independent Women Part I 4 33 2000-10-14
5 Destiny's Child Independent Women Part I 5 23 2000-10-21
6 Destiny's Child Independent Women Part I 6 15 2000-10-28
7 Destiny's Child Independent Women Part I 7 7 2000-11-04
8 Destiny's Child Independent Women Part I 8 5 2000-11-11
9 Destiny's Child Independent Women Part I 9 1 2000-11-18
10 Destiny's Child Independent Women Part I 10 1 2000-11-25
# … with 5,297 more rows
2.2 arrange: Order a data frame by its colums.
对 dataframe 的排序,按照一些列的取值
继续整理billboard 数据集,按照日期和排名进行排序:
> billboard4 <- billboard3 %>%
+ arrange(date, rank)
> billboard4[3351:3360, ]
# A tibble: 10 x 5
artist track week rank date
<chr> <chr> <int> <int> <date>
1 Janet Doesn't Really Matter 11 1 2000-08-26
2 Sisqo Incomplete 10 2 2000-08-26
3 matchbox twenty Bent 18 3 2000-08-26
4 Destiny's Child Jumpin' Jumpin' 16 4 2000-08-26
5 Joe I Wanna Know 35 5 2000-08-26
6 N'Sync It's Gonna Be Me 17 6 2000-08-26
7 Ruff Endz No More 9 7 2000-08-26
8 Nine Days Absolutely 17 8 2000-08-26
9 Aaliyah Try Again 24 9 2000-08-26
10 Vertical Horizon Everything You Want 32 10 2000-08-26
3. 多个变量储存在一列
原始数据:列变量名是性别(m, f) 、年龄(0-14, 15-25,...)的组合
- 需要宽表变长表
- 需要把列变量名分割成为两部分——形成两个变量
> head(tb[, 1:10])
iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564
1 AD 1989 NA NA NA NA NA NA NA NA
2 AD 1990 NA NA NA NA NA NA NA NA
3 AD 1991 NA NA NA NA NA NA NA NA
4 AD 1992 NA NA NA NA NA NA NA NA
5 AD 1993 NA NA NA NA NA NA NA NA
6 AD 1994 NA NA NA NA NA NA NA NA
>
> # 方法一:使用 pivot_longer + separate
> ### 1.第一步,宽变长,使用pivot_longer
> tb2 <- tb %>%
+ pivot_longer(
+ !c(iso2, year),
+ names_to = "demo", # 变量——> demo列
+ values_to = "n", # 取值——> n列
+ values_drop_na = TRUE
+ )
> tb2
# A tibble: 35,750 x 4
iso2 year demo n
<chr> <int> <chr> <int>
1 AD 1996 m014 0
2 AD 1996 m1524 0
3 AD 1996 m2534 0
4 AD 1996 m3544 4
5 AD 1996 m4554 1
6 AD 1996 m5564 0
7 AD 1996 m65 0
8 AD 1996 f014 0
9 AD 1996 f1524 1
10 AD 1996 f2534 1
# … with 35,740 more rows
> ### 2.第二步,把demo列字符分割为两部分,形成两列 sex 和 age
> tb3 <- tb2 %>%
+ separate(demo, c("sex", "age"), 1)
> tb3
# A tibble: 35,750 x 5
iso2 year sex age n
<chr> <int> <chr> <chr> <int>
1 AD 1996 m 014 0
2 AD 1996 m 1524 0
3 AD 1996 m 2534 0
4 AD 1996 m 3544 4
5 AD 1996 m 4554 1
6 AD 1996 m 5564 0
7 AD 1996 m 65 0
8 AD 1996 f 014 0
9 AD 1996 f 1524 1
10 AD 1996 f 2534 1
# … with 35,740 more rows
>
>
> # 方法二:使用更复杂的 pivot_longer
> tb %>% pivot_longer(
+ !c(iso2, year),
+ names_to = c("sex", "age"),
+ names_pattern = "(.)(.+)",
+ values_to = "n",
+ values_drop_na = TRUE
+ )
# A tibble: 35,750 x 5
iso2 year sex age n
<chr> <int> <chr> <chr> <int>
1 AD 1996 m 014 0
2 AD 1996 m 1524 0
3 AD 1996 m 2534 0
4 AD 1996 m 3544 4
5 AD 1996 m 4554 1
6 AD 1996 m 5564 0
7 AD 1996 m 65 0
8 AD 1996 f 014 0
9 AD 1996 f 1524 1
10 AD 1996 f 2534 1
# … with 35,740 more rows
4. 变量既在列中存储,又在行中存储。
- 两个变量存储在行里:tmin和tmax
- 每日日期储存在列里:d1, d2,...
这里尝试用多次 %>% 把所有操作按部就班then连接在一起
> head(raw[,1:10])
id year month element d1 d2 d3 d4 d5 d6
1 MX17004 2010 1 tmax NA NA NA NA NA NA
2 MX17004 2010 1 tmin NA NA NA NA NA NA
3 MX17004 2010 2 tmax NA 27.3 24.1 NA NA NA
4 MX17004 2010 2 tmin NA 14.4 14.4 NA NA NA
5 MX17004 2010 3 tmax NA NA NA NA 32.1 NA
6 MX17004 2010 3 tmin NA NA NA NA 14.2 NA
> raw %>% # 1.首先处理列变量
+ pivot_longer(
+ d1:d31,
+ names_to = "day",
+ values_to = 'value',
+ values_drop_na = TRUE
+ ) %>% # 2.然后处理行变量
+ pivot_wider(
+ names_from = element, values_from = value
+ ) %>% # 3.提取day中的数字部分
+ mutate(
+ day = as.integer(gsub("d", "", day))
+ )
# A tibble: 33 x 6
id year month day tmax tmin
<chr> <int> <int> <int> <dbl> <dbl>
1 MX17004 2010 1 30 27.8 14.5
2 MX17004 2010 2 2 27.3 14.4
3 MX17004 2010 2 3 24.1 14.4
4 MX17004 2010 2 11 29.7 13.4
5 MX17004 2010 2 23 29.9 10.7
6 MX17004 2010 3 5 32.1 14.2
7 MX17004 2010 3 10 34.5 16.8
8 MX17004 2010 3 16 31.1 17.6
9 MX17004 2010 4 27 36.3 16.7
10 MX17004 2010 5 27 33.2 18.2
# … with 23 more rows
小总结
似乎在数据清理方面,代码可读性&简洁性都满分 ~ 其实 python 并不会比R高贵,不要抱怨为什么学校只教 r 作业也需要 r 完成,因为确实好用(如果你觉得难用,只是因为不熟悉或者不适应👏当然现阶段,面临面试笔试的我们,当然还是需要两个都掌握(听到一些同学对 R 的偏见之后的小感慨.....
最后放上李舰师兄的两句话,即使是coding初级选手,也要记得严格要求自己代码习惯呀,尤其是可读性方面:

