R | Tidy data

缘起

缘起还是这个长期存在的争议:R or python?昨天在cos八卦群里看到大佬们关于 pandas 和 tidy r 在数据清洗/预处理方面的讨论(这可太真实了哈哈哈!),突然想了解一下 Hadley Wickham 大佬关于 tidy data 的思想。正巧最近想要寻找实习,R 的数据结构和语句在课上基本了解了,不过数据处理这个关键部分还未涉猎,so~

这篇文章地址:
Hadley Wickham's 《Tidy data》

定义tidy data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

在线尴尬,第一次更新R版本,没想到好多包也需要重装,磨磨唧唧一早上,哭唧唧

> library(tidyr)
Error: package or namespace load failed for ‘tidyr’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 namespace ‘rlang’ 0.3.1 is already loaded, but >= 0.4.0 is required
> R.version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.4                         
year           2018                        
month          03                          
day            15                          
svn rev        74408                       
language       R                           
version.string R version 3.4.4 (2018-03-15)
nickname       Someone to Lean On  

> R.version
               _                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out  

常见处理操作

  • dplyr包是Hadley Wickham的新作,主要用于数据清洗和整理
  • tidyr包用于“tidy”数据,常跟dplyr结合使用
  • 注意链式操作(管道) %>% 或 %.%,读作then,按数据处理的思路写代码,易写且易读(简直神仙R包,copss奖实至名归)

提纲:
1.Pivot data from wide to long
2.dplyr
3.Multiple variables stored in one column
2.Variables are stored in both rows and columns

1. Pivot data from wide to long

tidyr 包提供了 pivot_longer() 函数来实现。

例子1 income_religion

  • 原始数据:每个变量都是收入区间取值
  • 融合后:变量一列被重命名为income,值变成一列重命名为freq

raw %>%
pivot_longer(- religion, names_to = "income", values_to = "frequency")
含义:除了religion列,其余都变成income列,取值变为freq列

> library(tidyr)
> library(dplyr)
> head(raw)
            religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k
1           Agnostic    27      34      60      81      76     137      122       109
2            Atheist    12      27      37      52      35      70       73        59
3           Buddhist    27      21      30      34      33      58       62        39
4           Catholic   418     617     732     670     638    1116      949       792
5 Don’t know/refused    15      14      15      11      10      35       21        17
6   Evangelical Prot   575     869    1064     982     881    1486      949       723
  >150k Don't know/refused
1    84                 96
2    74                 76
3    53                 54
4   633               1489
5    18                116
6   414               1529
> raw %>% 
+   pivot_longer(- religion, names_to = "income", values_to = "frequency")
# A tibble: 180 x 3
   religion income             frequency
   <chr>    <chr>                  <int>
 1 Agnostic <$10k                     27
 2 Agnostic $10-20k                   34
 3 Agnostic $20-30k                   60
 4 Agnostic $30-40k                   81
 5 Agnostic $40-50k                   76
 6 Agnostic $50-75k                  137
 7 Agnostic $75-100k                 122
 8 Agnostic $100-150k                109
 9 Agnostic >150k                     84
10 Agnostic Don't know/refused        96
# … with 170 more rows

例子2 billboard

  • 原始数据2:歌曲上榜前100后的各周的排名
  • 融合后:列被转换为week
> head(billboard[, 1:10])
               artist                    track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7
1     Destiny's Child Independent Women Part I   2000-09-23  78  63  49  33  23  15   7
2             Santana             Maria, Maria   2000-02-12  15   8   6   5   2   3   2
3       Savage Garden       I Knew I Loved You   1999-10-23  71  48  43  31  20  13   7
4             Madonna                    Music   2000-08-12  41  23  18  14   2   1   1
5 Aguilera, Christina        Come On Over Baby   2000-08-05  57  47  45  29  23  18  11
6               Janet    Doesn't Really Matter   2000-06-17  59  52  43  30  29  22  15
> billboard2 <- billboard %>% 
+   pivot_longer(
+     wk1:wk76,   # 需要变成一列的那些变量
+     names_to = "week",   # 变成的列名 week
+     values_to = "rank",   #  值 变成另一列 rank
+     values_drop_na = TRUE
+   )
> billboard2
# A tibble: 5,307 x 5
   artist          track                    date.entered week   rank
   <chr>           <chr>                    <chr>        <chr> <int>
 1 Destiny's Child Independent Women Part I 2000-09-23   wk1      78
 2 Destiny's Child Independent Women Part I 2000-09-23   wk2      63
 3 Destiny's Child Independent Women Part I 2000-09-23   wk3      49
 4 Destiny's Child Independent Women Part I 2000-09-23   wk4      33
 5 Destiny's Child Independent Women Part I 2000-09-23   wk5      23
 6 Destiny's Child Independent Women Part I 2000-09-23   wk6      15
 7 Destiny's Child Independent Women Part I 2000-09-23   wk7       7
 8 Destiny's Child Independent Women Part I 2000-09-23   wk8       5
 9 Destiny's Child Independent Women Part I 2000-09-23   wk9       1
10 Destiny's Child Independent Women Part I 2000-09-23   wk10      1
# … with 5,297 more rows

2. dplyr

2.1 mutate: Create, modify, and delete columns

dplyr 包括函数 mutate {dplyr} 整理每一列数据格式,常用:sqrt/log/ifelse/as.Date/....

比如:

> data(iris)
> iris2 <- iris %>%
+   group_by(Species) %>%
+     mutate(  # 会自动按照分组后的计算组均值
+       mean_slength = mean(Sepal.Length)
+     )
> table(iris2$mean_slength)

5.006 5.936 6.588 
   50    50    50 

继续整理billboard 数据集,提取周的数字和具体日期

> billboard3 <- billboard2 %>%
+   mutate(
+     week = as.integer(gsub("wk", "", week)),   # 把第xx周的数字提取出来
+     date = as.Date(date.entered) + 7 * (week - 1),  # 自动计算日期
+     date.entered = NULL    # 把原本 date.entered 列删掉
+   )
> billboard3
# A tibble: 5,307 x 5
   artist          track                     week  rank date      
   <chr>           <chr>                    <int> <int> <date>    
 1 Destiny's Child Independent Women Part I     1    78 2000-09-23
 2 Destiny's Child Independent Women Part I     2    63 2000-09-30
 3 Destiny's Child Independent Women Part I     3    49 2000-10-07
 4 Destiny's Child Independent Women Part I     4    33 2000-10-14
 5 Destiny's Child Independent Women Part I     5    23 2000-10-21
 6 Destiny's Child Independent Women Part I     6    15 2000-10-28
 7 Destiny's Child Independent Women Part I     7     7 2000-11-04
 8 Destiny's Child Independent Women Part I     8     5 2000-11-11
 9 Destiny's Child Independent Women Part I     9     1 2000-11-18
10 Destiny's Child Independent Women Part I    10     1 2000-11-25
# … with 5,297 more rows

2.2 arrange: Order a data frame by its colums.

对 dataframe 的排序,按照一些列的取值

继续整理billboard 数据集,按照日期和排名进行排序:

> billboard4 <- billboard3 %>% 
+   arrange(date, rank) 
> billboard4[3351:3360, ]
# A tibble: 10 x 5
   artist           track                  week  rank date      
   <chr>            <chr>                 <int> <int> <date>    
 1 Janet            Doesn't Really Matter    11     1 2000-08-26
 2 Sisqo            Incomplete               10     2 2000-08-26
 3 matchbox twenty  Bent                     18     3 2000-08-26
 4 Destiny's Child  Jumpin' Jumpin'          16     4 2000-08-26
 5 Joe              I Wanna Know             35     5 2000-08-26
 6 N'Sync           It's Gonna Be Me         17     6 2000-08-26
 7 Ruff Endz        No More                   9     7 2000-08-26
 8 Nine Days        Absolutely               17     8 2000-08-26
 9 Aaliyah          Try Again                24     9 2000-08-26
10 Vertical Horizon Everything You Want      32    10 2000-08-26

3. 多个变量储存在一列

原始数据:列变量名是性别(m, f) 、年龄(0-14, 15-25,...)的组合

  1. 需要宽表变长表
  2. 需要把列变量名分割成为两部分——形成两个变量
> head(tb[, 1:10])
  iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564
1   AD 1989  NA   NA   NA    NA    NA    NA    NA    NA
2   AD 1990  NA   NA   NA    NA    NA    NA    NA    NA
3   AD 1991  NA   NA   NA    NA    NA    NA    NA    NA
4   AD 1992  NA   NA   NA    NA    NA    NA    NA    NA
5   AD 1993  NA   NA   NA    NA    NA    NA    NA    NA
6   AD 1994  NA   NA   NA    NA    NA    NA    NA    NA
> 
> # 方法一:使用 pivot_longer + separate 
> ### 1.第一步,宽变长,使用pivot_longer
> tb2 <- tb %>% 
+   pivot_longer(
+     !c(iso2, year), 
+     names_to = "demo",    # 变量——> demo列
+     values_to = "n",      # 取值——> n列
+     values_drop_na = TRUE
+   )
> tb2
# A tibble: 35,750 x 4
   iso2   year demo      n
   <chr> <int> <chr> <int>
 1 AD     1996 m014      0
 2 AD     1996 m1524     0
 3 AD     1996 m2534     0
 4 AD     1996 m3544     4
 5 AD     1996 m4554     1
 6 AD     1996 m5564     0
 7 AD     1996 m65       0
 8 AD     1996 f014      0
 9 AD     1996 f1524     1
10 AD     1996 f2534     1
# … with 35,740 more rows
> ### 2.第二步,把demo列字符分割为两部分,形成两列 sex 和 age
> tb3 <- tb2 %>% 
+   separate(demo, c("sex", "age"), 1)
> tb3
# A tibble: 35,750 x 5
   iso2   year sex   age       n
   <chr> <int> <chr> <chr> <int>
 1 AD     1996 m     014       0
 2 AD     1996 m     1524      0
 3 AD     1996 m     2534      0
 4 AD     1996 m     3544      4
 5 AD     1996 m     4554      1
 6 AD     1996 m     5564      0
 7 AD     1996 m     65        0
 8 AD     1996 f     014       0
 9 AD     1996 f     1524      1
10 AD     1996 f     2534      1
# … with 35,740 more rows
> 
> 
> # 方法二:使用更复杂的 pivot_longer
> tb %>% pivot_longer(
+   !c(iso2, year), 
+   names_to = c("sex", "age"), 
+   names_pattern = "(.)(.+)",
+   values_to = "n", 
+   values_drop_na = TRUE
+ )
# A tibble: 35,750 x 5
   iso2   year sex   age       n
   <chr> <int> <chr> <chr> <int>
 1 AD     1996 m     014       0
 2 AD     1996 m     1524      0
 3 AD     1996 m     2534      0
 4 AD     1996 m     3544      4
 5 AD     1996 m     4554      1
 6 AD     1996 m     5564      0
 7 AD     1996 m     65        0
 8 AD     1996 f     014       0
 9 AD     1996 f     1524      1
10 AD     1996 f     2534      1
# … with 35,740 more rows

4. 变量既在列中存储,又在行中存储。

  • 两个变量存储在行里:tmin和tmax
  • 每日日期储存在列里:d1, d2,...

这里尝试用多次 %>% 把所有操作按部就班then连接在一起

> head(raw[,1:10])
       id year month element d1   d2   d3 d4   d5 d6
1 MX17004 2010     1    tmax NA   NA   NA NA   NA NA
2 MX17004 2010     1    tmin NA   NA   NA NA   NA NA
3 MX17004 2010     2    tmax NA 27.3 24.1 NA   NA NA
4 MX17004 2010     2    tmin NA 14.4 14.4 NA   NA NA
5 MX17004 2010     3    tmax NA   NA   NA NA 32.1 NA
6 MX17004 2010     3    tmin NA   NA   NA NA 14.2 NA
> raw %>%   # 1.首先处理列变量
+   pivot_longer(
+     d1:d31,
+     names_to = "day",
+     values_to = 'value',
+     values_drop_na = TRUE
+   ) %>%   # 2.然后处理行变量
+     pivot_wider(
+       names_from = element, values_from = value
+       ) %>%  # 3.提取day中的数字部分
+         mutate(
+           day = as.integer(gsub("d", "", day))
+           )
# A tibble: 33 x 6
   id       year month   day  tmax  tmin
   <chr>   <int> <int> <int> <dbl> <dbl>
 1 MX17004  2010     1    30  27.8  14.5
 2 MX17004  2010     2     2  27.3  14.4
 3 MX17004  2010     2     3  24.1  14.4
 4 MX17004  2010     2    11  29.7  13.4
 5 MX17004  2010     2    23  29.9  10.7
 6 MX17004  2010     3     5  32.1  14.2
 7 MX17004  2010     3    10  34.5  16.8
 8 MX17004  2010     3    16  31.1  17.6
 9 MX17004  2010     4    27  36.3  16.7
10 MX17004  2010     5    27  33.2  18.2
# … with 23 more rows

小总结

似乎在数据清理方面,代码可读性&简洁性都满分 ~ 其实 python 并不会比R高贵,不要抱怨为什么学校只教 r 作业也需要 r 完成,因为确实好用(如果你觉得难用,只是因为不熟悉或者不适应👏当然现阶段,面临面试笔试的我们,当然还是需要两个都掌握(听到一些同学对 R 的偏见之后的小感慨.....

最后放上李舰师兄的两句话,即使是coding初级选手,也要记得严格要求自己代码习惯呀,尤其是可读性方面: