Tidy Tutorial 1: Creating some data

This is a quick tutorial that introduces some of the basic tidyverse functions by creating some fake data.

The tutorial is designed to build an intuitive understanding of the “pipe” operator: ‘%>%’


####Load the tidyverse library

library(tidyverse)

####Create vectors to use as variables

var1<-rep(seq(1, 10, 1), 10) 
#Create a variable that repeats the sequence 1:10 ten times (100 values)

var2<-rep(seq(0, 1, 1), 50)  
#Create a binary variable that alternates 0, 1 (100 values)

####Combine these variables into a tibble

dat<-as_tibble(cbind(var1, var2))        
# (a 'tibble' is similar to a data frame)
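
In the step above, cbind() first builds a numeric matrix, which as_tibble() then converts. As a minimal sketch of an alternative, assuming you would rather skip the matrix step, the same two vectors can be passed straight to tibble(). The name dat_direct is made up for this example.

```r
library(tidyverse)

var1 <- rep(seq(1, 10, 1), 10)  # the sequence 1:10 repeated 10 times
var2 <- rep(seq(0, 1, 1), 50)   # 0, 1 repeated 50 times

# tibble() names each column after the vector it was built from
dat_direct <- tibble(var1, var2)

dim(dat_direct)    # number of rows, number of columns
names(dat_direct)  # column names
```

One reason to prefer this form is that cbind() forces every column to a single type, while tibble() lets each column keep its own type.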

####View the first ten rows.

head(dat, 10)
## # A tibble: 10 x 2
##     var1  var2
##    <dbl> <dbl>
##  1     1     0
##  2     2     1
##  3     3     0
##  4     4     1
##  5     5     0
##  6     6     1
##  7     7     0
##  8     8     1
##  9     9     0
## 10    10     1

####Alright, we have now created a tibble (saved as “dat”). It consists of 2 columns and 100 rows.


##Tidyverse part 1: The pipe operator and the ‘select’ + ‘arrange’ functions


####Let’s say we want var1 to be our ID variable, so that 1-10 each represent a unique individual

####To do this, we can start by renaming “var1” to “id” within the select function

dat<-dat %>% select(id=var1, var2) 
#Here we introduce the pipe operator '%>%', which passes dat to select() as its first argument

dat
## # A tibble: 100 x 2
##       id  var2
##    <dbl> <dbl>
##  1     1     0
##  2     2     1
##  3     3     0
##  4     4     1
##  5     5     0
##  6     6     1
##  7     7     0
##  8     8     1
##  9     9     0
## 10    10     1
## # … with 90 more rows
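
One way to build intuition for the pipe is to compare a piped call with the equivalent ordinary function call. The sketch below rebuilds the starting tibble so it runs on its own. The names piped and nested are made up for this example.

```r
library(tidyverse)

dat <- tibble(var1 = rep(seq(1, 10, 1), 10),
              var2 = rep(seq(0, 1, 1), 50))

# With the pipe: dat is handed to select() as its first argument
piped <- dat %>% select(id = var1, var2)

# Without the pipe: dat is written out as the first argument
nested <- select(dat, id = var1, var2)

# Both calls should describe the same operation
all.equal(piped, nested)
```

The pipe does not change what select() does. It only changes where the data argument is written, which makes longer chains easier to read from top to bottom.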

####We can also change the name of var2 in a similar manner: let’s name it “group”, which will act as a binary variable

dat<-dat %>% select(id, group=var2) 
#Notice that the new name comes first in the expression (new_name = old_name)

dat
## # A tibble: 100 x 2
##       id group
##    <dbl> <dbl>
##  1     1     0
##  2     2     1
##  3     3     0
##  4     4     1
##  5     5     0
##  6     6     1
##  7     7     0
##  8     8     1
##  9     9     0
## 10    10     1
## # … with 90 more rows
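
select() renames a column and also drops any column you do not list. If the goal is only to rename, dplyr's rename() is one alternative: it uses the same new_name = old_name form and keeps all other columns. A minimal sketch, where renamed is a made-up name:

```r
library(tidyverse)

dat <- tibble(var1 = rep(seq(1, 10, 1), 10),
              var2 = rep(seq(0, 1, 1), 50))

# Rename both columns in one step; no columns are dropped
renamed <- dat %>% rename(id = var1, group = var2)

names(renamed)
```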

####Ok, now let’s say we want to sort the rows by id so that each individual’s rows are ‘stacked’ on top of each other

dat<-dat %>% arrange(id)   
#This is the arrange function, which sorts the rows by id in ascending order

dat     
## # A tibble: 100 x 2
##       id group
##    <dbl> <dbl>
##  1     1     0
##  2     1     0
##  3     1     0
##  4     1     0
##  5     1     0
##  6     1     0
##  7     1     0
##  8     1     0
##  9     1     0
## 10     1     0
## # … with 90 more rows
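
arrange() sorts in ascending order by default. As a small sketch, wrapping the column in desc() reverses the order, so the highest id comes first. The name dat_desc is made up for this example.

```r
library(tidyverse)

dat <- tibble(id    = rep(seq(1, 10, 1), 10),
              group = rep(seq(0, 1, 1), 50))

# Sort by id from highest to lowest
dat_desc <- dat %>% arrange(desc(id))

head(dat_desc$id, 3)  # the first rows now all belong to id 10
```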

##Tidyverse part 2: The ‘mutate’ function and the ‘group_by’ function


####Now let’s say we want to create a new variable. Here we will introduce the mutate function

dat_not_grouped<-dat %>% mutate(time = seq(0, 99)) 
#Here mutate is creating a new variable called "time"

####If you inspect the data in “dat_not_grouped”, you will see that we have created a time variable that goes from 0-99

####What we actually want to do here is create a time variable that goes in order from 0-9 for EACH ID


####In order to do this, we will combine the mutate function with the group_by function

dat_grouped<-dat %>% group_by(id) %>% mutate(time = seq(0, 9))
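
To see the effect of group_by, it can help to run the same mutate call with and without grouping. The sketch below uses row_number() - 1 in place of seq(), an assumption made so that the identical expression works in both cases. The names not_grouped and grouped are made up for this example.

```r
library(tidyverse)

dat <- tibble(id = rep(seq(1, 10, 1), 10)) %>% arrange(id)

# Ungrouped: rows are numbered across the whole tibble
not_grouped <- dat %>% mutate(time = row_number() - 1)

# Grouped: the numbering restarts within each id
grouped <- dat %>% group_by(id) %>% mutate(time = row_number() - 1)

range(not_grouped$time)  # smallest and largest time, whole tibble
range(grouped$time)      # smallest and largest time, restarting per id
```

After group_by, mutate evaluates its expression once per group, which is why the grouped version counts from 0 again for every id.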

####Alright, now let’s create a continuous random variable called ‘score’, grouped within people

####Because the scores are drawn at random, your values will differ from the ones printed below

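dat<-dat_grouped %>%  
  group_by(id) %>%                       #dat_grouped is already grouped by id; repeated here for clarity
  mutate(score = rnorm(10, 0, 1.5)) %>%  #draw 10 values per id (mean 0, sd 1.5)
  mutate(score=round(score, 2))          #round to 2 decimal places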

#You can stack mutate functions vertically for better readability

dat                                  
## # A tibble: 100 x 4
## # Groups:   id [10]
##       id group  time score
##    <dbl> <dbl> <int> <dbl>
##  1     1     0     0  0.01
##  2     1     0     1 -0.77
##  3     1     0     2  0.13
##  4     1     0     3 -0.96
##  5     1     0     4 -3.24
##  6     1     0     5  1.07
##  7     1     0     6  1.17
##  8     1     0     7 -0.13
##  9     1     0     8  1.09
## 10     1     0     9 -0.09
## # … with 90 more rows
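
If you want the random scores to be the same every time the code runs, one option is to call set.seed() before drawing them. This is a sketch only: the seed value 123 is arbitrary, add_scores() is a hypothetical helper written for this example, and n() is used in place of the hard-coded 10 so the draw matches each group's size.

```r
library(tidyverse)

dat_grouped <- tibble(id = rep(seq(1, 10, 1), 10)) %>%
  arrange(id) %>%
  group_by(id)

# Hypothetical helper: add a rounded random score within each id
add_scores <- function(d) {
  d %>% mutate(score = round(rnorm(n(), 0, 1.5), 2))
}

set.seed(123)                     # arbitrary seed
run1 <- add_scores(dat_grouped)

set.seed(123)                     # same seed again
run2 <- add_scores(dat_grouped)

identical(run1$score, run2$score) # compare the two sets of draws
```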

##Tidyverse part 3: The filter function


####Here, let’s view id == 5. We will do this using the filter function

dat %>% filter(id==5)                                 
## # A tibble: 10 x 4
## # Groups:   id [1]
##       id group  time score
##    <dbl> <dbl> <int> <dbl>
##  1     5     0     0  1.63
##  2     5     0     1 -1.35
##  3     5     0     2 -0.07
##  4     5     0     3 -1.39
##  5     5     0     4  0.92
##  6     5     0     5  0.26
##  7     5     0     6 -1.51
##  8     5     0     7  0.96
##  9     5     0     8 -2.6 
## 10     5     0     9  1.8

####We can also use filter to keep only the rows where group == 1

group_1_dat<-dat %>% filter(group==1)                                 
#Save only the rows for group 1 in a new tibble
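
filter() also accepts several conditions separated by commas, and a row is kept only if all of them are true. A minimal sketch, assuming we want the first three time points for group 1 (early_group_1 is a made-up name):

```r
library(tidyverse)

dat <- tibble(id    = rep(seq(1, 10, 1), 10),
              group = rep(seq(0, 1, 1), 50)) %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(time = seq(0, 9))

# Keep rows that are in group 1 AND have a time below 3
early_group_1 <- dat %>% filter(group == 1, time < 3)

nrow(early_group_1)  # number of rows that meet both conditions
```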

##Review: ‘select’, ‘arrange’, ‘mutate’, ‘group_by’, ‘filter’


####Above we created a tibble (“dat”) and then applied functions to ‘dat’ using the pipe operator

####To review these functions, I will create a ‘group_0_dat’ by chaining all of them in a single pipeline

dat2<-as_tibble(cbind(var1, var2))   
#This recreates the original, unmodified dat tibble

group_0_dat<-dat2 %>% 
  select(id=var1, group=var2) %>% 
  arrange(id) %>% 
  group_by(id) %>% 
  mutate(time = seq(0, 9)) %>%
  mutate(score = round(rnorm(10, 0, 1.5), 2)) %>% 
  filter(group==0) 

head(group_0_dat)                   
## # A tibble: 6 x 4
## # Groups:   id [1]
##      id group  time score
##   <dbl> <dbl> <int> <dbl>
## 1     1     0     0  0.93
## 2     1     0     1 -1.93
## 3     1     0     2 -2.16
## 4     1     0     3 -0.41
## 5     1     0     4 -0.03
## 6     1     0     5  0.5
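
The "Groups: id" line in the output shows that the result is still grouped by id. As a hedged sketch, ungroup() removes the grouping once it is no longer needed, and count() then gives a quick check of how many rows each id kept. The score column is left out here because it is random, and rows_per_id is a made-up name.

```r
library(tidyverse)

group_0_dat <- tibble(var1 = rep(seq(1, 10, 1), 10),
                      var2 = rep(seq(0, 1, 1), 50)) %>%
  select(id = var1, group = var2) %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(time = seq(0, 9)) %>%
  filter(group == 0)

# Drop the grouping, then count the rows left for each id
rows_per_id <- group_0_dat %>%
  ungroup() %>%
  count(id)

rows_per_id
```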

##Writing out the tibble “dat”


####We can now save our “dat” tibble as a csv file using ‘write.csv’

write.csv(dat, file="Tidy1_dat.csv", row.names = FALSE)
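
A quick way to confirm the file was written as expected is to read it back in. The sketch below writes to a temporary file (out_file, created with tempfile()) rather than to "Tidy1_dat.csv", purely so the example leaves nothing behind in your working directory.

```r
library(tidyverse)

dat <- tibble(id    = rep(seq(1, 10, 1), 10),
              group = rep(seq(0, 1, 1), 50)) %>%
  arrange(id)

# Temporary path used only for this example
out_file <- tempfile(fileext = ".csv")

write.csv(dat, file = out_file, row.names = FALSE)

# Read the file back in and inspect it
dat_back <- read.csv(out_file)

dim(dat_back)    # rows and columns read back
names(dat_back)  # column names read back
```

Note that read.csv() returns a plain data frame, so any grouping set with group_by is not stored in the csv file.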

PJ Ryan
Doctoral student in HDFS