Chapter 7 Tidyverse

In this chapter we will go over how to use tidyverse for data cleaning and wrangling tasks. You will learn about tidyverse grammar and operators, as well as specific functions.

7.1 Introduction

When people learn about R and programming, they tend to get lost or overwhelmed… or both. To help with that, approach learning R as you would have approached learning French in school. Essentially, everything we have taught you in the previous chapters is the basic “Hello”, “Goodbye”, “Where is the library?” type of phrase. Stuff to get you by!

Now we can teach you more advanced tasks and help you become more fluent in the R language.

tidyverse is a collection of different R packages, like the ones you have installed and loaded with the install.packages() and library() functions. tidyverse has its own type of grammar for navigating and using the functions in your code.

Think of it as a dialect of R, like Quebec French versus Acadian or France French.

Let’s load in tidyverse and talk about the basics.

7.1.1 Installation and Set-Up

To install and load tidyverse packages, all you have to do is ask R to install using the install.packages() function.

# Install Tidyverse like this
install.packages('tidyverse')

# Remember - You only need to install it ONCE

# Load in the packages like this
library(tidyverse) # loads them all at once!

Note: From this point on, it would be a good habit to include library(tidyverse) in every Load Libraries section of your scripts because these packages are used often.

Now that we have tidyverse installed and loaded, we can talk about the grammar.

7.1.2 The Pipe

A pipe in R looks like %>% or |>; we will focus on using %>%. It links a chain of commands into a pipeline.

Each pipe passes the result on its left to the function on its right, so you can build on your data step by step without having to start at the beginning of the pipeline at every single step. The alternative, wrapping one function call inside another, is called nesting functions.

To show you an example, we will compare two different ways to find the sum of Sepal.Length from the iris dataset, and then find the square root of that number.

# Here is how to do it without using tidyverse grammar - nested functions
sqrt(sum(iris$Sepal.Length))

# Here is how to do it in tidyverse using a pipe
iris$Sepal.Length %>% 
  sum() %>% 
  sqrt()

Note: To help, when you read your code, say THEN in place of the pipes.

To create a pipe, you can manually enter each of the characters or use a keyboard shortcut:

Windows users can press Ctrl + Shift + M

Mac users can press Command + Shift + M

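It is important to remember that if you don’t have tidyverse loaded in your script, R will NOT recognize %>% as a function and you will receive an error message. The |> pipe is built into R itself (version 4.1 and later), so it works without loading any package.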
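As a quick check, the sketch below runs the same calculation three ways and confirms that they agree. The object names (nested, piped, native) are made up for this example, and the |> line assumes R 4.1 or later.

```r
library(dplyr) # attaches %>% (library(tidyverse) would work too)

# Nested functions: read from the inside out
nested <- sqrt(sum(iris$Sepal.Length))

# tidyverse pipe: read top to bottom, saying THEN at each pipe
piped <- iris$Sepal.Length %>%
  sum() %>%
  sqrt()

# Native pipe: no package needed, but requires R 4.1 or later
native <- iris$Sepal.Length |>
  sum() |>
  sqrt()

# All three approaches give the same number
identical(nested, piped)
identical(nested, native)
```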

7.1.3 Assigning Objects

In the data importing section you may have noticed the use of <-. This is the assignment operator. Think of it as “is” when you read code out loud.

To create a new data frame, list, or vector in R, you have to assign it to an object in order to refer to it again.

A good trick to check whether your data frame was saved is to look at the Console tab in the bottom left pane. If you see the output printed there, it was not saved as an object in the environment.

You can also look in the Environment tab in the top right pane. If you see the object name there, it is saved in the environment and you can refer to it again.
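Here is a small sketch of the difference between printing a result and saving it. The object name total_sepal_length is made up for this example.

```r
library(dplyr)

# No assignment: the result prints in the Console and nothing is saved
iris$Sepal.Length %>%
  sum()

# With assignment: nothing prints, but total_sepal_length
# now shows up in the Environment tab
total_sepal_length <- iris$Sepal.Length %>%
  sum()

# Because it is saved, you can refer to it again by name
sqrt(total_sepal_length)
```

Read the second pipeline out loud as: total_sepal_length IS Sepal.Length, THEN sum.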

7.1.4 Cleaning & Wrangling

Since we have told you the basics for using Excel to enter, store, and do some basic data cleaning, we are going to show you how to do this entirely in R.

Before we start, we need to discuss a SUPER important rule of thumb…

Raw Data Stays Raw

In previous chapters we talked about the basic best practices in Excel for entering and formatting your spreadsheets, and for documenting your metadata or context.

When you are cleaning or wrangling your data, you should never change or work on the original dataset. This represents your raw data. You always want to be able to refer back to the original, just in case something goes wrong, or you want to double check something.

When working in Excel, saving a copy of your data is a great way to make sure raw data stays raw. In R, this happens when you export your data. Importing a file only loads a copy of the data into R, so nothing you do in R changes the file itself until you tell R to save (export) the data.

Therefore, after cleaning data in R, you should save the cleaned version under a different name than the raw data.

Example:

Raw Data: research_data_raw.csv.

Cleaned Version: research_data_clean.csv.

This will help you stay organized, minimize the chances to lose data, and give you opportunities to fix potential errors in your wrangling process.
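The workflow above might look something like the sketch below. It assumes the readr functions read_csv() and write_csv(), and it creates a tiny made-up raw file in a temporary folder so the code can run anywhere. The column names (id, length_mm) are invented for illustration, and the cleaning step is just a placeholder.

```r
library(readr)
library(dplyr)

# Set-up for this sketch only: a throwaway folder and a made-up raw file
data_dir <- tempdir()
raw_path <- file.path(data_dir, "research_data_raw.csv")
clean_path <- file.path(data_dir, "research_data_clean.csv")
write_csv(tibble(id = 1:3, length_mm = c(120, NA, 95)), raw_path)

# Import the raw data (this loads a copy into R)
raw_data <- read_csv(raw_path, show_col_types = FALSE)

# Clean the copy; the raw file on disk is untouched
clean_data <- raw_data %>%
  filter(!is.na(length_mm))

# Export the cleaned version under a DIFFERENT name
write_csv(clean_data, clean_path)
```

After running this, both files exist side by side, and the raw file still contains all of its original rows.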

7.2 DPLYR

dplyr is the package used for most data cleaning tasks. Its authors describe it as a grammar of data manipulation.

All of these functions follow tidyverse pipe grammar and are consistent and compatible with other packages.

We will cover some of the most commonly used functions within this package!

7.2.1 Filter

The filter() function does exactly what you would expect. It filters your data and lets you choose which rows to keep.

When using filter(), there are different ways to specify how you want your data to be filtered. You build these conditions with comparison and logical operators, which follow the rules of Boolean algebra.

Operation	Meaning
`==`	is equal to
`!`	not
`!=`	is not equal to
`>` or `<`	greater than / less than
`>=` or `<=`	greater than or equal to / less than or equal to
`,` and `&`	and
`|`	or
`%in%`	is one of

Let’s see some examples!

# Let's show only the setosa species AND a Sepal.Length of greater than 5.5
iris %>% 
  filter(Species == "setosa", Sepal.Length > 5.5)

# What about setosa species OR a sepal length of greater than 5.5
iris %>% 
  filter(Species == "setosa" | Sepal.Length > 5.5)

# Let's try setosa OR virginica, AND a Sepal.Length of greater than 5.5
iris %>% 
  filter(Species %in% c("setosa", "virginica") & Sepal.Length > 5.5)

# Removing one species?
iris %>% 
  filter(Species != "setosa")
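Two operators from the table are not used in the examples above: ! and >=. Here is a short sketch of both. The object names are made up for this example.

```r
library(dplyr)

# ! flips a condition: keep everything that is NOT setosa or virginica
not_these_two <- iris %>%
  filter(!(Species %in% c("setosa", "virginica")))

# >= includes the boundary value, > does not
at_least_7 <- iris %>%
  filter(Sepal.Length >= 7)

more_than_7 <- iris %>%
  filter(Sepal.Length > 7)

# Compare the number of rows each filter keeps
nrow(at_least_7)
nrow(more_than_7)
```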

Removing or isolating NA values is a little different. To do this, you need to ask if a value is NA and then specify which answer to include/exclude.

Like this:

# Filter rows where Sepal.Length is not an NA
# (!is.na(x) reads as "x is not NA")
iris %>% 
  filter(!is.na(Sepal.Length))
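The iris dataset has no missing values, so the filter above keeps every row. To see NA filtering in action, here is a sketch using a small made-up data frame (the fish object and its columns are invented for this example).

```r
library(dplyr)

# Made-up measurements with one missing value
fish <- tibble(id = 1:4, length_mm = c(120, NA, 95, 110))

# Keep only the rows where length_mm is present
complete_rows <- fish %>%
  filter(!is.na(length_mm))

# Isolate the rows where length_mm is missing
missing_rows <- fish %>%
  filter(is.na(length_mm))

# Comparing with == NA does not work: the answer is NA for every row,
# and filter() drops rows where the condition is NA
wrong_way <- fish %>%
  filter(length_mm == NA)

nrow(wrong_way)
```

This is why NA values need is.na() instead of the usual == and != operators.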

7.2.2 Arrange

arrange() lets you order the rows by a variable in ascending or descending order.

# Arrange data by Sepal.Length
iris %>% 
  arrange(Sepal.Length)

# Let's filter the species first, and change the order to show biggest to smallest
iris %>% 
  filter(Species == "virginica") %>% 
  arrange(desc(Sepal.Length))
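arrange() also accepts more than one variable. In the sketch below, rows are ordered by species first, and then from biggest to smallest Sepal.Length within each species. The object name sorted_iris is made up for this example.

```r
library(dplyr)

# Order by Species, THEN by Sepal.Length (biggest first) within each species
sorted_iris <- iris %>%
  arrange(Species, desc(Sepal.Length))

# Look at the first few rows
head(sorted_iris)
```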

7.2.3 Select and Rename

select() and rename() are two more functions that do exactly what their names imply. select() lets you pick and choose which variables to include, and rename() lets you change the names of the variables.

Here are some ways to use select() and rename() functions.

# Select only the length variables for a species
iris %>% 
  select(Species, Sepal.Length, Petal.Length)

# What about removing variables?
iris %>% 
  select(-Petal.Width, -Sepal.Width)

# Rename the variables to follow our naming conventions
iris %>% 
  rename(species = Species,
         sepal_length = Sepal.Length,
         sepal_width = Sepal.Width,
         petal_length = Petal.Length,
         petal_width = Petal.Width)

You could also do both things in the select() function.

# Select a couple of variables AND rename them
iris %>% 
  select(species = Species, sepal_length = Sepal.Length, petal_length = Petal.Length)

7.2.4 Mutate

mutate() is a function that allows you to change existing variables, or create new ones.

# Multiply the sepal length by 30
iris %>% 
  mutate(Sepal.Length = Sepal.Length * 30)

# Create a new variable
iris %>% 
  mutate(sepal_length_new = Sepal.Length * 30)

# Create and modify multiple variables
iris %>% 
  mutate(sepal_length_new = Sepal.Length * 30,
         Petal.Length = Petal.Length * 30,
         random.variable = "you can do text AND stats too",
         mean_sepal_length = mean(Sepal.Length)
  )

7.2.5 Case When

case_when() lets you create and modify variables based on conditions within the variable or other variables.

Think of case_when() as a series of if-else statements. If this variable is this, then do this. If the variable is that, then do that. If not, then do this instead.

Here are some examples to explain:

# Example 1
iris %>% 
  mutate(sepal_length_new = case_when(Species == "setosa" ~ Sepal.Length * 10,
                                      Species == "virginica" ~ Sepal.Length * 1000,
                                      TRUE ~ Sepal.Length)
  )

# Example 2
# iris stores Species as a factor, so convert it to text to match "not relevant"
iris %>% 
  mutate(Species = case_when(Species %in% c("setosa", "virginica") ~ "not relevant",
                             TRUE ~ as.character(Species))
  )

Try to read the code and figure out what each example is doing.
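One more thing worth knowing: case_when() checks the conditions from top to bottom and uses the first one that is true. The sketch below shows why the order matters. The size labels and cut-offs are made up for this example.

```r
library(dplyr)

sized <- iris %>%
  mutate(size = case_when(
    Sepal.Length > 7 ~ "large",  # checked first
    Sepal.Length > 5 ~ "medium", # only reached if the first condition is FALSE
    TRUE ~ "small"               # everything else
  ))

# Count how many flowers fall in each size group
sized %>%
  count(size)
```

A flower with a Sepal.Length of 7.5 is greater than both 7 and 5, but it is labelled "large" because that condition comes first. If the two conditions were swapped, it would be labelled "medium" instead.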

7.2.6 Conclusion

Let’s try an example where multiple dplyr functions are used within a pipeline.

iris %>% 
  select(Species, Sepal.Length) %>% 
  filter(Species == "setosa", Sepal.Length > 5) %>% 
  arrange(Sepal.Length) %>% 
  mutate(crazy_sepal_length = Sepal.Length * 400) %>% 
  rename(original_sepal_length = Sepal.Length)

What happened at each step in the pipeline? Write it out and use THEN when you move to the next line of code.

7.3 Lubridate

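lubridate is another tidyverse package, used for working with dates. Dates are written in many different formats around the world, and they come up so often in data that they need their own package to cover everything that could happen. In tidyverse 2.0.0 and later, library(tidyverse) loads lubridate for you; with older versions, load it separately with library(lubridate).

Let’s run through a demo written entirely as an R script to get you familiar with dates and with the demo format. The code assumes a data frame called mydata with year, month, day, hour, minute, and second columns.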

# create a date and a time string (a series of values that can be
# converted to a date and time easily)
mydata <- mydata %>%
  mutate(mydate = paste(year, month, day, sep = "-"),
         mytime = paste(hour, minute, second, sep = ":"))

# View those variables
mydata$mydate
mydata$mytime

# Use the lubridate cheatsheet and dplyr
# to create the following variables or look at times

# 1) What is the current datetime?
now()      # lubridate
Sys.time() # base R

# 2) Make a datetime variable
mydata <- mydata %>%
  mutate(
    dt = paste(paste(year, month, day, sep = "-"),
               paste(hour, minute, second, sep = ":"), sep = " "),
    datetime = ymd_hms(dt) # parsed as UTC unless you give a tz
  )
names(mydata)
mydata$datetime # check datetime variable

# 3) Make a datetime variable with a timezone of
# Canada/Eastern
# Hint: use OlsonNames() to figure it out
OlsonNames()

# with_tz() keeps the same moment in time and shows it on another clock
# Timezone is EST or EDT depending on date!
mydata <- mydata %>%
  mutate(datetime = with_tz(datetime, tzone = "Canada/Eastern"))
mydata$datetime # check datetime variable

# Timezone is fixed to EST for all datetimes!
mydata <- mydata %>%
  mutate(datetime = with_tz(datetime, tzone = "EST"))
mydata$datetime # check datetime variable


# 4) Change (3) to a timezone of UTC
# Timezone is fixed to UTC for all datetimes!
mydata <- mydata %>%
  mutate(datetime = with_tz(datetime, tzone = "UTC"))
mydata$datetime # check datetime variable

# 5) Convert (3) to the following format
# 28 Jan 2022 13:44:00
format(mydata$datetime)

# format() returns text (character), not a datetime
mydata <- mydata %>%
  mutate(datetime2 = format(datetime, "%d %b %Y %H:%M:%S"))
mydata$datetime2 # check datetime variable

# 6) Create a variable that shows what week the datetime is in
# Week as a number
mydata <- mydata %>%
  mutate(isweek = week(datetime))
mydata$isweek

# Day of week as a label
mydata <- mydata %>%
  mutate(isdayofweek = wday(datetime, label = TRUE))
mydata$isdayofweek

# 7) Create a variable of just the timezone
mydata <- mydata %>%
  mutate(timezone = tz(datetime))
mydata$timezone
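The demo above relies on a mydata object from your own project. If you want to try the ideas without it, here is a self-contained sketch. The two timestamps are made up, and it uses make_datetime() and force_tz() from lubridate, which the demo does not cover, so treat it as one possible approach and not the only one.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: two made-up timestamps split into their parts
mydata <- tibble(
  year = c(2022, 2022), month = c(1, 7), day = c(28, 15),
  hour = c(13, 9), minute = c(44, 30), second = c(0, 0)
)

mydata <- mydata %>%
  mutate(
    # Build the datetime directly from the parts (stored as UTC)
    datetime = make_datetime(year, month, day, hour, minute, second, tz = "UTC"),
    # with_tz(): same moment in time, shown on the Eastern clock
    eastern_converted = with_tz(datetime, tzone = "Canada/Eastern"),
    # force_tz(): same clock time, relabelled as Eastern (a different moment)
    eastern_relabelled = force_tz(datetime, tzone = "Canada/Eastern")
  )

mydata$datetime
mydata$eastern_converted
mydata$eastern_relabelled
```

Use with_tz() when your times were recorded in one timezone and you want to view them in another. Use force_tz() when the clock times are right but R has labelled them with the wrong timezone.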

7.4 Data Summaries & Tables

7.4.1 Introduction

A common task in research analyses is to summarize your data and create different tables to use within a manuscript or thesis. When using R, you can generate summary information across different groups easily and very quickly. No manual manipulation or time-consuming calculations required!

7.4.2 Summarizing Data

In order to summarize your data based on groups, we will use the group_by() and summarise() functions from dplyr, which is part of tidyverse!

group_by() tells R that you want to work in groups determined by the different values within variables. You can define these groups with one or more variables in your dataset! This means that every unique combination of values found in the columns specified within the group_by() function represents its own group.

After telling R what the groups are, you then use the summarise() function to create or modify variables, similar to the mutate() function! The difference is that summarise() collapses each group down to a single row, while mutate() keeps every row.

Here are some examples:

# Example 1: Mean Length by species
mydata %>% 
  group_by(common_name) %>% 
  summarise(tl_mm = mean(tl_mm, na.rm = TRUE),
            fl_mm = mean(fl_mm, na.rm = TRUE))

# Example 2: Mean Length by species and year
mydata %>% 
  group_by(year, common_name) %>% 
  summarise(tl_mm = mean(tl_mm, na.rm = TRUE),
            fl_mm = mean(fl_mm, na.rm = TRUE))

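# Example 3: general statistics for species
# (without na.rm = TRUE, any group with a missing value returns NA)
mydata %>% 
  group_by(common_name) %>% 
  summarise(max_tl_mm = max(tl_mm, na.rm = TRUE),
            min_tl_mm = min(tl_mm, na.rm = TRUE),
            mean_tl_mm = mean(tl_mm, na.rm = TRUE),
            standard_deviation = sd(tl_mm, na.rm = TRUE))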

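Note: Unless you are still working within your grouping variables, you should always ungroup() after you complete your summary. By default, summarise() removes only the last grouping variable, so the result of Example 2 is still grouped by year.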
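The examples above need the mydata object. Here is a sketch of the same ideas using the built-in iris dataset, so you can run it right away. The new object and variable names are made up for this example.

```r
library(dplyr)

# summarise(): one row per group
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    n_flowers = n(),
    mean_sepal_length = mean(Sepal.Length, na.rm = TRUE),
    sd_sepal_length = sd(Sepal.Length, na.rm = TRUE)
  )

species_summary

# mutate() on grouped data: every row is kept, and the mean is
# calculated within each species. The data stays grouped until ungroup().
centred <- iris %>%
  group_by(Species) %>%
  mutate(diff_from_species_mean = Sepal.Length - mean(Sepal.Length)) %>%
  ungroup()

# Check whether the result is still grouped
is_grouped_df(centred)
```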

7.5 Chapter Wrap-Up

7.5.1 Chapter Terms & Definitions

Here is a summary of some of the bolded terms used throughout this chapter. Refer back to this list whenever you need a refresher!

Coming soon!

7.5.2 Additional Links & Resources

If you are still confused, or you would like more information on topics discussed in this chapter, check out the following links:

Intro to Tidyverse

Short Tidyverse Intro

Data Manipulation with DPLYR

Package Cheat Sheets