Chapter 1 Tidy Data and data manipulation
1.1 Packages needed for this chapter
For this chapter you need to have the tidyverse package installed.
This chapter will explain what a tidy (Wickham and others 2014) database is and learn how to use functions from the dplyr (Wickham et al. 2022) package to manipulate data.
This class of the course can also be followed at this link.
1.2 Tidy data
A tidy database is a database in which (modified from (Leek 2015)):
- Each variable to be measured must be in a column.
- Each observation other than that variable must be in a different row.
In general, the way we would represent a tidy database in R
is by using a data frame.
1.3 dplyr
The dplyr package is defined by its authors as a grammar for data manipulation. Thus their functions are known as verbs. A helpful summary of many of these features is found at this link.
This package has a large number of verbs and it would be difficult to see all of them in one class, in this chapter we will focus on its most used functions, which are:
- %>% (pipelines)
- group_by (group data)
- summarize (summarize grouped data)
- mutate (generate new variables)
- filter (find rows with certain conditions)
- select next to starts_with, ends_with or contains
in the next section you can learn about the pipeline (%>%), group_by and summarize
1.3.1 Pipeline (%>%)
The pipeline is an operator symbol %>%
that is used to perform various operations sequentially without resorting to nested parentheses or overwriting multiple databases.
To see how this works as a vector, suppose you have a variable that you want to first get its logarithm, then its square root, and finally its average to two significant figures. To do this, the following must be followed:
<- c(1, 4, 6, 8)
x <- round(mean(sqrt(log(x))), 2) y
If pipelined, the code would be much neater. In that case, it would start with the object to be processed and then each of the functions with their arguments if necessary:
<- c(1, 4, 6, 8)
x <- x %>%
y log() %>%
sqrt() %>%
mean() %>%
round(2)
## [1] 0.99
Piped code is much easier to interpret at first glance since it reads from left to right and not from the inside out.
1.3.2 summarize
The summarize
function takes the data from a data frame and summarizes it. To use this function, the first argument we would take would be a data frame, followed by the name we want to give to a summary variable, followed by the = sign and then the formula to apply to one or more columns. As an example we will use the iris
database (Anderson 1935) that comes in R
and of which we can see part of its data in the table 1.1
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.8 | 4.0 | 1.2 | 0.2 | setosa |
4.7 | 3.2 | 1.6 | 0.2 | setosa |
5.1 | 3.8 | 1.9 | 0.4 | setosa |
5.2 | 2.7 | 3.9 | 1.4 | versicolor |
6.4 | 2.9 | 4.3 | 1.3 | versicolor |
5.5 | 2.5 | 4.0 | 1.3 | versicolor |
6.5 | 3.0 | 5.8 | 2.2 | virginica |
6.0 | 2.2 | 5.0 | 1.5 | virginica |
6.1 | 2.6 | 5.6 | 1.4 | virginica |
5.9 | 3.0 | 5.1 | 1.8 | virginica |
If we wanted to summarize that table and generate a couple of variables that were the mean and standard deviation of the length of the petal, we would do it with the following code:
library(tidyverse)
<- summarize(iris, Mean.Petal.Length = mean(Petal.Length),
Summary.Petal SD.Petal.Length = sd(Petal.Length))
The result can be seen in the table 1.2, in which the averages and standard deviations of the lengths of the petals are obtained. It is important to note that when using summarize, all other variables will disappear from the table.
Mean.Petal.Length | SD.Petal.Length |
---|---|
3.758 | 1.765298 |
1.3.3 group_by
The group_by
function itself does not cause any visible changes to the databases. However, when used in conjunction with summarize
it allows you to summarize a grouped variable (usually) based on one or more categorical variables.
It can be seen that for the example with the case of plants of the genus Iris, the summary obtained in the case of the table 1.2 is not so useful considering that we have three species present. If you want to see the average length of the petal by species, you must use the group_by
function as follows:
<- group_by(iris, Species)
BySpecies <- summarize(BySpecies, Mean.Petal.Length = mean(Petal.Length),
Summary.Byspecies SD.Petal.Length = sd(Petal.Length))
This results in the table 1.3, from which you can see that Iris setosa has much shorter petals than the other two species of the same genus.
Species | Mean.Petal.Length | SD.Petal.Length |
---|---|---|
setosa | 1.462 | 0.1736640 |
versicolor | 4.260 | 0.4699110 |
virginica | 5.552 | 0.5518947 |
1.3.3.1 group_by on more than one variable
You can use the group_by
function on more than one variable, and this would result in a nested summary. As an example we will use the mtcars
database present in R (Henderson and Velleman 1981). This database features a variable called mpg (miles per gallon) and a measure of fuel efficiency. The information will be summarized based on the variable am (which refers to the type of transmission, where 0 is automatic and 1 is manual) and the number of engine cylinders. For that, the following code will be used:
<- group_by(mtcars, cyl, am)
Grouped <- summarize(Grouped, Efficiency = mean(mpg)) Efficiency
As can be seen in the table 1.4, in all cases cars with manual transmissions have better fuel efficiency. You could try changing the order of the variables with which to group and observe the different results that can be obtained.
cyl | am | Efficiency |
---|---|---|
4 | 0 | 22.90000 |
4 | 1 | 28.07500 |
6 | 0 | 19.12500 |
6 | 1 | 20.56667 |
8 | 0 | 15.05000 |
8 | 1 | 15.40000 |
1.3.4 mutate
This function aims to create new variables based on other variables. It is very easy to use, as an argument the name of the new variable that you want to create is used and an operation is performed with variables that are already there. For example, if we continue working with the Iris database, by creating a new variable that is the ratio between the length of the petal and the length of the sepal, the following results:
<- mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length) DF
The result of this operation is the table 1.5. The variable that has just been created will always appear at the end of the data frame.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Petal.Sepal.Ratio |
---|---|---|---|---|---|
5.8 | 4.0 | 1.2 | 0.2 | setosa | 0.21 |
4.7 | 3.2 | 1.6 | 0.2 | setosa | 0.34 |
5.1 | 3.8 | 1.9 | 0.4 | setosa | 0.37 |
5.2 | 2.7 | 3.9 | 1.4 | versicolor | 0.75 |
6.4 | 2.9 | 4.3 | 1.3 | versicolor | 0.67 |
5.5 | 2.5 | 4.0 | 1.3 | versicolor | 0.73 |
6.5 | 3.0 | 5.8 | 2.2 | virginica | 0.89 |
6.0 | 2.2 | 5.0 | 1.5 | virginica | 0.83 |
6.1 | 2.6 | 5.6 | 1.4 | virginica | 0.92 |
5.9 | 3.0 | 5.1 | 1.8 | virginica | 0.86 |
1.3.4.1 The pipeline in data frames
For example, we want to summarize the newly created variable of the ratio between the sepal and the petal. To do this, if starting from the original database, it would take several lines of code and the creation of multiple intermediate databases.
<- mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length)
DF <- group_by(DF, Species)
BySpecies <- summarize(BySpecies, MEAN = mean(Petal.Sepal.Ratio),
Summary.Byspecies SD = sd(Petal.Sepal.Ratio))
Another option is to use nested parentheses, which results in the following code:
<- summarize(group_by(mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length), Species), MEAN = mean(Petal.Sepal.Ratio), SD = sd(Petal.Sepal. Ratio)) Summary.Byspecies
This is further simplified by using the pipeline, which allows you to start at a Data Frame and then use the pipeline. This allows to obtain the same result as in the previous operations with the following code:
<- iris %>%
Summary.Byspecies mutate(Petal.Sepal.Ratio = Petal.Length/Sepal.Length) %>%
group_by(Species) %>%
summarize(MEAN = mean(Petal.Sepal.Ratio), SD = sd(Petal.Sepal.Ratio))
These three codes are correct (table 1.6), but definitely the use of the pipeline gives the most concise and easy to interpret code without intermediate steps.
Species | MEAN | SD |
---|---|---|
setosa | 0.2927557 | 0.0347958 |
versicolor | 0.7177285 | 0.0536255 |
virginica | 0.8437495 | 0.0438064 |
1.3.5 filter
This function allows you to select rows that meet certain conditions, such as having a value greater than a threshold or belonging to a certain class. The most typical symbols to use in this case are those seen in the table 1.7.
symbol | meaning | cont_symbol | cont_meaning |
---|---|---|---|
> | Greater than | != | other than |
< | Less than | %in% | within group |
== | Equal to | is.na | is NA |
>= | greater than or equal to | !is.na | is not NA |
<= | less than or equal to | | & | or, and |
For example, if you want to study the floral characteristics of plants of the genus Iris, but do not take into account the species Iris versicolor, you should use the following code:
data("iris")
<- iris %>%
DF filter(Species != "versicolor") %>%
group_by(Species) %>%
summarise_all(mean)
This results in the table 1.8. In this case, the summarize_all
function of summarize
is introduced, which applies the function given as an argument to all variables in the database.
Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |
On the other hand, if you want to study how many plants of each species have a petal length greater than 4 and a sepal length greater than 5, you should use the following code:
<- iris %>%
DF filter(Petal.Length >= 4 & Sepal.Length >= 5) %>%
group_by(Species) %>%
summarise(N = n())
In the table table 1.9 it can be seen that with this filter all Iris setosa plants disappear from the database and that all except one Iris virginica plant have both characteristics.
Species | N |
---|---|
versicolor | 39 |
virginica | 49 |
1.3.6 select
This function allows you to select the variables to use since in many cases we will find databases with too many variables and therefore, we will want to reduce them to only work on a table with the necessary variables.
With select there are several ways to work, on the one hand you can write the variables that will be used, or subtract those that will not. In that sense these four codes give exactly the same result. This can be seen in the table 1.10
%>%
iris group_by(Species) %>%
select(Petal.Length, Petal.Width) %>%
summarize_all(mean)
%>%
iris group_by(Species) %>%
select(-Sepal.Length, -Sepal.Width) %>%
summarize_all(mean)
%>%
iris group_by(Species) %>%
select(contains("Petal")) %>%
summarize_all(mean)
%>%
iris group_by(Species) %>%
select(-contains("Sepal")) %>%
summarize_all(mean)
Species | Petal.Length | Petal.Width |
---|---|---|
setosa | 1.462 | 0.246 |
versicolor | 4.260 | 1.326 |
virginica | 5.552 | 2.026 |