Chapter 1 Tidy Data and data manipulation

1.1 Packages needed for this chapter

For this chapter you need to have the tidyverse package installed.

This chapter will explain what a tidy (Wickham and others 2014) database is and learn how to use functions from the dplyr (Wickham et al. 2022) package to manipulate data.

This class of the course can also be followed at this link.

1.2 Tidy data

A tidy database is a database in which (modified from (Leek 2015)):

  • Each variable to be measured must be in a column.
  • Each observation other than that variable must be in a different row.

In general, the way we would represent a tidy database in R is by using a data frame.

1.3 dplyr

The dplyr package is defined by its authors as a grammar for data manipulation. Thus their functions are known as verbs. A helpful summary of many of these features is found at this link.

This package has a large number of verbs and it would be difficult to see all of them in one class, in this chapter we will focus on its most used functions, which are:

  • %>% (pipelines)
  • group_by (group data)
  • summarize (summarize grouped data)
  • mutate (generate new variables)
  • filter (find rows with certain conditions)
  • select next to starts_with, ends_with or contains

in the next section you can learn about the pipeline (%>%), group_by and summarize

1.3.1 Pipeline (%>%)

The pipeline is an operator symbol %>% that is used to perform various operations sequentially without resorting to nested parentheses or overwriting multiple databases.

To see how this works as a vector, suppose you have a variable that you want to first get its logarithm, then its square root, and finally its average to two significant figures. To do this, the following must be followed:

x <- c(1, 4, 6, 8)
y <- round(mean(sqrt(log(x))), 2)

If pipelined, the code would be much neater. In that case, it would start with the object to be processed and then each of the functions with their arguments if necessary:

x <- c(1, 4, 6, 8)
y <- x %>%
    log() %>%
    sqrt() %>%
    mean() %>%
    round(2)
## [1] 0.99

Piped code is much easier to interpret at first glance since it reads from left to right and not from the inside out.

1.3.2 summarize

The summarize function takes the data from a data frame and summarizes it. To use this function, the first argument we would take would be a data frame, followed by the name we want to give to a summary variable, followed by the = sign and then the formula to apply to one or more columns. As an example we will use the iris database (Anderson 1935) that comes in R and of which we can see part of its data in the table 1.1

Table 1.1: a table with 10 rows from the iris database.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.8 4.0 1.2 0.2 setosa
4.7 3.2 1.6 0.2 setosa
5.1 3.8 1.9 0.4 setosa
5.2 2.7 3.9 1.4 versicolor
6.4 2.9 4.3 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
6.5 3.0 5.8 2.2 virginica
6.0 2.2 5.0 1.5 virginica
6.1 2.6 5.6 1.4 virginica
5.9 3.0 5.1 1.8 virginica

If we wanted to summarize that table and generate a couple of variables that were the mean and standard deviation of the length of the petal, we would do it with the following code:

library(tidyverse)
Summary.Petal <- summarize(iris, Mean.Petal.Length = mean(Petal.Length),
    SD.Petal.Length = sd(Petal.Length))

The result can be seen in the table 1.2, in which the averages and standard deviations of the lengths of the petals are obtained. It is important to note that when using summarize, all other variables will disappear from the table.

Table 1.2: Summary of the mean and standard deviation of the petal length of the flowers of the genus Iris.
Mean.Petal.Length SD.Petal.Length
3.758 1.765298

1.3.3 group_by

The group_by function itself does not cause any visible changes to the databases. However, when used in conjunction with summarize it allows you to summarize a grouped variable (usually) based on one or more categorical variables.

It can be seen that for the example with the case of plants of the genus Iris, the summary obtained in the case of the table 1.2 is not so useful considering that we have three species present. If you want to see the average length of the petal by species, you must use the group_by function as follows:

BySpecies <- group_by(iris, Species)
Summary.Byspecies <- summarize(BySpecies, Mean.Petal.Length = mean(Petal.Length),
    SD.Petal.Length = sd(Petal.Length))

This results in the table 1.3, from which you can see that Iris setosa has much shorter petals than the other two species of the same genus.

Table 1.3: Summary of the mean and standard deviation of the petal length of the flowers of the genus Iris.
Species Mean.Petal.Length SD.Petal.Length
setosa 1.462 0.1736640
versicolor 4.260 0.4699110
virginica 5.552 0.5518947

1.3.3.1 group_by on more than one variable

You can use the group_by function on more than one variable, and this would result in a nested summary. As an example we will use the mtcars database present in R (Henderson and Velleman 1981). This database features a variable called mpg (miles per gallon) and a measure of fuel efficiency. The information will be summarized based on the variable am (which refers to the type of transmission, where 0 is automatic and 1 is manual) and the number of engine cylinders. For that, the following code will be used:

Grouped <- group_by(mtcars, cyl, am)
Efficiency <- summarize(Grouped, Efficiency = mean(mpg))

As can be seen in the table 1.4, in all cases cars with manual transmissions have better fuel efficiency. You could try changing the order of the variables with which to group and observe the different results that can be obtained.

Table 1.4: Average miles per gallon in automatic (am = 0) and manual (am = 1) vehicles, with the different types of cylinders
cyl am Efficiency
4 0 22.90000
4 1 28.07500
6 0 19.12500
6 1 20.56667
8 0 15.05000
8 1 15.40000

1.3.4 mutate

This function aims to create new variables based on other variables. It is very easy to use, as an argument the name of the new variable that you want to create is used and an operation is performed with variables that are already there. For example, if we continue working with the Iris database, by creating a new variable that is the ratio between the length of the petal and the length of the sepal, the following results:

DF <- mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length)

The result of this operation is the table 1.5. The variable that has just been created will always appear at the end of the data frame.

Table 1.5: Table with ten of the observations from the new database with the new variable created with mutate
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Sepal.Ratio
5.8 4.0 1.2 0.2 setosa 0.21
4.7 3.2 1.6 0.2 setosa 0.34
5.1 3.8 1.9 0.4 setosa 0.37
5.2 2.7 3.9 1.4 versicolor 0.75
6.4 2.9 4.3 1.3 versicolor 0.67
5.5 2.5 4.0 1.3 versicolor 0.73
6.5 3.0 5.8 2.2 virginica 0.89
6.0 2.2 5.0 1.5 virginica 0.83
6.1 2.6 5.6 1.4 virginica 0.92
5.9 3.0 5.1 1.8 virginica 0.86

1.3.4.1 The pipeline in data frames

For example, we want to summarize the newly created variable of the ratio between the sepal and the petal. To do this, if starting from the original database, it would take several lines of code and the creation of multiple intermediate databases.

DF <- mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length)
BySpecies <- group_by(DF, Species)
Summary.Byspecies <- summarize(BySpecies, MEAN = mean(Petal.Sepal.Ratio),
    SD = sd(Petal.Sepal.Ratio))

Another option is to use nested parentheses, which results in the following code:

Summary.Byspecies <- summarize(group_by(mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length), Species), MEAN = mean(Petal.Sepal.Ratio), SD = sd(Petal.Sepal. Ratio))

This is further simplified by using the pipeline, which allows you to start at a Data Frame and then use the pipeline. This allows to obtain the same result as in the previous operations with the following code:

Summary.Byspecies <- iris %>%
    mutate(Petal.Sepal.Ratio = Petal.Length/Sepal.Length) %>%
    group_by(Species) %>%
    summarize(MEAN = mean(Petal.Sepal.Ratio), SD = sd(Petal.Sepal.Ratio))

These three codes are correct (table 1.6), but definitely the use of the pipeline gives the most concise and easy to interpret code without intermediate steps.

Table 1.6: Average petal-sepal ratio for the three Iris species
Species MEAN SD
setosa 0.2927557 0.0347958
versicolor 0.7177285 0.0536255
virginica 0.8437495 0.0438064

1.3.5 filter

This function allows you to select rows that meet certain conditions, such as having a value greater than a threshold or belonging to a certain class. The most typical symbols to use in this case are those seen in the table 1.7.

Table 1.7: R logical symbols and their meaning
symbol meaning cont_symbol cont_meaning
> Greater than != other than
< Less than %in% within group
== Equal to is.na is NA
>= greater than or equal to !is.na is not NA
<= less than or equal to | & or, and

For example, if you want to study the floral characteristics of plants of the genus Iris, but do not take into account the species Iris versicolor, you should use the following code:

data("iris")
DF <- iris %>%
    filter(Species != "versicolor") %>%
    group_by(Species) %>%
    summarise_all(mean)

This results in the table 1.8. In this case, the summarize_all function of summarize is introduced, which applies the function given as an argument to all variables in the database.

Table 1.8: Summary of the mean of all floral characteristics of the species Iris setosa and Iris virginica
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
virginica 6.588 2.974 5.552 2.026

On the other hand, if you want to study how many plants of each species have a petal length greater than 4 and a sepal length greater than 5, you should use the following code:

DF <- iris %>%
    filter(Petal.Length >= 4 & Sepal.Length >= 5) %>%
    group_by(Species) %>%
    summarise(N = n())

In the table table 1.9 it can be seen that with this filter all Iris setosa plants disappear from the database and that all except one Iris virginica plant have both characteristics.

Table 1.9: Number of plants of each species with a petal length greater than 4 and a sepal length greater than 5 centimeters
Species N
versicolor 39
virginica 49

1.3.6 select

This function allows you to select the variables to use since in many cases we will find databases with too many variables and therefore, we will want to reduce them to only work on a table with the necessary variables.

With select there are several ways to work, on the one hand you can write the variables that will be used, or subtract those that will not. In that sense these four codes give exactly the same result. This can be seen in the table 1.10

iris %>%
    group_by(Species) %>%
    select(Petal.Length, Petal.Width) %>%
    summarize_all(mean)
iris %>%
    group_by(Species) %>%
    select(-Sepal.Length, -Sepal.Width) %>%
    summarize_all(mean)
iris %>%
    group_by(Species) %>%
    select(contains("Petal")) %>%
    summarize_all(mean)
iris %>%
    group_by(Species) %>%
    select(-contains("Sepal")) %>%
    summarize_all(mean)
Table 1.10: Average petal length and petal width for species of the genus Iris
Species Petal.Length Petal.Width
setosa 1.462 0.246
versicolor 4.260 1.326
virginica 5.552 2.026

1.3.7 Exercises

1.3.7.1 Exercise 1

Using the storms database from the dplyr package, compute the average speed and average diameter (hu_diameter) of storms that have been declared hurricanes for each year.

1.3.7.2 Exercise 2

The ggplot2 package’s mpg database has fuel economy data in city miles per gallon (cty) for various vehicles. Obtain the data of vehicles from the year 2004 onwards that are compact and transform the efficiency Km/liter (1 mile = 1,609 km; 1 gallon = 3.78541 liters)

References

Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bulletin of the American Iris Society 59: 2–5.
Henderson, Harold V, and Paul F Velleman. 1981. “Building Multiple Regression Models Interactively.” Biometrics, 391–411.
Leek, Jeff. 2015. “The Elements of Data Analytic Style.” J. Leek.—Amazon Digital Services, Inc.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.