Chapter 2 Reproducible research

2.1 Packages needed for this chapter

For this chapter you need to have the rmarkdown, knitr, reprex and kableExtra packages installed.

This chapter will explain what reproducible research is, how to apply it using github plus the rmarkdown (Allaire et al. 2018) and knitr (Xie 2015) packages. Also, you will learn how to use tables using knitr (Xie 2015) and kableExtra (Zhu 2021). In order to use GitHub you need a GitPat, in order to get that, an easy option to generate it through the usethis package (Wickham, Bryan, and Barrett 2021). Finally you will learn how use reprex in order to generate reproducible questions in stackoverflow.

This class of the course can also be followed at this link.

2.2 Reproducible Research

Reproducible research is not the same as replicable research. Replicability implies that experiments or studies carried out under similar conditions will lead to similar conclusions. Reproducible research implies that from the same data and/or the same code the same results will be generated.

Continuum of reproducibility (extracted from Peng 2011)

Figure 2.1: Continuum of reproducibility (extracted from Peng 2011)

In the figure 2.1 we see the reproducibility continuum (Peng 2011). In this continuum we have the example of non-reproducibility as a post without code. It goes from less reproducible to more reproducible by the publication and the code that generated the results and graphs; followed by the publication, the code and the data that generate the results and graphs; and finally code, data and text intertwined in such a way that when running the code we get exactly the same publication that we read.

This has many advantages, including making it easier to apply the exact same methods to another database. It is enough to put the new database in the format that the author of the first publication had and we can compare the results.

Also, at a time when science is increasingly database-based, data collection and/or sampling can be put into code.

2.3 Saving our project on github

2.3.1 What is github?

Github is a kind of dropbox or google drive designed for reproducible research, where each project is a repository. Most researchers working on reproducible research leave all of their documented work in their repositories, allowing you to interact with other authors.

2.3.2 creating a github project in RStudio

To create a project on github we press start a project on the home page of our account, as we see in the figure 2.2

To start a project on github, you must press Start a project on your home page

Figure 2.2: To start a project on github, you must press Start a project on your home page

Then a unique name must be created, and without changing anything else press create repository on the green button as we see in the figure 2.3.

Create the name of your repository and press the button create repository

Figure 2.3: Create the name of your repository and press the button create repository

This will take you to a page where a url of your new repository will appear as in the figure 2.4

The content of the box that says ssh is the url of your repository

Figure 2.4: The content of the box that says ssh is the url of your repository

To incorporate your project into your repository, the first thing you need to do is generate a project in RStudio. For this you must go in the top menu of Rstudio to File > New Project > Git as seen in the figures 2.5 and 2.5.

Menu to create a new project

Figure 2.5: Menu to create a new project

Select git within the options

Figure 2.6: Select git within the options

Then select the location of the new project and paste the url that appears in the figure 2.4 in the space that says Repository URL:, as shown in the figure 2.7 .

Paste the repository url into the Repository URL: dialog

Figure 2.7: Paste the repository url into the Repository URL: dialog

When your R project is already following the changes on github, a git tab will appear in the upper right window of your RStudio session, as we see in the figure 2.8

Al incluir tu repositorio en tu sesión de Rstudio, aparecera la pestaña git en la ventana superior derecha

Figure 2.8: Al incluir tu repositorio en tu sesión de Rstudio, aparecera la pestaña git en la ventana superior derecha

2.3.3 The three main steps of a repository

Github is a whole world, there are many functions and there are experts in the use of github. In this course, we will focus on the 3 main steps of a repository: add, commit and push. To fully understand what each of these steps means, we have to understand that there are two repositories at all times: a local one (on your computer) and a remote one (on github.com). The first two steps add and commit only output changes to your local repository. While push saves the changes to the remote repository.

2.3.3.1 git add

This function is what adds files to your local repository. Only these files will be saved on github. Github have a repository size limit of 1 GB and files of 100 MB, since although they give you unlimited repositories, the space of each one is not, particularly in terms of databases. To add a file to your repository you just have to select the files in the git tab. By doing that a green letter A will appear instead of the two yellow question marks, as we see in the figure 2.9. In this case we only add the file Analisis.r to the repository but not the rest.

When including your repository in your Rstudio session, the git tab will appear in the upper right window

Figure 2.9: When including your repository in your Rstudio session, the git tab will appear in the upper right window

2.3.3.2 git commit

When you use the commit command you are saving the changes to the files you added to your local repository. To do this in Rstudio, in the same git tab, you must press the commit button as we see in the figure 2.10.

To save changes to your repository, press commit in the git tab in the upper right window

Figure 2.10: To save changes to your repository, press commit in the git tab in the upper right window

Pressing commit will open a pop-up window, where you will need to write a message describing what you will save. Once I’ve done that, press commit again in the popup as shown in figure 2.11.

Write a message that remembers the changes you made to the popup

Figure 2.11: Write a message that remembers the changes you made to the popup

2.3.3.3 git push

Finally, push will allow you to save changes to your remote repository, which secures your data in the cloud and also makes it available to other researchers. After pressing commit on the popup window (figure 2.11), we can press push on the green arrow on the popup window as seen in figure 2.12. Then we will be asked for our username and password, and we can check that our repository is online by entering our github session.

To save to the remote repository press push on the popup window

Figure 2.12: To save to the remote repository press push on the popup window

2.4 Reproducibility in R

There are several packages that allow us to do reproducible research in R, but without a doubt the most relevant are rmarkdown and knitr. Both packages work together when we generate an Rmd (Rmarkdown) file, in which we use text, R code and other elements at the same time to generate a word document, pdf, web page, presentation and/or web application (fig. 2.13).

The goal of Rmarkdown is to merge r code with text and data to generate a reproducible document

Figure 2.13: The goal of Rmarkdown is to merge r code with text and data to generate a reproducible document

2.4.1 Creating an Rmarkdown

To create an Rmarkdown file, simply go to the menu File > New file > Rmarkdown and with that you will have created a new Rmd file. We will see some of the most typical elements of an Rmarkdown file.

2.4.1.1 Markdown

The markdown is the part of the file where we simply write text, although it has some formatting details such as generating bold text, italics, titles and subtitles.

To make text bold, it must be between two **bold** asterisks, for text to appear italic it must be between *italic* asterisks. Other examples are the titles of different levels, which are denoted with different numbers of #, as well as the following 4 titles or subtitles:

subtitle 1

subtitle 2

subtitle 3

subtitle 4

it would look like this in code

## subtitle 1

### subtitle 2

#### subtitle 3

##### subtitle 4

2.4.1.2 Chunks

Chunks are one of the most important parts of an Rmarkdown. These are where the code from R (or other programming languages) is added. Which allows the product of our code not to be just a script with pasted results, but actually generated in the same document as our script. The easiest way to add a chunk is by pressing the insert chunk button in Rstudio, this button is located in the upper left window of our RStudio session, as shown in the figure 2.14

When pressing the insert chunk button, a space will appear in which to insert code

Figure 2.14: When pressing the insert chunk button, a space will appear in which to insert code

By pressing this button a space will appear, there you can add a code like the one below, and see the results below.

```{r}
library(tidyverse)
iris %>% group_by(Species) %>% summarize(Petal.Length = mean(Petal.length))
```
## # A tibble: 3 × 2
##   Species    Petal.Length
##   <fct>             <dbl>
## 1 setosa             1.46
## 2 versicolor         4.26
## 3 virginica          5.55
2.4.1.2.1 Chunk Options

There are many options for chunks, a complete documentation can be found at the following link, but here we will show the most common:

  • echo = T or F show or not the code, respectively
  • message = T or F displays packet messages, respectively
  • warning = T or F displays warnings, respectively
  • eval = T or F to evaluate or not the code, respectively
  • cache = T or F save or not the result, respectively

2.4.1.3 inline code

The inline codes are useful to add some value in the text, such as the value of p or the mean. To use it, put a backtick (backsingle quote), r, the code in question and another backtick as follows `r R_code`. We cannot put anything in an inline code, since it can only generate vectors, which often requires a lot of creativity to achieve what we want. For example if we wanted to put the average sepal length of the iris database into an inline code we would put `r mean(iris$Sepal.Length)`, which would result in 5.8433333. As a number with 7 significant figures would look strange in a text, we would also like to use the round function, so that it has 2 significant figures, for that we put the following inline code `r round(mean( iris$Sepal.Length),2)` which returns 5.84. This can be made even more complex if you want to work with a summary table. For example, if we wanted to list the average sepal size we would use summarize from dplyr, but this would result in a data.frame, which doesn’t appear if we try to inline code. Let’s start by seeing how the code would look where we obtained the average size of the sepal.

iris %>%
    group_by(Species) %>%
    summarize(Mean = mean(Sepal.Length))

We would see the result of that code 2.1

Table 2.1: Summary of the average length of the sepal of the flowers of the genus Iris.
Species Mean
setosa 5.006
versicolor 5.936
virginica 6.588

To remove the mean vector from this data frame we could subset it with the $ sign. So if we want to output the Mean column of the data frame we created as a vector, we would do the following `r (iris %>% group_by(Species) %>% summarize(Mean = mean(Sepal.Length )))$Mean`. This would output 5.006, 5.936, 6.588.

2.4.2 Training

2.4.2.1 Exercise 1

Using the iris database, create an inline code that tells the average length of the petal of the species Iris virginica

2.4.2.2 Tables in Rmarkdown

The most typical function for generating tables in an rmd file is kable from the knitr package, which in its simplest form includes a dataframe as its only argument. In addition to this, we can add some parameters such as caption, which allows us to put a title to the table or row.names, which if put as seen in the code (FALSE) will not show the names of the rows, as seen in the table 2.2.

DF <- iris %>%
    group_by(Species) %>%
    summarize_all(mean)
kable(DF, caption = "Average per species of all iris database variables.",
    row.names = FALSE)
Table 2.2: Average per species of all iris database variables.
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

References

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, and Winston Chang. 2018. Rmarkdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.
Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27.
Wickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2021. Usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. http://yihui.name/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.