Chapter 2 Reproducible research
2.1 Packages needed for this chapter
For this chapter you need to have the rmarkdown, knitr, reprex and kableExtra packages installed.
This chapter will explain what reproducible research is, how to apply it using github plus the rmarkdown (Allaire et al. 2018) and knitr (Xie 2015) packages. Also, you will learn how to use tables using knitr (Xie 2015) and kableExtra (Zhu 2021). In order to use GitHub you need a GitPat, in order to get that, an easy option to generate it through the usethis package (Wickham, Bryan, and Barrett 2021). Finally you will learn how use reprex in order to generate reproducible questions in stackoverflow.
This class of the course can also be followed at this link.
2.2 Reproducible Research
Reproducible research is not the same as replicable research. Replicability implies that experiments or studies carried out under similar conditions will lead to similar conclusions. Reproducible research implies that from the same data and/or the same code the same results will be generated.
In the figure 2.1 we see the reproducibility continuum (Peng 2011). In this continuum we have the example of non-reproducibility as a post without code. It goes from less reproducible to more reproducible by the publication and the code that generated the results and graphs; followed by the publication, the code and the data that generate the results and graphs; and finally code, data and text intertwined in such a way that when running the code we get exactly the same publication that we read.
This has many advantages, including making it easier to apply the exact same methods to another database. It is enough to put the new database in the format that the author of the first publication had and we can compare the results.
Also, at a time when science is increasingly database-based, data collection and/or sampling can be put into code.
2.3 Saving our project on github
2.3.1 What is github?
Github is a kind of dropbox or google drive designed for reproducible research, where each project is a repository. Most researchers working on reproducible research leave all of their documented work in their repositories, allowing you to interact with other authors.
2.3.2 creating a github project in RStudio
To create a project on github we press start a project on the home page of our account, as we see in the figure 2.2
Then a unique name must be created, and without changing anything else press create repository on the green button as we see in the figure 2.3.
This will take you to a page where a url of your new repository will appear as in the figure 2.4
To incorporate your project into your repository, the first thing you need to do is generate a project in RStudio. For this you must go in the top menu of Rstudio to File > New Project > Git as seen in the figures 2.5 and 2.5.
Then select the location of the new project and paste the url that appears in the figure 2.4 in the space that says Repository URL:, as shown in the figure 2.7 .
When your R project is already following the changes on github, a git tab will appear in the upper right window of your RStudio session, as we see in the figure 2.8
2.3.3 The three main steps of a repository
Github is a whole world, there are many functions and there are experts in the use of github. In this course, we will focus on the 3 main steps of a repository: add, commit and push. To fully understand what each of these steps means, we have to understand that there are two repositories at all times: a local one (on your computer) and a remote one (on github.com). The first two steps add and commit only output changes to your local repository. While push saves the changes to the remote repository.
2.3.3.1 git add
This function is what adds files to your local repository. Only these files will be saved on github. Github have a repository size limit of 1 GB and files of 100 MB, since although they give you unlimited repositories, the space of each one is not, particularly in terms of databases. To add a file to your repository you just have to select the files in the git tab. By doing that a green letter A will appear instead of the two yellow question marks, as we see in the figure 2.9. In this case we only add the file Analisis.r to the repository but not the rest.
2.3.3.2 git commit
When you use the commit command you are saving the changes to the files you added to your local repository. To do this in Rstudio, in the same git tab, you must press the commit button as we see in the figure 2.10.
Pressing commit will open a pop-up window, where you will need to write a message describing what you will save. Once I’ve done that, press commit again in the popup as shown in figure 2.11.
2.3.3.3 git push
Finally, push will allow you to save changes to your remote repository, which secures your data in the cloud and also makes it available to other researchers. After pressing commit on the popup window (figure 2.11), we can press push on the green arrow on the popup window as seen in figure 2.12. Then we will be asked for our username and password, and we can check that our repository is online by entering our github session.
2.4 Reproducibility in R
There are several packages that allow us to do reproducible research in R
, but without a doubt the most relevant are rmarkdown and knitr. Both packages work together when we generate an Rmd (Rmarkdown) file, in which we use text, R code and other elements at the same time to generate a word document, pdf, web page, presentation and/or web application (fig. 2.13).
2.4.1 Creating an Rmarkdown
To create an Rmarkdown file, simply go to the menu File > New file > Rmarkdown and with that you will have created a new Rmd file. We will see some of the most typical elements of an Rmarkdown file.
2.4.1.1 Markdown
The markdown is the part of the file where we simply write text, although it has some formatting details such as generating bold text, italics, titles and subtitles.
To make text bold, it must be between two **bold**
asterisks, for text to appear italic it must be between *italic*
asterisks. Other examples are the titles of different levels, which are denoted with different numbers of #
, as well as the following 4 titles or subtitles:
subtitle 1
subtitle 2
2.4.1.2 Chunks
Chunks are one of the most important parts of an Rmarkdown. These are where the code from R (or other programming languages) is added. Which allows the product of our code not to be just a script with pasted results, but actually generated in the same document as our script. The easiest way to add a chunk is by pressing the insert chunk button in Rstudio, this button is located in the upper left window of our RStudio session, as shown in the figure 2.14
By pressing this button a space will appear, there you can add a code like the one below, and see the results below.
```{r}
library(tidyverse)
iris %>% group_by(Species) %>% summarize(Petal.Length = mean(Petal.length))
```
## # A tibble: 3 × 2
## Species Petal.Length
## <fct> <dbl>
## 1 setosa 1.46
## 2 versicolor 4.26
## 3 virginica 5.55
2.4.1.2.1 Chunk Options
There are many options for chunks, a complete documentation can be found at the following link, but here we will show the most common:
- echo = T or F show or not the code, respectively
- message = T or F displays packet messages, respectively
- warning = T or F displays warnings, respectively
- eval = T or F to evaluate or not the code, respectively
- cache = T or F save or not the result, respectively
2.4.1.3 inline code
The inline codes are useful to add some value in the text, such as the value of p or the mean. To use it, put a backtick (backsingle quote), r, the code in question and another backtick as follows `r R_code`
. We cannot put anything in an inline code, since it can only generate vectors, which often requires a lot of creativity to achieve what we want. For example if we wanted to put the average sepal length of the iris
database into an inline code we would put `r mean(iris$Sepal.Length)`
, which would result in 5.8433333. As a number with 7 significant figures would look strange in a text, we would also like to use the round
function, so that it has 2 significant figures, for that we put the following inline code `r round(mean( iris$Sepal.Length),2)`
which returns 5.84. This can be made even more complex if you want to work with a summary table. For example, if we wanted to list the average sepal size we would use summarize
from dplyr, but this would result in a data.frame, which doesn’t appear if we try to inline code. Let’s start by seeing how the code would look where we obtained the average size of the sepal.
%>%
iris group_by(Species) %>%
summarize(Mean = mean(Sepal.Length))
We would see the result of that code 2.1
Species | Mean |
---|---|
setosa | 5.006 |
versicolor | 5.936 |
virginica | 6.588 |
To remove the mean vector from this data frame we could subset it with the $
sign. So if we want to output the Mean column of the data frame we created as a vector, we would do the following `r (iris %>% group_by(Species) %>% summarize(Mean = mean(Sepal.Length )))$Mean`
. This would output 5.006, 5.936, 6.588.
2.4.2 Training
2.4.2.1 Exercise 1
Using the iris database, create an inline code that tells the average length of the petal of the species Iris virginica
2.4.2.2 Tables in Rmarkdown
The most typical function for generating tables in an rmd file is kable
from the knitr package, which in its simplest form includes a dataframe as its only argument. In addition to this, we can add some parameters such as caption, which allows us to put a title to the table or row.names, which if put as seen in the code (FALSE) will not show the names of the rows, as seen in the table 2.2.
<- iris %>%
DF group_by(Species) %>%
summarize_all(mean)
kable(DF, caption = "Average per species of all iris database variables.",
row.names = FALSE)
Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |