Chapter 4 Models in R

4.1 Packages required for this chapter

For this chapter you need to have the tidyverse, broom and MuMIn packages installed.

This chapter will explain how to generate models in R, how to obtain information and tables from models with the Broom package (Robinson and Hayes 2018) and a brief introduction to model selection with the MuMIn package (Barton 2018 )

This class of the course can also be followed at this link.

4.2 Statistical models

A statistical model tries to explain the causes of an event based on a sample of the total population. The assumption is that if the sample we obtain from the population is representative of it, we will be able to infer the causes of the population variation by measuring explanatory variables. In general, we have a response variable (the phenomenon we want to explain), and one or more explanatory variables that would deterministically generate part of the variability in the response variable.

4.2.1 Example

Let’s take the example of the CO2 database present in R (Potvin, Lechowicz, and Tardif 1990). Suppose we are interested in knowing what factors affect the uptake of \(CO_2\) in plants.

Table 4.1: First 20 observations of the CO2 database.
Plant Type Treatment conc uptake
Qn1 Quebec nonchilled 95 16.0
Qn1 Quebec nonchilled 175 30.4
Qn1 Quebec nonchilled 250 34.8
Qn1 Quebec nonchilled 350 37.2
Qn1 Quebec nonchilled 500 35.3
Qn1 Quebec nonchilled 675 39.2
Qn1 Quebec nonchilled 1000 39.7
Qn2 Quebec nonchilled 95 13.6
Qn2 Quebec nonchilled 175 27.3
Qn2 Quebec nonchilled 250 37.1
Qn2 Quebec nonchilled 350 41.8
Qn2 Quebec nonchilled 500 40.6
Qn2 Quebec nonchilled 675 41.4
Qn2 Quebec nonchilled 1000 44.3
Qn3 Quebec nonchilled 95 16.2
Qn3 Quebec nonchilled 175 32.4
Qn3 Quebec nonchilled 250 40.3
Qn3 Quebec nonchilled 350 42.1
Qn3 Quebec nonchilled 500 42.9
Qn3 Quebec nonchilled 675 43.9

In the table 4.1 we see the first 20 observations of this database. We see that within the factors we have to explain the capture of \(CO_2\) are:

  • Type: Subspecies of plant (Mississippi or Quebec)
  • Treatment: Plant treatment, chilled or nonchilled
  • conc: Environmental concentration of \(CO_2\), in mL/L.

A possible explanation that would allow us to try to explain this phenomenon is that the plants of different subspecies will have different uptake of \(CO_2\), which we explore in the graph 4.1:

CO2 uptake by plants dependent on their subspecies

Figure 4.1: CO2 uptake by plants dependent on their subspecies

We see that there is a tendency for plants originating in Quebec to capture more \(CO_2\) than those from the Mississippi, but can we effectively say that both populations have different means? That’s where models come in.

4.2.2 Representing a model in R

In R most models are represented with the following code:

some_function(Y ~ X1 + X2 + ... + Xn, data = data.frame)

In this model, we have the response variable Y, which can be explained by one or multiple explanatory variables X, that is why the symbol ~ reads explained by, where what is on its left is the response variable and to the right the explanatory variable. The data is in a data frame and finally we will use some function, which will identify some model. Some of these functions are found in the table 4.2

Table 4.2: Some models that we can generate in R
Modelos Funcion
t-test t.test()
ANOVA aov()
Linear model lm()
Generalized linear model glm()
Generalized aditive model gam()
non-linear model nls()
Mixed effect models lmer()
Boosted regression trees gbm()

4.2.3 Let’s go back to the example of plants

For this example we will use a simple linear model, for this following the table 4.2 lets use the lm function:

Fit1 <- lm(uptake ~ Type, data = CO2)

4.2.3.1 Using broom to get more out of your model

the broom package (Robinson and Hayes 2018) is a package adjacent to the tidyverse (so you have to load it separately from the tidyverse), which allows us to take information from models generated in tidy format. Today we will look at 3 broom functions, these are glance, tidy and augment.

4.2.3.1.1 glance

The glance function will give us general information about the model, such as the p value, the \(R^2\), log-likelihood, degrees of freedom, and/or other parameters depending on the model to be used. This information is delivered to us in a data frame format, as we see in the following code and in the table 4.3

library(broom)
glance(Fit1)
Table 4.3: Model information fi1 delivered by the glance function
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.346713 0.3387461 8.794012 43.5191 0 1 -300.8007 607.6014 614.8939 6341.441 82 84
4.2.3.1.2 tidy

the tidy function will give us information about the model parameters, that is the intercept, the slope and/or interactions, as we see in the following code and in the table 4.4

tidy(Fit1)
Table 4.4: Model information fit1 delivered by the glance function
term estimate std.error statistic p.value
(Intercept) 33.54286 1.356945 24.719384 0
TypeMississippi -12.65952 1.919011 -6.596901 0
4.2.3.1.3 augment

The augment function will give us, for each observation of our model, several important parameters such as the predicted value, the residuals, the cook distance, among others, this mainly serves us to study the assumptions of our model. Below we see the use of the augment function and 20 of its observations in the table 4.5

augment(Fit1)
Table 4.5: Fit1 model information returned by the augment function
uptake Type .fitted .resid .hat .sigma .cooksd .std.resid
28.5 Mississippi 20.88333 7.6166667 0.0238095 8.806572 0.0093715 0.8766185
15.1 Quebec 33.54286 -18.4428571 0.0238095 8.601612 0.0549457 -2.1226279
41.4 Quebec 33.54286 7.8571429 0.0238095 8.803900 0.0099726 0.9042954
19.4 Mississippi 20.88333 -1.4833333 0.0238095 8.846557 0.0003554 -0.1707200
37.2 Quebec 33.54286 3.6571429 0.0238095 8.838566 0.0021605 0.4209084
11.3 Mississippi 20.88333 -9.5833333 0.0238095 8.782250 0.0148358 -1.1029663
34.8 Quebec 33.54286 1.2571429 0.0238095 8.847000 0.0002553 0.1446873
27.8 Mississippi 20.88333 6.9166667 0.0238095 8.813874 0.0077281 0.7960540
7.7 Mississippi 20.88333 -13.1833333 0.0238095 8.723037 0.0280755 -1.5172980
38.7 Quebec 33.54286 5.1571429 0.0238095 8.829102 0.0042963 0.5935466
41.8 Quebec 33.54286 8.2571429 0.0238095 8.799269 0.0110138 0.9503322
30.4 Quebec 33.54286 -3.1428571 0.0238095 8.841068 0.0015956 -0.3617181
30.9 Mississippi 20.88333 10.0166667 0.0238095 8.776132 0.0162078 1.1528396
30.0 Mississippi 20.88333 9.1166667 0.0238095 8.788531 0.0134261 1.0492567
34.6 Quebec 33.54286 1.0571429 0.0238095 8.847331 0.0001805 0.1216688
19.9 Mississippi 20.88333 -0.9833333 0.0238095 8.847439 0.0001562 -0.1131739
42.9 Quebec 33.54286 9.3571429 0.0238095 8.785334 0.0141437 1.0769336
32.4 Mississippi 20.88333 11.5166667 0.0238095 8.752828 0.0214255 1.3254778
35.0 Quebec 33.54286 1.4571429 0.0238095 8.846612 0.0003430 0.1677057
18.1 Mississippi 20.88333 -2.7833333 0.0238095 8.842591 0.0012514 -0.3203398

4.2.3.2 Model selection using broom and the AIC

The AIC, or Akaike Information Criterion (Aho, Derryberry, and Peterson 2014), is a measure of how much information a model gives us given its complexity. This last measure from the number of parameters it has. The lower the AIC, the comparatively better a model is, and in general, a model that is two AIC units lower than another model will be considered a model that is significantly better than another.

The Akaike selection criterion formula is the one we see in the equation (4.1).

\[\begin{equation} AIC = 2 K - 2 \ln{(\hat{L})} \tag{4.1} \end{equation}\]

Where \(K\) is the number of parameters, which we can see with tidy, if we look at the table 4.4, we see that the model Fit1 has 2 parameters, that is \(K\) is equal to 2 .

The log-likelihood of the model (\(\ln{(\hat{L})}\)) is the fit it has to the data. The more positive this value is, the better the model fits the data, and the more negative it is, the less it fits the data. In our model, using glance, we can see that the model’s log-likelyhood value is -300.8 (see table 4.4).

Therefore, substituting the equation (4.1), we obtain 605.6, which is a value very close to the 608, which appear in the model’s glance (table 4.4).

4.2.3.2.1 Candidate Models

Let’s look at the figure 4.2. to think about what might be interesting models to explore.

ggplot(CO2, aes(x = conc, y = uptake)) + geom_point(aes(color = Type,
    shape = Treatment), size = 3)
Exploratory graph to generate models from the CO2 database

Figure 4.2: Exploratory graph to generate models from the CO2 database

Aho, Ken, DeWayne Derryberry, and Teri Peterson. 2014. “Model Selection for Ecologists: The Worldviews of AIC and BIC.” Ecology 95 (3): 631–36.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, and Winston Chang. 2018. Rmarkdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.
Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bulletin of the American Iris Society 59: 2–5.
Barton, Kamil. 2018. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.
Henderson, Harold V, and Paul F Velleman. 1981. “Building Multiple Regression Models Interactively.” Biometrics, 391–411.
Kross, Sean, Nick Carchedi, Bill Bauer, and Gina Grdina. 2017. Swirl: Learn r, in r. https://CRAN.R-project.org/package=swirl.
Leek, Jeff. 2015. “The Elements of Data Analytic Style.” J. Leek.—Amazon Digital Services, Inc.
Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27.
Potvin, Catherine, Martin J Lechowicz, and Serge Tardif. 1990. “The Statistical Analysis of Ecophysiological Response Curves Obtained from Experiments Involving Repeated Measures.” Ecology 71 (4): 1389–1400.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David, and Alex Hayes. 2018. Broom: Convert Statistical Analysis Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
Wickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2021. Usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. http://yihui.name/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

References

Aho, Ken, DeWayne Derryberry, and Teri Peterson. 2014. “Model Selection for Ecologists: The Worldviews of AIC and BIC.” Ecology 95 (3): 631–36.
Barton, Kamil. 2018. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.
Potvin, Catherine, Martin J Lechowicz, and Serge Tardif. 1990. “The Statistical Analysis of Ecophysiological Response Curves Obtained from Experiments Involving Repeated Measures.” Ecology 71 (4): 1389–1400.
Robinson, David, and Alex Hayes. 2018. Broom: Convert Statistical Analysis Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.