10/02, 2022

Welcome

For next class, you need to:

  • Create a GitHub account
  • Install the knitr, rmarkdown and kableExtra packages

Reproducible research in R

How data is organized in R

Data structures

  • Vector: A linear set of data (gene sequence, time series)
  • Matrix: A table where every value has the same type (usually numbers)
  • Data Frame: A table where each column has a data type (gold standard)
  • List: Here we can put whatever we want
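
As a minimal sketch, the four structures above can be created in base R as follows. The object names (v, m, df, l) and their contents are made up for illustration.

```r
# A vector: a linear sequence of values of a single type
v <- c(1, 4, 6, 7, 8)

# A matrix: a two-dimensional table whose values all share one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame: a table where each column is a vector with its own type
df <- data.frame(id = 1:3, name = c("a", "b", "c"), ok = c(TRUE, FALSE, TRUE))

# A list: elements can be anything, including other structures
l <- list(numbers = v, table = df, label = "anything")

# str() gives a compact description of the structure of any object
str(l)
```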

Vector

  • Linear sequence of data
  • They can be of many types (numeric, character, logical, etc.)
  • Example: data(uspop)
  • To create one: c(1,4,6,7,8)
  • To subset a vector, put the index between []
  • Examples: uspop[4], uspop[2:10], uspop[c(3,5,8)]
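
The indexing rules above can be tried on a small vector. This is only a sketch; x is an illustrative name, and the last two lines (logical and negative indices) go slightly beyond the bullets but follow the same [] syntax.

```r
x <- c(1, 4, 6, 7, 8)

x[4]        # a single element by position (R counts from 1)
x[2:3]      # a range of positions
x[c(1, 5)]  # arbitrary positions
x[x > 5]    # a logical condition keeps the elements where it is TRUE
x[-1]       # a negative index drops that position
```

The same expressions work on uspop after calling data(uspop).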

Data Frame

  • A table where each column has a data type (numeric, logical, etc.)
  • Each column is a vector
  • Example: data(iris)
  • To subset: data.frame[rows,columns]
  • Examples: iris[,3], iris["Petal.Length"], iris[2:5,c(1,5)], iris$Petal.Length
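
The subsetting forms above do not all return the same kind of object, which the following sketch makes visible. The variable names are illustrative.

```r
data(iris)

by_position <- iris[, 3]             # third column as a plain numeric vector
by_name_df  <- iris["Petal.Length"]  # a data frame with a single column
by_dollar   <- iris$Petal.Length     # the same vector, selected by name
block       <- iris[2:5, c(1, 5)]    # rows 2 to 5, columns 1 and 5

class(by_position)
class(by_name_df)
identical(by_position, by_dollar)
dim(block)
```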

Tidy data principles

Tidy Data

  • Each column a variable
  • Each row one observation

untidy data

  • Contingency tables
  • Example: data(HairEyeColor), shown here for males (rows are hair color, columns are eye color)
Hair   Brown  Blue  Hazel  Green
Black     32    11     10      3
Brown     53    50     25     15
Red       10    10      7      7
Blond      3    30      5      8

Tidy option

Hair Eye Sex Freq
Black Brown Male 32
Brown Brown Male 53
Red Brown Male 10
Blond Brown Male 3
Black Blue Male 11
Brown Blue Male 50
Red Blue Male 10
Blond Blue Male 30
Black Hazel Male 10
Brown Hazel Male 25
Red Hazel Male 7
Blond Hazel Male 5
Black Green Male 3
Brown Green Male 15
Red Green Male 7
Blond Green Male 8
Black Brown Female 36
Brown Brown Female 66
Red Brown Female 16
Blond Brown Female 4
Black Blue Female 9
Brown Blue Female 34
Red Blue Female 7
Blond Blue Female 64
Black Hazel Female 5
Brown Hazel Female 29
Red Hazel Female 7
Blond Hazel Female 5
Black Green Female 2
Brown Green Female 14
Red Green Female 7
Blond Green Female 8
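
One way to get from the contingency table to the tidy layout is base R's as.data.frame(), which has a method for table objects. This is a sketch of that route, not necessarily how the table above was produced; Tidy is an illustrative name.

```r
data(HairEyeColor)

class(HairEyeColor)  # a 3-dimensional "table": Hair x Eye x Sex

# One row per combination of Hair, Eye and Sex, with the count in Freq
Tidy <- as.data.frame(HairEyeColor)
head(Tidy, 4)
```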

Let's work with tidy data

dplyr

  • A package with a few very powerful functions to wrangle data
  • Part of the tidyverse
  • group_by (groups data)
  • summarize (computes summary statistics)
  • filter (finds rows that meet certain conditions)
  • select, together with starts_with, ends_with or contains (chooses columns)
  • mutate (generates new variables)
  • %>% (the pipe, chains operations)
  • arrange (sorts rows)
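
As a brief sketch of arrange, the call below sorts iris by petal length, ascending by default and descending when the column is wrapped in desc(). The object names are illustrative.

```r
library(dplyr)
data(iris)

Sorted     <- arrange(iris, Petal.Length)        # ascending order
SortedDesc <- arrange(iris, desc(Petal.Length))  # descending order

head(SortedDesc, 3)
```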

summarize and group_by

  • group_by groups observations according to a variable
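  • summarize collapses the observations into summary statistics (one row per group, or a single row if the data is not grouped)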
library(tidyverse)
Summary.Petal <- summarize(iris, Mean.Petal.Length = mean(Petal.Length),
    SD.Petal.Length = sd(Petal.Length))
Mean.Petal.Length SD.Petal.Length
3.758 1.765298

summarize and group_by (continued)

Summary.Petal <- group_by(iris, Species)
Summary.Petal <- summarize(Summary.Petal, Mean.Petal.Length = mean(Petal.Length),
    SD.Petal.Length = sd(Petal.Length))
Species Mean.Petal.Length SD.Petal.Length
setosa 1.462 0.1736640
versicolor 4.260 0.4699110
virginica 5.552 0.5518947

summarize and group_by (continued)

  • Can group by more than one variable at a time
data("mtcars")
Mtcars2 <- group_by(mtcars, am, cyl)
Consumo <- summarize(Mtcars2, Average_MPG = mean(mpg),
    desv = sd(mpg))
am cyl Average_MPG desv
0 4 22.90000 1.4525839
0 6 19.12500 1.6317169
0 8 15.05000 2.7743959
1 4 28.07500 4.4838599
1 6 20.56667 0.7505553
1 8 15.40000 0.5656854
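
One detail worth knowing when grouping by several variables: summarize removes only the last grouping level, so the result above is still grouped by am. The sketch below assumes dplyr 1.0.0 or later, where the .groups argument controls this; Grouped and Ungrouped are illustrative names.

```r
library(dplyr)
data("mtcars")

Mtcars2 <- group_by(mtcars, am, cyl)

# Default behaviour: the result is still grouped by am
Grouped <- summarize(Mtcars2, Average_MPG = mean(mpg))
group_vars(Grouped)

# .groups = "drop" returns a table with no grouping left
Ungrouped <- summarize(Mtcars2, Average_MPG = mean(mpg), .groups = "drop")
group_vars(Ungrouped)
```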

Questions?

mutate

  • Creates new variables
DF <- mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Sepal.Ratio
5.8 4.0 1.2 0.2 setosa 0.21
4.7 3.2 1.6 0.2 setosa 0.34
5.1 3.8 1.9 0.4 setosa 0.37
5.2 2.7 3.9 1.4 versicolor 0.75
6.4 2.9 4.3 1.3 versicolor 0.67
5.5 2.5 4.0 1.3 versicolor 0.73
6.5 3.0 5.8 2.2 virginica 0.89
6.0 2.2 5.0 1.5 virginica 0.83
6.1 2.6 5.6 1.4 virginica 0.92
5.9 3.0 5.1 1.8 virginica 0.86

Pipeline (%>%)

  • To perform multiple operations sequentially
  • without resorting to nested parentheses
  • or overwriting multiple data sets
x <- c(1, 4, 6, 8)
y <- round(mean(sqrt(log(x))), 2)
  • What did we do there?
x <- c(1, 4, 6, 8)
y <- x %>%
    log() %>%
    sqrt() %>%
    mean() %>%
    round(2)
## [1] 0.99

Pipeline (%>%)

  • Without the pipe: a lot of intermediate objects
DF <- mutate(iris, Petal.Sepal.Ratio = Petal.Length/Sepal.Length)
BySpecies <- group_by(DF, Species)
Summary.Byspecies <- summarize(BySpecies, MEAN = mean(Petal.Sepal.Ratio),
    SD = sd(Petal.Sepal.Ratio))
Species MEAN SD
setosa 0.2927557 0.0347958
versicolor 0.7177285 0.0536255
virginica 0.8437495 0.0438064

Pipeline (%>%)

  • With nested parentheses: no intermediate objects, but hard to read
Summary.Byspecies <- summarize(group_by(mutate(iris,
    Petal.Sepal.Ratio = Petal.Length/Sepal.Length),
    Species), MEAN = mean(Petal.Sepal.Ratio), SD = sd(Petal.Sepal.Ratio))
Species MEAN SD
setosa 0.2927557 0.0347958
versicolor 0.7177285 0.0536255
virginica 0.8437495 0.0438064
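
For comparison, the same computation can be written as a chain with %>%, one step per line and read from top to bottom. This is a sketch that reuses the names from the code above.

```r
library(dplyr)
data(iris)

Summary.Byspecies <- iris %>%
    mutate(Petal.Sepal.Ratio = Petal.Length/Sepal.Length) %>%  # new variable
    group_by(Species) %>%                                      # one group per species
    summarize(MEAN = mean(Petal.Sepal.Ratio),                  # one row per group
        SD = sd(Petal.Sepal.Ratio))

Summary.Byspecies
```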

Pipeline (%>%) another example

library(tidyverse)
MEAN <- iris %>%
    group_by(Species) %>%
    summarize_all(.funs = list(Mean = mean, SD = sd))
Species Sepal.Length_Mean Sepal.Width_Mean Petal.Length_Mean Petal.Width_Mean Sepal.Length_SD Sepal.Width_SD Petal.Length_SD Petal.Width_SD
setosa 5.006 3.428 1.462 0.246 0.3524897 0.3790644 0.1736640 0.1053856
versicolor 5.936 2.770 4.260 1.326 0.5161711 0.3137983 0.4699110 0.1977527
virginica 6.588 2.974 5.552 2.026 0.6358796 0.3224966 0.5518947 0.2746501
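
In dplyr 1.0.0 and later, summarize_all still works but is superseded by across(). A sketch of the equivalent call, assuming a recent dplyr, is shown below. The values are the same, but the columns come out grouped by variable (Sepal.Length_Mean, Sepal.Length_SD, ...) rather than by function.

```r
library(dplyr)
data(iris)

MEAN <- iris %>%
    group_by(Species) %>%
    # everything() covers all columns except the grouping variable
    summarize(across(everything(), list(Mean = mean, SD = sd)))

names(MEAN)
```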

More questions?

Filter

  • Select according to one or more variables
Symbol  Meaning                     Symbol   Meaning
>       Greater than                !=       Not equal to
<       Less than                   %in%     Within the group
==      Equal to                    is.na    Is NA
>=      Greater than or equal to    !is.na   Is not NA
<=      Less than or equal to       | &      Or, and
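
A short sketch of the less familiar operators from the table: %in%, !is.na and |. It uses iris and the built-in airquality data set, which contains missing values in its Ozone column. The object names and the thresholds are illustrative.

```r
library(dplyr)
data(iris)
data(airquality)

# %in%: keep rows whose value belongs to a set
TwoSpecies <- iris %>%
    filter(Species %in% c("setosa", "virginica"))

# !is.na(): keep rows where the value is not missing
NoMissing <- airquality %>%
    filter(!is.na(Ozone))

# |: keep rows that meet at least one of the conditions
Extremes <- iris %>%
    filter(Petal.Length < 1.2 | Petal.Length > 6.5)
```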

Example of filter added to what we did before

data("iris")
DF <- iris %>%
    filter(Species != "versicolor") %>%
    group_by(Species) %>%
    summarise_all(mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
virginica 6.588 2.974 5.552 2.026

Examples of filter

DF <- iris %>%
    filter(Petal.Length >= 4 & Sepal.Length >= 5) %>%
    group_by(Species) %>%
    summarise(N = n())
Species N
versicolor 39
virginica 49

More than one function

data("iris")
DF <- iris %>%
    filter(Species != "versicolor") %>%
    group_by(Species) %>%
    summarise_all(.funs = list(Mean = mean, SD = sd))
Species Sepal.Length_Mean Sepal.Width_Mean Petal.Length_Mean Petal.Width_Mean Sepal.Length_SD Sepal.Width_SD Petal.Length_SD Petal.Width_SD
setosa 5.006 3.428 1.462 0.246 0.3524897 0.3790644 0.1736640 0.1053856
virginica 6.588 2.974 5.552 2.026 0.6358796 0.3224966 0.5518947 0.2746501

Select

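  • Selects columns from a data frame, or drops them
# Keep columns by name (the grouping variable Species is kept automatically)
iris %>%
    group_by(Species) %>%
    select(Petal.Length, Petal.Width) %>%
    summarize_all(mean)
# Drop columns by name with a minus sign
iris %>%
    group_by(Species) %>%
    select(-Sepal.Length, -Sepal.Width) %>%
    summarize_all(mean)
# Keep the columns whose name contains "Petal"
iris %>%
    group_by(Species) %>%
    select(contains("Petal")) %>%
    summarize_all(mean)
# Drop the columns whose name contains "Sepal"
iris %>%
    group_by(Species) %>%
    select(-contains("Sepal")) %>%
    summarize_all(mean)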
Species Petal.Length Petal.Width
setosa 1.462 0.246
versicolor 4.260 1.326
virginica 5.552 2.026
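
The other two helpers mentioned earlier, starts_with and ends_with, work the same way as contains. A minimal sketch with illustrative object names:

```r
library(dplyr)
data(iris)

# Columns whose name begins with "Petal"
Petals <- iris %>%
    select(starts_with("Petal"))

# Columns whose name ends with "Width"
Widths <- iris %>%
    select(ends_with("Width"))

names(Petals)
names(Widths)
```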

Exercises

Active_Cases <- read_csv("https://raw.githubusercontent.com/MinCiencia/Datos-COVID19/master/output/producto19/CasosActivosPorComuna_std.csv")

Using the database from the repository of Chile's Ministry of Science, generate a data frame that answers each of the following:

  • What proportion of the communes have had more than 50 cases per 100,000 inhabitants at some point?
  • Generate a data frame showing, for each commune that has had over 50 cases per 100,000 inhabitants, how many days it has been over that value.
  • Generate a table of the communes that have had over 50 cases per 100,000 inhabitants and, for those communes, create a variable holding the maximum prevalence of that commune.

Bonus (this requires some research; what we have learned is not enough)

  • Find the 10 communes with the highest median prevalence and, for each of them, generate a table with the median prevalence, the maximum prevalence and the date on which the maximum was reached.