4 Tidy Evaluation

Created: 2017-06-24
Updated: 2023-07-29

4.1 ⭐️Overview

This chapter is about tidy evaluation. Tidy evaluation has gone through several fairly significant changes since I first wrote these notes.

Rlang 0.3.1 replaced quo_name() with as_label() and as_name().

Rlang 0.4.0 added the curly-curly ({{ }}) syntax.

4.3 📦Load packages

library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)
library(glue)

4.4 🔢Simulate data

Here we simulate a small dataset that is intended to be representative of data from a research study.

set.seed(123)
study <- tibble(
  id     = as.character(seq(1001, 1020, 1)),
  sex    = factor(sample(c("Female", "Male"), 20, TRUE)),
  date   = sample(seq.Date(as.Date("2021-09-15"), as.Date("2021-10-26"), "day"), 20, TRUE),
  days   = sample(1L:21L, 20L, TRUE),
  height = rnorm(20, 71, 10)
)

# Add missing values for testing
study$id[3] <- NA
study$sex[4] <- NA
study$date[5] <- NA
study$days[6] <- NA
study$height[7] <- NA

4.5 Load Starwars data

We will also load the mtcars and starwars data used in many of the Tidyverse examples.

data(starwars)
data(mtcars)

4.6 Why use tidy evaluation?

The short answer is because of data masking, which is easier to see than describe.

# Unmasked programming
mean(mtcars$cyl + mtcars$am)
## [1] 6.59375
# Referring to columns is an error - Where is the data?
mean(cyl + am)
## Error in mean(cyl + am): object 'cyl' not found
# Data-masking (using base R)
with(mtcars, mean(cyl + am))
## [1] 6.59375
# Data-masking (using dplyr)
summarise(mtcars, mean(cyl + am))
##   mean(cyl + am)
## 1        6.59375
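
To make the idea of a data mask a little more concrete, here is a minimal sketch that does the masking by hand with rlang::eval_tidy(). The object names defused and masked_mean are just illustrative; as I understand it, this is roughly what data-masking functions do under the hood.

```r
library(rlang)

# Defuse the expression so that it is not evaluated immediately
defused <- quote(mean(cyl + am))

# Evaluate the defused expression with mtcars acting as the data mask,
# so that `cyl` and `am` are looked up as columns of mtcars
masked_mean <- eval_tidy(defused, data = mtcars)
masked_mean
```

The result should match mean(mtcars$cyl + mtcars$am) from the unmasked example above.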

4.7 What is tidy evaluation?

Tidy Eval is a system for programming (i.e., writing new functions), as opposed to working interactively with dplyr.

While data-masking makes it easy to program interactively with data frames, it makes it harder to create functions. Passing data-masked arguments to functions requires injection with the embracing operator {{ or, in more complex cases, the injection operator !!. rlang documentation

So, tidy evaluation, operationalized primarily through the rlang package, is shorthand for a set of tools that allows us to more easily use data masking in the functions that we write. It also allows us to use other functions that use data masking (e.g., dplyr functions) inside the functions that we write.

4.8 Vocabulary

The old vocabulary was heavily centered around quasiquotation. It appears as though the rlang team is moving towards using the terms defusing, embracing, and injecting.

Injection (also known as quasiquotation) is a metaprogramming feature that allows you to modify parts of a program. This is needed because under the hood data-masking works by defusing R code to prevent its immediate evaluation. The defused code is resumed later on in a context where data frame columns are defined. rlang documentation

One purpose for defusing evaluation of an expression is to interface with data-masking functions by injecting the expression back into another function with !!. This is the defuse-and-inject pattern. rlang documentation

The defuse-and-inject pattern

my_summarise <- function(data, arg) {
  # Defuse the user expression in `arg`
  arg <- enquo(arg)

  # Inject the expression contained in `arg`
  # inside a `summarise()` argument
  data |> dplyr::summarise(mean = mean(!!arg, na.rm = TRUE))
}

Defuse-and-inject is usually performed in a single step with the embrace operator {{.

my_summarise <- function(data, arg) {
  # Defuse and inject in a single step with the embracing operator
  data |> dplyr::summarise(mean = mean({{ arg }}, na.rm = TRUE))
}

Using enquo() and !! separately is useful in more complex cases where you need access to the defused expression instead of just passing it on. rlang documentation

Defused arguments and quosures

If you inspect the return values of expr() and enquo(), you’ll notice that the latter doesn’t return a raw expression like the former. Instead it returns a quosure, a wrapper containing an expression and an environment. rlang documentation

expr(1 + 1)
## 1 + 1
my_function <- function(arg) enquo(arg)
my_function(1 + 1)
## <quosure>
## expr: ^1 + 1
## env:  global

R needs information about the environment to properly evaluate argument expressions because they come from a different context than the current function. rlang documentation
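
Here is a small sketch of why the environment matters. The function make_quosure() and the variable local_value are hypothetical names made up for this illustration. The quosure is created inside a function and evaluated outside of it, and evaluation still succeeds because the quosure carries the function's environment with it.

```r
library(rlang)

make_quosure <- function() {
  local_value <- 10      # Exists only inside this function
  quo(local_value * 2)   # The quosure records this environment
}

q <- make_quosure()

# `local_value` is not defined where we evaluate the quosure,
# but the quosure knows where to find it
result <- eval_tidy(q)
result
```

A bare expression created with expr() would not carry that environment, so evaluating it outside of the function would fail to find local_value.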

4.9 Key Functions

4.9.1 The qq_show function

The qq_show() function helps examine injected expressions inside a function. This is useful for learning about injection and for debugging injection code.

my_mean <- function(data, var) {
  rlang::qq_show(data %>% dplyr::summarise(mean({{ var }})))
}

mtcars %>% my_mean(cyl)
## data %>% dplyr::summarise(mean(^cyl))

4.9.2 The quo function

The quo() function creates an object of class quosure, which is a special type of formula.

Use quo() to capture expressions when programming outside of user-defined functions.

# What does quo() return?
quo(species) # Where species is a variable in the Starwars tibble
## <quosure>
## expr: ^species
## env:  global
# Basic usage of quo() in function
freq_table <- function(df, x, ...) {
  df %>%            # No quoting and unquoting necessary for the tibble
    count(!!x) %>%  # Don't forget to unquote (!!) where you want the quosure evaluated
    top_n(3, n)     # Return top 3 results
}

freq_table(df = starwars, x = quo(species))
## # A tibble: 3 × 2
##   species     n
##   <chr>   <int>
## 1 Droid       6
## 2 Human      35
## 3 <NA>        4

4.9.3 The enquo function

If you want the user of your function to be able to pass the variable name as an argument without wrapping in quo(), that’s where enquo() comes in.

# Basic usage of enquo() in function
freq_table <- function(df, x, ...) {
  x <- enquo(x)     # Capturing function argument and turning it into a quosure
  df %>%                             
    count(!!x) %>%
    top_n(3, n)                      
}

freq_table(df = starwars, x = species) # Notice we no longer need to wrap species with quo()
## # A tibble: 3 × 2
##   species     n
##   <chr>   <int>
## 1 Droid       6
## 2 Human      35
## 3 <NA>        4

4.9.4 The embrace operator

As mentioned above in the discussion of the defuse-and-inject pattern, the embrace operator {{ can often be used to defuse-and-inject in a single step.

freq_table <- function(df, x, ...) {
  df %>%                             
    count({{ x }}) %>%
    top_n(3, n)                      
}

freq_table(df = starwars, x = species)
## # A tibble: 3 × 2
##   species     n
##   <chr>   <int>
## 1 Droid       6
## 2 Human      35
## 3 <NA>        4

Where the embrace operator can get you in trouble is with nested functions (see below) and unquote-splicing (see below).

4.9.5 The quos function

Use quos() with ... when you want to pass multiple variables / arguments / expressions into your function. You must unquote-splice them with !!! inside your function for them to be evaluated.

# What does quos() return?
quos(species, name) # Where species and name are variables in the Starwars tibble
## <list_of<quosure>>
## 
## [[1]]
## <quosure>
## expr: ^species
## env:  global
## 
## [[2]]
## <quosure>
## expr: ^name
## env:  global

You can iterate over the list of quosures returned by quos():

my_quos <- quos(species, name)

for(i in seq_along(my_quos)) {
  print(my_quos[[i]])
}
## <quosure>
## expr: ^species
## env:  global
## <quosure>
## expr: ^name
## env:  global

4.9.6 The enquos function

Typically you will use enquos() instead of quos(), and use it with the dot-dot-dot argument to a function. When you do, don’t forget to unquote-splice with !!!.

grouped_mean <- function(df, x, ...) {
  mean_var <- enquo(x)
  group_vars <- enquos(...)
  
  df %>% 
    group_by(!!!group_vars) %>% 
    summarise(mean = mean(!!mean_var), .groups = "drop")
}

grouped_mean(mtcars, disp, cyl, am)
## # A tibble: 6 × 3
##     cyl    am  mean
##   <dbl> <dbl> <dbl>
## 1     4     0 136. 
## 2     4     1  93.6
## 3     6     0 205. 
## 4     6     1 155  
## 5     8     0 358. 
## 6     8     1 326

Or

freq_table <- function(df, ...) { # Notice we dropped the "x" argument
  x <- enquos(...)                # Capturing function arguments and turning them into a list of quosures
  
  df %>%                             
    count(!!!x) %>%               # Must use unquote-splice (!!!) in this case
    slice(1:5)                      
}

freq_table(df = starwars, species, hair_color)
## # A tibble: 5 × 3
##   species  hair_color     n
##   <chr>    <chr>      <int>
## 1 Aleena   none           1
## 2 Besalisk none           1
## 3 Cerean   white          1
## 4 Chagrian none           1
## 5 Clawdite blonde         1

Note that the embrace operator cannot be used to unquote-splice the ... argument.

freq_table <- function(df, ...) { 
  df %>%                             
    count({{ ... }}) %>%  # Doesn't work: {{ cannot embrace the ... argument
    slice(1:5)                      
}

freq_table(df = starwars, species, hair_color)
## Error in FUN(X[[i]], ...): object 'hair_color' not found
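
For completeness, when the dots only need to be handed on to another data-masking function, they can simply be forwarded without any embracing or splicing. A minimal sketch (count_by() is a hypothetical name used only for this illustration):

```r
library(dplyr, warn.conflicts = FALSE)

count_by <- function(df, ...) {
  df %>%
    count(...) %>%   # `...` is forwarded as-is to count()
    slice(1:5)
}

out <- count_by(mtcars, cyl, am)
out
```

Using enquos() and !!! is still the way to go when you need to inspect or modify the captured expressions before passing them on.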

4.9.7 The as_label and as_name functions

Rlang 0.3.1 replaced quo_name() with as_label() and as_name().

Sometimes we want to convert the argument to a string for use in our function output. For example, we may want to dynamically create variable names inside the function.

# What do as_label() and as_name() return?
# Input must be a string or a quosure
list(
  as_label_quotes  = rlang::as_label("height"),
  as_label_quosure = rlang::as_label(quo(height)),
  as_name_quotes   = rlang::as_name("height"),
  as_name_quosure  = rlang::as_name(quo(height))
)
## $as_label_quotes
## [1] "\"height\""
## 
## $as_label_quosure
## [1] "height"
## 
## $as_name_quotes
## [1] "height"
## 
## $as_name_quosure
## [1] "height"

I still don’t fully understand when to use one versus the other, but so far, they have been most useful for converting symbols/quosures to character strings.
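
One difference I can demonstrate is how the two functions treat input that is not a simple symbol. In the sketch below, foo(bar) is just a made-up call that is never evaluated. as_label() returns a display label for the call, while as_name() appears to accept only symbols and strings and throws an error otherwise.

```r
library(rlang)

# as_label() accepts any expression and returns a label for display
label_call <- as_label(quo(foo(bar)))

# as_name() expects a symbol (or a string), so a call is an error
name_call <- tryCatch(
  as_name(quo(foo(bar))),
  error = function(e) "error"
)

# For a simple symbol, the two functions agree
label_sym <- as_label(quo(height))
name_sym  <- as_name(quo(height))
```

So as_name() looks like the safer choice when the string will be used as a column name, and as_label() looks more suitable for messages and default labels.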

continuous_table <- function(df, x) {
  x <- enquo(x)                                 # Must enquo first
  mean_name <- paste0("mean_", rlang::as_name(x))
  sum_name  <- paste0("sum_", rlang::as_name(x))
  
  df %>% 
    summarise(
      !!mean_name := mean(!!x, na.rm = TRUE), # Must use !! and := to set the variable names
      !!sum_name  := sum(!!x, na.rm = TRUE)
    )
}

continuous_table(starwars, height)
## # A tibble: 1 × 2
##   mean_height sum_height
##         <dbl>      <int>
## 1        174.      14123

Alternatively, with the embrace operator and glue to make the var names (supported as of rlang 0.4.3).

continuous_table <- function(df, x) {
  df %>% 
    summarise(
      "mean_{{ x }}" := mean({{ x }}, na.rm = TRUE), # Must use := to set the variable names
      "sum_{{ x }}"  := sum({{ x }}, na.rm = TRUE)
    )
}

continuous_table(starwars, height)
## # A tibble: 1 × 2
##   mean_height sum_height
##         <dbl>      <int>
## 1        174.      14123

4.9.8 The sym function

The sym() function takes a string as input and turns it into a symbol.

my_col <- "height"
rlang::qq_show(
  starwars %>% 
    summarize(
      mean(my_col)
    )
)
## starwars %>% summarize(mean(my_col))

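Doesn’t work because R looks for a column named “my_col” in the data frame “starwars”, doesn’t find one, and falls back to the my_col object in the global environment, which holds the character string “height” rather than the column.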

my_col <- "height"
rlang::qq_show(
  starwars %>% 
    summarize(
      mean(!!my_col)
    )
)
## starwars %>% summarize(mean("height"))

Doesn’t work because R will try to calculate the mean of the character string “height”.

my_col <- rlang::sym("height")
rlang::qq_show(
  starwars %>% 
    summarize(
      mean(!!my_col)
    )
)
## starwars %>% summarize(mean(height))

This looks like what we would type manually.

my_col <- rlang::sym("height")
starwars %>% 
  summarize(
    mean = mean(!!my_col, na.rm = TRUE)
  )
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  174.

And it works as expected
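
When the column name is already stored as a string, subsetting the .data pronoun (covered in more detail later in this chapter) is an alternative to sym() plus !!. A minimal sketch using mtcars, where col_name is just an illustrative variable name:

```r
library(dplyr, warn.conflicts = FALSE)

col_name <- "cyl"   # Column name stored as a character string

res <- mtcars %>%
  summarise(
    mean = mean(.data[[col_name]], na.rm = TRUE)  # Look up the column by name
  )
res
```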

4.9.9 The syms function

Like sym(), but can convert multiple strings into a list of symbols

my_cols <- rlang::syms(c("height", "mass"))
rlang::qq_show(
  starwars %>% 
    summarize(
      mean(!!my_cols)
    )
)
## starwars %>% summarize(mean(<list: height, mass>))

Notice that unquoting with !! returns a list of symbols. To unlist them, we must use the splice operator.

my_cols <- rlang::syms(c("height", "mass"))
rlang::qq_show(
  starwars %>% 
    summarize(
      mean(!!!my_cols)
    )
)
## starwars %>% summarize(mean(height, mass))

Of course, to make this meaningful we need to map it over height and mass

my_cols <- rlang::syms(c("height", "mass"))

summarise_avg <- function(data, col) {
  col <- enquo(col)
  data %>% 
    summarise(avg = mean(!!col, na.rm = TRUE))
}

results <- purrr::map_df(my_cols, summarise_avg, data = starwars)
results
## # A tibble: 2 × 1
##     avg
##   <dbl>
## 1 174. 
## 2  97.3

4.9.10 The rlang pronouns

The rlang package includes two (as of this writing) pronouns: .data and .env. I’m still slightly confused about what these pronouns are (see SO post here), but I’m getting more comfortable with how they are used.

I found a nice example on this blog post.

Because of data masking, filter looks for cyl and carb in mtcars and it returns rows where the value of cyl matches the value of carb.

mtcars %>% filter(cyl == carb)
##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Ferrari Dino  19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8

In this example, because num_cyl doesn’t exist in mtcars, filter will automatically look to the global environment and return rows where the value of cyl matches the value of num_cyl (a constant 6).

num_cyl <- 6
mtcars %>% filter(cyl == num_cyl)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Now, we create an object in the global environment that shares its name with a column in mtcars - carb. Because of data masking (and scoping rules), filter still looks for cyl and carb in mtcars first. Because carb exists in mtcars, filter returns rows where the value of cyl matches the value of mtcars$carb - not the carb object in the global environment.

carb <- 6
mtcars %>% filter(cyl == carb)
##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Ferrari Dino  19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8

But, we can be more explicit (i.e., safer) about using mtcars$carb with the .data pronoun.

carb <- 6
mtcars %>% filter(.data$cyl == .data$carb)
##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Ferrari Dino  19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8

OR

carb <- 6
mtcars %>% filter(.data[["cyl"]] == .data[["carb"]])
##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Ferrari Dino  19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8

Similarly, we can use the .env pronoun to explicitly instruct filter to compare cyl to the carb object in the global environment.

carb <- 6
mtcars %>% filter(.data[["cyl"]] == .env[["carb"]])
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
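
The pronouns seem most useful inside functions, where an argument can accidentally share its name with a column. In this sketch (filter_cyl() is a hypothetical function written for this illustration), the argument cyl collides with the cyl column, and the pronouns make it explicit which one is which.

```r
library(dplyr, warn.conflicts = FALSE)

filter_cyl <- function(df, cyl) {
  df %>%
    # .data$cyl is the column; .env$cyl is the function argument
    filter(.data$cyl == .env$cyl)
}

six <- filter_cyl(mtcars, 6)
nrow(six)
```

Without the pronouns, filter(cyl == cyl) would compare the column to itself and keep every row.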

4.10 Ellipsis

In technical language, the three dots argument in R is called an ellipsis. And it means that the function is designed to take any number of named or unnamed arguments. The interesting question is: How do you write functions that make use of ellipsis? The answer is very simple: you simply convert the … to a list, like so:

f <- function(...) {
  arguments <- list(...)
  paste(arguments)
}

f("Hello", "World", "!")
## [1] "Hello" "World" "!"

So, when should I use the ellipsis argument? That, again, is very simple: there are essentially two situations when you can use the three dots:

  1. When it makes sense to call the function with a variable number of arguments. See the f function immediately above. Another very prominent example is the paste() function.

  2. When, within your function, you call other functions, and these functions can have a variable number of arguments, either because (a) the called function is generic like print() or (b) the called function can be passed into the function as an argument, as for example with the FUN argument in apply(). (apply <- function (X, MARGIN, FUN, ...)).
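
A minimal base R sketch of both ideas, using made-up function names (mean_of() and arg_names()): the first function forwards its dots to mean(), and the second collects them into a list to inspect their names.

```r
# Situation 2: forward the dots to another function
mean_of <- function(x, ...) {
  mean(x, ...)   # Anything in `...` (e.g., na.rm) is passed along to mean()
}

with_na    <- mean_of(c(1, NA, 3))
without_na <- mean_of(c(1, NA, 3), na.rm = TRUE)

# Situation 1: accept any number of (named) arguments
arg_names <- function(...) {
  names(list(...))
}

nms <- arg_names(a = 1, b = 2)
```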

4.11 Dynamic dots

In addition to the base ellipsis syntax, rlang supports something it calls dynamic dots. Programming with dynamic dots (…) presents some opportunities and also some challenges.

  1. You can splice arguments saved in a list with the splice operator !!!.

  2. You can inject names with glue syntax on the left-hand side of :=.

  3. Trailing commas are ignored, making it easier to copy and paste lines of arguments.

If your function takes dots, adding support for dynamic features is as easy as collecting the dots with list2() instead of list(). See also dots_list(), which offers more control over the collection.

In general, passing ... to a function that supports dynamic dots causes your function to inherit the dynamic behavior.

In packages, document dynamic dots with this standard tag:

@param ... <[dynamic-dots][rlang::dyn-dots]> What these dots do.

f <- function(...) {
  out <- rlang::list2(...)
  rev(out)
}

# Trailing commas are ignored
f(this = "that", )
## $this
## [1] "that"
# Splice lists of arguments with `!!!`
x <- list(alpha = "first", omega = "last")
f(!!!x)
## $omega
## [1] "last"
## 
## $alpha
## [1] "first"
# Inject a name using glue syntax
if (rlang::is_installed("glue")) {
  nm <- "key"
  f("{nm}" := "value")
  f("prefix_{nm}" := "value")
}
## $prefix_key
## [1] "value"
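
The statement above about inheriting dynamic behavior can be sketched as follows. The names inner(), outer(), and args are made up for this illustration. outer() does nothing special with its dots, but because it passes them on to a function that collects them with list2(), splicing still works.

```r
library(rlang)

inner <- function(...) list2(...)   # Supports dynamic dots
outer <- function(...) inner(...)   # Only forwards its dots

args <- list(alpha = "first", omega = "last")

# `!!!` is handled by list2(), even though we called outer()
spliced <- outer(!!!args)
spliced
```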

Defuse and inject unquoted column names

# Doesn't work
f <- function(.data, ...) {
  list(...)
}

mtcars %>% f(cyl, am)
## Error in f(., cyl, am): object 'cyl' not found
# Doesn't work
f <- function(.data, ...) {
  rlang::list2(...)
}

mtcars %>% f(cyl, am)
## Error in rlang::list2(...): object 'cyl' not found
# Must defuse first
f <- function(.data, ...) {
  enquos(...)
}

mtcars %>% f(cyl, am)
## <list_of<quosure>>
## 
## [[1]]
## <quosure>
## expr: ^cyl
## env:  0x7fdb27174ff0
## 
## [[2]]
## <quosure>
## expr: ^am
## env:  0x7fdb27174ff0

Now you can inject them into tidyverse functions with the splice operator:

# Defuse first, then inject with !!!
f <- function(.data, ...) {
  dot_vars <- enquos(...)
  .data %>% count(!!!dot_vars)
}

mtcars %>% f(cyl, am)
##   cyl am  n
## 1   4  0  3
## 2   4  1  8
## 3   6  0  4
## 4   6  1  3
## 5   8  0 12
## 6   8  1  2

4.11.1 Convert quosures to strings

It took me a while to figure this out. The answer eventually came from: https://adv-r.hadley.nz/quasiquotation.html#quasi-motivation.

Start by using ensyms() instead of enquos() to return naked expressions instead of quosures (https://rlang.r-lib.org/reference/defusing-advanced.html).

f <- function(.data, ...) {
  rlang::ensyms(...)
}

mtcars %>% f(cyl, am)
## [[1]]
## cyl
## 
## [[2]]
## am

Then use purrr::map() and rlang::as_string() or rlang::as_name() to convert symbols to character strings.

f <- function(.data, ...) {
  dot_syms <- rlang::ensyms(...)
  purrr::map(dot_syms, rlang::as_name)
}

mtcars %>% f(cyl, am)
## [[1]]
## [1] "cyl"
## 
## [[2]]
## [1] "am"
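
Another option that seems to get to the same place in one step is the .named argument of enquos(), which gives unnamed arguments a default name based on their expression. A minimal sketch (dot_names() is a hypothetical name used only here):

```r
library(rlang)

dot_names <- function(.data, ...) {
  # Unnamed arguments are automatically named after their expression
  names(enquos(..., .named = TRUE))
}

nms <- dot_names(mtcars, cyl, am)
nms
```

Note that this returns a character vector rather than a list, and, unlike ensyms(), it does not complain if the user passes something more complex than a bare column name.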

4.11.2 Example: Multiple n-way tables

This example comes from when I was working on the freqtables package. I was trying to create a wrapper around a simple purrr iteration and wanted to use the dot arguments as names for the output list. In other words, take unquoted variable names, defuse and inject them for analysis, then turn them into character strings.

# Multiple n-way tables
freq_table2 <- function(.data, .freq_var, drop = FALSE) {
  .data <- dplyr::count(.data, {{ .freq_var }}, .drop = drop)
  .data
}

# For testing
# mtcars %>% 
#   group_by(am) %>% 
#   freq_table2(cyl)

# And if you want more than one table
# purrr::map(
#   .x = quos(cyl, vs),
#   .f = ~ mtcars %>% group_by(am) %>% freq_table2({{ .x }})
# )
# Make a wrapper
freq_tables <- function(.data, ...) {
  # Defuse the user expression in `...` for calculations
  dot_vars <- enquos(...)
  # Make syms and then strings for naming the list
  dot_syms <- rlang::ensyms(...)
  dot_names <- purrr::map(dot_syms, rlang::as_name)
  # Perform the calculations
  purrr::map(
    .x = dot_vars, # Could also use enquos(...) directly here
    .f = ~ .data %>% freq_table2({{ .x }}) # Must use !! or {{
  ) %>% 
    rlang::set_names(dot_names)
}

mtcars %>% 
  freq_tables(cyl, vs)
## $cyl
##   cyl  n
## 1   4 11
## 2   6  7
## 3   8 14
## 
## $vs
##   vs  n
## 1  0 18
## 2  1 14

4.12 Example: for Loop

In this example, I’m creating a table of summary statistics using the Starwars data. The table will compare some simple characteristics of the characters by species.

First, I’m going to reclassify every character as Human or Not Human

starwars <- mutate(starwars, human = if_else(species == "Human", "Yes", "No", NA_character_))

Now I’m going to create the table shell

vars = 3        # Number of vars
rows = vars + 1 # Additional row for group sample size
table <- tibble(
  Variable = vector(mode = "character", length = rows),
  Human = vector(mode = "character", length = rows),
  `Not Human` = vector(mode = "character", length = rows)
)

# N for Human
table[1, 2] <- paste0(
  "(N = ",
  filter(starwars, human == "Yes") %>% nrow() %>% format(big.mark = ","),
  ")"
)

# N for Not Human
table[1, 3] <- paste0(
  "(N = ",
  filter(starwars, human == "No") %>% nrow() %>% format(big.mark = ","),
  ")"
)
table
## # A tibble: 4 × 3
##   Variable Human      `Not Human`
##   <chr>    <chr>      <chr>      
## 1 ""       "(N = 35)" "(N = 48)" 
## 2 ""       ""         ""         
## 3 ""       ""         ""         
## 4 ""       ""         ""

Finally, I’ll fill in the table using a for loop. In this case, I just want to compare the mean height, mass, and birth year of humans and non-humans.

vars <- quos(height, mass, birth_year)                    # Create list of quosures for variables of interest

for(i in seq_along(vars)) {
  table[i + 1, ] <- starwars %>%                          # Row of table to receive loop output
    filter(!is.na(human)) %>% 
    group_by(human) %>% 
    summarise(Mean = mean(!!vars[[i]], na.rm = TRUE)) %>% # Use !! with vars[[i]]
    mutate(Mean = round(Mean, 1) %>% format(nsmall = 1)) %>% 
    tidyr::pivot_wider(
      names_from = human,
      values_from = Mean
    ) %>% 
    mutate(Variable = rlang::as_name(vars[[i]])) %>%      # Use as_name to get variable name for first column
    select(Variable, Yes, No)
}
table
## # A tibble: 4 × 3
##   Variable     Human      `Not Human`
##   <chr>        <chr>      <chr>      
## 1 ""           "(N = 35)" (N = 48)   
## 2 "height"     "176.6"    172.4      
## 3 "mass"       " 82.8"    107.6      
## 4 "birth_year" " 53.4"    139.3

4.13 Example: function

In this example, I’m creating a table of summary statistics using the Starwars data. The table will compare some simple characteristics of the characters by species.

First, I’m going to reclassify every character as Human or Not Human

starwars <- mutate(starwars, human = if_else(species == "Human", "Yes", "No", NA_character_))

Now I’m going to create the table shell

vars = 3        # Number of vars
rows = vars + 1 # Additional row for group sample size
table <- tibble(
  Variable = vector(mode = "character", length = rows),
  Human = vector(mode = "character", length = rows),
  `Not Human` = vector(mode = "character", length = rows)
)

# N for Human
table[1, 2] <- paste0(
  "(N = ",
  filter(starwars, human == "Yes") %>% nrow() %>% format(big.mark = ","),
  ")"
)

# N for Not Human
table[1, 3] <- paste0(
  "(N = ",
  filter(starwars, human == "No") %>% nrow() %>% format(big.mark = ","),
  ")"
)
table
## # A tibble: 4 × 3
##   Variable Human      `Not Human`
##   <chr>    <chr>      <chr>      
## 1 ""       "(N = 35)" "(N = 48)" 
## 2 ""       ""         ""         
## 3 ""       ""         ""         
## 4 ""       ""         ""

Finally, I’ll fill in the table using a user-defined function. In this case, I just want to compare the mean height, mass, and birth year of humans and non-humans.

my_stats <- function(df, vars) {
  df %>% 
    filter(!is.na(human)) %>% 
    group_by(human) %>% 
    # Calculate means
    summarise(
      across(
        .cols  = {{ vars }},
        .fns   = mean, na.rm = TRUE
      )
    ) %>% 
    # Format the results
    mutate(
      across(
        .cols = where(is.numeric),
        .fns  = ~ round(.x, 1) %>% format(nsmall = 1)
      )
    ) %>% 
    # Restructure results to match the summary table
    tidyr::pivot_wider(
      names_from = human,
      values_from = {{ vars }},
      names_sep = "-"
    ) %>%
    tidyr::pivot_longer(
      cols = everything(),
      names_to = c("Variable", ".value"),
      names_sep = "-"
    ) %>%
    # Reorder and rename the columns to match the output table
    select(Variable, Human = Yes, `Not Human` = No)
}

# For testing
# my_stats(starwars, c(height, mass, birth_year))
# Or if you prefer to use ...
my_stats <- function(df, ...) {
  
  vars <- enquos(...)
  
  df %>% 
    filter(!is.na(human)) %>% 
    group_by(human) %>% 
    # Calculate means
    summarise(
      across(
        .cols  = c(!!!vars),
        .fns   = mean, na.rm = TRUE
      )
    )
}

# For testing
# my_stats(starwars, height, mass, birth_year)
table %>% 
  bind_rows(
    my_stats(starwars, c(height, mass, birth_year))
  )
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(.cols = c(height, mass, birth_year), .fns = mean,
##   na.rm = TRUE)`.
## ℹ In group 1: `human = "No"`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 7 × 3
##   Variable     Human      `Not Human`
##   <chr>        <chr>      <chr>      
## 1 ""           "(N = 35)" "(N = 48)" 
## 2 ""           ""         ""         
## 3 ""           ""         ""         
## 4 ""           ""         ""         
## 5 "height"     "176.6"    "172.4"    
## 6 "mass"       " 82.8"    "107.6"    
## 7 "birth_year" " 53.4"    "139.3"
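
The deprecation warning above suggests passing arguments to the function through an anonymous function instead of through the dots of across(). Here is a minimal sketch of that pattern combined with the embrace operator. mean_cols() is a hypothetical helper and mtcars is used to keep the example self-contained; the \(x) shorthand assumes R 4.1 or later.

```r
library(dplyr, warn.conflicts = FALSE)

mean_cols <- function(df, vars) {
  df %>%
    summarise(
      across(
        .cols = {{ vars }},
        .fns  = \(x) mean(x, na.rm = TRUE)  # na.rm is supplied inside the lambda
      )
    )
}

out <- mean_cols(mtcars, c(mpg, cyl))
out
```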

4.14 Nesting functions with data masking

One place where the embrace operator can get you in trouble is with nested functions. I ran into this problem when writing the codebook package. In the example below, notice that the name of the column we want to analyze (i.e., height) is passed to the x argument of the cb_add_summary_stats() function as a string (i.e., “height”), and then to the x argument of the cb_summary_stats_numeric() function, and then to the mean() function inside of the summarise() function. Along the way, the association between x and height is lost.

codebook <- function(df) {
  x <- "height"
  cb_add_summary_stats(df, x)
}

cb_add_summary_stats <- function(df, x) {
  cb_summary_stats_numeric(df, x)
}

cb_summary_stats_numeric <- function(df, x) {
  summary <- df %>% 
    summarise(mean = mean({{ x }}, na.rm = TRUE))
  
  summary
}

codebook(study)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean = mean(x, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1    NA

Let’s take a look at what R sees in cb_summary_stats_numeric().

cb_summary_stats_numeric <- function(df, x) {
  rlang::qq_show(
    summary <- df %>% 
      summarise(mean = mean({{ x }}, na.rm = TRUE))
  )
}

codebook(study)
## summary <- df %>% summarise(mean = mean(^x, na.rm = TRUE))

But, what we want R to see is height instead of x. The simplest fix is to use the .data pronoun.

codebook <- function(df) {
  x <- "height"
  cb_add_summary_stats(df, x)
}

cb_add_summary_stats <- function(df, x) {
  cb_summary_stats_numeric(df, x)
}

cb_summary_stats_numeric <- function(df, x) {
  summary <- df %>% 
    summarise(mean = mean(.data[[x]], na.rm = TRUE))

  summary
}

codebook(study)
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  72.3
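
The .data pronoun is the right tool when the column name is a string. If the column is instead passed through the nested functions as an unquoted name, a different approach should work: embrace the argument at every level, so that each function passes the user's expression along rather than the symbol x. A minimal sketch with made-up function names (outer_stats() and inner_stats()) and mtcars:

```r
library(dplyr, warn.conflicts = FALSE)

inner_stats <- function(df, x) {
  df %>% summarise(mean = mean({{ x }}, na.rm = TRUE))
}

outer_stats <- function(df, x) {
  inner_stats(df, {{ x }})   # Embrace again when passing the argument down
}

res <- outer_stats(mtcars, mpg)
res
```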

4.15 Using purrr

I have a situation that I’m a little confused about. I think if I could understand the contrived situation below, then I could figure out my issue.

Essentially, when I use map(x, function(x) { !!x }) with tidyeval I get the result that I expect. When I try to use the purrr shortcut, i.e., map(x, ~ { !!. }) I get an error. I don’t understand why. If anybody has insight that they care to share, I would really appreciate it!

A reprex is below:

This works as expected:

vars <- quos(gender, species)

map_df(vars, function(x){
  starwars %>%
    group_by(!! x) %>%
    summarise(mean(height, na.rm = TRUE))
})
## # A tibble: 41 × 3
##    gender    `mean(height, na.rm = TRUE)` species 
##    <chr>                            <dbl> <chr>   
##  1 feminine                          165. <NA>    
##  2 masculine                         177. <NA>    
##  3 <NA>                              181. <NA>    
##  4 <NA>                               79  Aleena  
##  5 <NA>                              198  Besalisk
##  6 <NA>                              198  Cerean  
##  7 <NA>                              196  Chagrian
##  8 <NA>                              168  Clawdite
##  9 <NA>                              131. Droid   
## 10 <NA>                              112  Dug     
## # … with 31 more rows

I’m not sure why this doesn’t work:

# vars <- quos(gender, species)
# 
# map_df(vars, ~ {
#   starwars %>%
#     group_by(!! .) %>%
#     summarise(mean(height, na.rm = TRUE))
# })

The answer is to use “.x” instead of “.”:

vars <- quos(gender, species)

map_df(vars, ~ {
  starwars %>%
    group_by(!! .x) %>%
    summarise(mean(height, na.rm = TRUE))
})
## # A tibble: 41 × 3
##    gender    `mean(height, na.rm = TRUE)` species 
##    <chr>                            <dbl> <chr>   
##  1 feminine                          165. <NA>    
##  2 masculine                         177. <NA>    
##  3 <NA>                              181. <NA>    
##  4 <NA>                               79  Aleena  
##  5 <NA>                              198  Besalisk
##  6 <NA>                              198  Cerean  
##  7 <NA>                              196  Chagrian
##  8 <NA>                              168  Clawdite
##  9 <NA>                              131. Droid   
## 10 <NA>                              112  Dug     
## # … with 31 more rows

However, I’m not entirely sure why. When I have more time, I’d like to figure this out.

vars <- quos(gender, species)

map(vars, function(x) {
  x
})
## [[1]]
## <quosure>
## expr: ^gender
## env:  global
## 
## [[2]]
## <quosure>
## expr: ^species
## env:  global
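
One likely explanation for the difference between “.” and “.x”: inside a magrittr pipeline, “.” is the placeholder for the pipe's left-hand side. So in the failing example, group_by(!! .) probably sees starwars (the thing being piped) rather than the quosure that purrr passed to the lambda. The “.x” pronoun is only defined by purrr, so it is not overridden by the pipe. The sketch below uses plain numbers instead of quosures to show the same effect.

```r
library(purrr)
library(magrittr)

# Inside the pipeline, `.` is the pipe's left-hand side (100),
# not the element that map_dbl() is iterating over
dot_result <- map_dbl(c(1, 2), ~ 100 %>% { . })

# `.x` is defined only by the purrr lambda, so it is still the element
dotx_result <- map_dbl(c(1, 2), ~ 100 %>% { .x })

dot_result
dotx_result
```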

4.16 Other Quirks and Lessons Learned

4.16.1 When !! doesn’t work

I’ve noticed that using !! doesn’t always work. At this point, I’m not exactly sure the rules related to when it works and when it doesn’t work, but I do want to write down some examples and fixes.

Sometimes it’s my fault:

example <- function(df, var, ...) {
  x <- enquo(var)
  
  print(!!x) # This doesn't work - need to associate the quosure variable with its data frame
}
starwars %>% example(hair_color)
## Error in `print()`:
## ! Quosures can only be unquoted within a quasiquotation context.
## 
## # Bad: list(!!myquosure)
## 
## # Good: dplyr::mutate(data, !!myquosure)

Fix:

example <- function(df, var, ...) {
  x <- enquo(var)
  
  df %>% select(!!x) %>% print()
}
starwars %>% example(hair_color)
## # A tibble: 87 × 1
##    hair_color   
##    <chr>        
##  1 blond        
##  2 <NA>         
##  3 <NA>         
##  4 none         
##  5 brown        
##  6 brown, grey  
##  7 brown        
##  8 <NA>         
##  9 black        
## 10 auburn, white
## # … with 77 more rows
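
As a variation on the fix above, here is a minimal sketch that embraces the argument with {{ and uses dplyr's pull() to get the column's values as a plain vector, which base functions such as print() can use directly. The function name print_values is hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: extract the column as a vector, then print the first few values
print_values <- function(df, var, n = 3) {
  values <- df %>% pull({{ var }}) # pull() is a dplyr verb, so embracing works here
  print(head(values, n))
  invisible(values)
}

hair <- starwars %>% print_values(hair_color)
```

The unquoting still happens inside a dplyr verb; only the result is handed to print().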

4.16.2 Unquoting inside non-dplyr functions

I’ve noticed some weirdness when trying to unquote quosures inside functions that are inside dplyr functions. For example, if_else inside of mutate.

# This didn't use to work, but it does now
example <- function(df, var) {
  x <- enquo(var)
  
  df %>% 
    mutate(hair_color = if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## # A tibble: 87 × 15
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blonde  fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 5 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, human <chr>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

4.16.3 Using a quosure to create a variable name in mutate

Naming (or overwriting) a variable inside of mutate can also be tricky, because the left-hand side of = must be a plain name, not an unquoted expression.

example <- function(df, var) {
  x <- enquo(var)
  
  df %>% 
    mutate(!!x = if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## Error: <text>:5:16: unexpected '='
## 4:   df %>% 
## 5:     mutate(!!x =
##                   ^
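
The error above comes from R's parser, not from dplyr or rlang: the code never runs because `!!x =` is not valid syntax for a named argument. A small base R sketch of the difference follows; the object names are only for illustration.

```r
# `!!x = 1` cannot be parsed as a named argument, so parse() fails
res_equals <- try(parse(text = "mutate(df, !!x = 1)"), silent = TRUE)
inherits(res_equals, "try-error")

# `:=` is an operator the parser accepts, so this is syntactically valid R
res_walrus <- parse(text = "mutate(df, !!x := 1)")
class(res_walrus)
```

This is why the fixes below use := on the left-hand side.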

Fix:

# This didn't use to work, but it does now
example <- function(df, var) {
  x <- enquo(var)
  
  df %>% 
    mutate(!!x := if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## # A tibble: 87 × 15
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blonde  fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 5 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, human <chr>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Fix 2:

Must have !! in front of as_name(). Must use := instead of =.

example <- function(df, var) {
  x <- enquo(var)
  
  df %>% 
    mutate(!!rlang::as_name(x) := if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## # A tibble: 87 × 15
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blonde  fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 5 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, human <chr>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Fix 3:

Alternatively, use the embrace operator and glue to make the var names (supported as of rlang 0.4.3).

example <- function(df, var) {
  df %>% 
    mutate(
      "{{ var }}" := if_else({{ var }} == "blond", "blonde", {{ var }}),
      # Even create a new variable
      "new_{{ var }}" := if_else({{ var }} == "blond", "blonde", {{ var }})
    )
}
starwars %>% example(hair_color)
## # A tibble: 87 × 16
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blonde  fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 6 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, human <chr>, new_hair_color <chr>, and
## #   abbreviated variable names ¹​hair_color, ²​skin_color, ³​eye_color,
## #   ⁴​birth_year, ⁵​homeworld

4.16.4 Using a quosure to turn a variable name into a constant value

When I’m looping over many variables, I often want to create a variable in my output called “characteristic” or “variable” that captures the current variable name as a value.

example <- function(df, var) {
  x <- enquo(var)                              # Make sure to use enquo here
  
  df %>% 
    summarise(
      Mean = mean(!!x, na.rm = TRUE)
    ) %>% 
    mutate(Characteristic = !!rlang::as_name(x)) %>% # Make sure to use !!as_name()
    select(Characteristic, Mean)
}

starwars %>% example(height)
## # A tibble: 1 × 2
##   Characteristic  Mean
##   <chr>          <dbl>
## 1 height          174.

Alternatively:

example <- function(df, var) {
  df %>% 
    summarise(
      Mean = mean({{ var }}, na.rm = TRUE)
    ) %>% 
    mutate(Characteristic = !!rlang::as_name(enquo(var))) %>%
    select(Characteristic, Mean)
}

starwars %>% example(height)
## # A tibble: 1 × 2
##   Characteristic  Mean
##   <chr>          <dbl>
## 1 height          174.
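
A related sketch, assuming the argument might sometimes be an expression rather than a bare column name: rlang::as_label() accepts any expression, whereas as_name() expects a symbol. The function name summarise_labelled is hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant of the function above that also accepts expressions
summarise_labelled <- function(df, var) {
  label <- rlang::as_label(enquo(var)) # Works for symbols and for calls

  df %>%
    summarise(Mean = mean({{ var }}, na.rm = TRUE)) %>%
    mutate(Characteristic = !!label) %>% # Inject the string as a constant
    select(Characteristic, Mean)
}

res_symbol <- starwars %>% summarise_labelled(height)
res_call   <- starwars %>% summarise_labelled(height / 100)
```

For a bare column name the label is just that name; for an expression it is a deparsed version of the expression.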

4.16.5 Convert a string to a quosure

Here are some useful websites:

https://github.com/tidyverse/rlang/issues/116

https://stackoverflow.com/questions/27975124/pass-arguments-to-dplyr-functions/44594223#44594223

https://stackoverflow.com/questions/44593596/how-to-pass-strings-denoting-expressions-to-dplyr-0-7-verbs/44593617#44593617

Sometimes, I want to pass a variable name as a string to a function. It then needs to be converted to a quosure for evaluation.

4.16.5.1 Simple example - Now this works

my_col <- names(starwars[2]) # Have a variable name as a quoted string ("height")
my_col <- "mass"             # Or assign the string directly; this is the value used below
starwars %>% select(!!my_col) # Now this works
## # A tibble: 87 × 1
##     mass
##    <dbl>
##  1    77
##  2    75
##  3    32
##  4   136
##  5    49
##  6   120
##  7    75
##  8    32
##  9    84
## 10    77
## # … with 77 more rows
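
For selection verbs specifically, tidyselect's all_of() is another way to use a column name stored as a string. A minimal sketch:

```r
library(dplyr, warn.conflicts = FALSE)

my_col <- "mass"

# all_of() tells select() to treat the character vector as column names
res <- starwars %>% select(all_of(my_col))
names(res)
```

all_of() raises an error if any of the named columns is missing, which can help catch typos in the string.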

4.16.5.2 When the string is created inside the function

starwars$height_squared <- starwars$height^2
example <- function(df, var) {
  
  x <- enquo(var)  # First, turn var without the suffix into a quosure - must be first
  squared <- paste(rlang::as_name(x), "squared", sep = "_") # Must use as_name()
  
  df %>% 
    summarise(
      Mean = mean(!!squared, na.rm = TRUE)
    )
}

starwars %>% example(height)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `Mean = mean("height_squared", na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## # A tibble: 1 × 1
##    Mean
##   <dbl>
## 1    NA

Fix (Method preferred by Hadley and Lionel):

example <- function(df, var) {
  
  x <- enquo(var)  # First, turn var without the suffix into a quosure - must be first
  squared <- paste(rlang::as_name(x), "squared", sep = "_") # Must use as_name()
  squared <- rlang::sym(squared) # Wrap with sym()

  df %>%
    summarise(
      Mean = mean(!!squared, na.rm = TRUE)
    )
}

starwars %>% example(height)
## # A tibble: 1 × 1
##     Mean
##    <dbl>
## 1 31595.
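
Another option, sketched here under the assumption that the data frame already contains the "_squared" column, is to subset the .data pronoun with the string instead of converting it to a symbol. The names mean_of_squared and sw are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: look the column up by name with the .data pronoun
mean_of_squared <- function(df, var) {
  squared <- paste(rlang::as_name(enquo(var)), "squared", sep = "_")

  df %>%
    summarise(Mean = mean(.data[[squared]], na.rm = TRUE))
}

sw <- starwars %>% mutate(height_squared = height^2)
res <- sw %>% mean_of_squared(height)
```

With .data[[squared]] no unquoting is needed, because the string is used as an ordinary subsetting index.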

4.16.5.3 Grouping by all columns in the data frame

I ran into this situation while checking for duplicate rows in APS data (DETECT pilot test).

starwars_2 <- starwars %>% select(-films, -vehicles, -starships) # Remove list columns

starwars_2 %>% 
  group_by(names(starwars)) %>% 
  filter(n() > 1) %>% 
  count() %>% 
  ungroup() %>% 
  select(n)
## Error in `group_by()`:
## ℹ In argument: `names(starwars)`.
## Caused by error:
## ! `names(starwars)` must be size 87 or 1, not 16.

Fix: In this case, we could have used the built-in group_by_all():

starwars_2 <- starwars %>% select(-films, -vehicles, -starships) # Remove list columns

starwars_2 %>% 
  group_by_all() %>% 
  filter(n() > 1) %>% 
  count() %>% 
  ungroup() %>% 
  select(n)
## # A tibble: 0 × 1
## # … with 1 variable: n <int>

And, there are no duplicates.
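
In current dplyr, the scoped verbs such as group_by_all() are superseded in favor of across(). Here is a minimal sketch of the same duplicate check, using a small made-up tibble that does contain a duplicate row so the result is easy to see.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data (hypothetical): rows 1 and 2 are exact duplicates
toy <- tibble(
  a = c(1, 1, 2),
  b = c("x", "x", "y")
)

dups <- toy %>%
  group_by(across(everything())) %>% # Group by every column
  filter(n() > 1) %>%                # Keep groups with more than one row
  ungroup()
dups
```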

Another, more general solution for using all column names is:

starwars_2 <- starwars %>% select(-films, -vehicles, -starships) # Remove list columns

my_cols <- starwars_2 %>% names() %>% rlang::syms()

starwars_2 %>% 
  group_by(!!!my_cols) %>% # Remember to use splice '!!!'
  filter(n() > 1) %>% 
  count() %>% 
  ungroup() %>% 
  select(n)
## # A tibble: 0 × 1
## # … with 1 variable: n <int>

4.17 Example I created for Steph Yap

Need to clean this up, but I don’t have time now.

Here is a worked example using some toy data:

aps_cleaned <- tibble(
  case_num = 1:3,
  valid_physical_neglect = c(0, 1, 0),
  valid_sexual_abuse = 0
)
discrepancies_valid_physical_neglect <- tibble(
  case_num = 1,
  valid_physical_neglect = 1
)
discrepancies_valid_sexual_abuse <- tibble(
  case_num = 3,
  valid_sexual_abuse = 1
)

Create the function

Here is a reference to help with the tidy evaluation stuff: https://dplyr.tidyverse.org/articles/programming.html

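join_aps <- function(.data = aps_cleaned, join_df, valid_col) {
  # Create the suffixed column names that left_join() produces for the shared column
  col_name <- rlang::as_name(enquo(valid_col)) # as_name() replaces quo_name()
  col_x <- sym(paste0(col_name, ".x"))
  col_y <- sym(paste0(col_name, ".y"))
  
  .data %>% 
    left_join(join_df, by = "case_num") %>% 
    mutate(
      # Use the corrected value (.y) when there is one; otherwise keep the original (.x)
      # col_x and col_y are symbols, not function arguments, so inject them with !!
      "{{valid_col}}_cleaned" := if_else(
        is.na(!!col_y), !!col_x, !!col_y
      )
    )
}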

Test function

aps_cleaned %>% 
  join_aps(discrepancies_valid_physical_neglect, valid_physical_neglect)
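
As a possible cleanup, here is a minimal self-contained sketch of the same idea that swaps in coalesce() for the if_else(is.na(...)) pattern and looks the suffixed columns up with the .data pronoun. The function name join_aps_coalesce and the small tibbles aps and fix are hypothetical stand-ins for the objects above.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-ins for aps_cleaned and one discrepancies table
aps <- tibble(case_num = 1:3, valid_physical_neglect = c(0, 1, 0))
fix <- tibble(case_num = 1L, valid_physical_neglect = 1)

join_aps_coalesce <- function(df, join_df, valid_col) {
  col_name <- rlang::as_name(enquo(valid_col))

  df %>%
    left_join(join_df, by = "case_num", suffix = c(".x", ".y")) %>%
    mutate(
      # coalesce() returns the first non-missing value: corrected (.y), else original (.x)
      "{{ valid_col }}_cleaned" := coalesce(
        .data[[paste0(col_name, ".y")]],
        .data[[paste0(col_name, ".x")]]
      )
    )
}

res <- aps %>% join_aps_coalesce(fix, valid_physical_neglect)
res
```

In this toy data, case 1 is the only row with a correction, so it is the only row whose cleaned value differs from the original.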