4 Tidy Evaluation
Created: 2017-06-24
Updated: 2024-05-31
4.1 ⭐️Overview
This chapter is about tidy evaluation. Tidy evaluation has gone through several fairly significant changes since I first wrote these notes.
rlang 0.3.1 replaced quo_name() with as_label() and as_name().
rlang 0.4.0 added the curly-curly ({{ }}) syntax.
4.4 🔢Simulate data
Here we simulate a small dataset that is intended to be representative of data from a research study.
set.seed(123)
study <- tibble(
id = as.character(seq(1001, 1020, 1)),
sex = factor(sample(c("Female", "Male"), 20, TRUE)),
date = sample(seq.Date(as.Date("2021-09-15"), as.Date("2021-10-26"), "day"), 20, TRUE),
days = sample(1L:21L, 20L, TRUE),
height = rnorm(20, 71, 10)
)
# Add missing values for testing
study$id[3] <- NA
study$sex[4] <- NA
study$date[5] <- NA
study$days[6] <- NA
study$height[7] <- NA
4.5 Load Starwars data
We will also load the mtcars and starwars data used in many of the Tidyverse examples.
4.6 Why use tidy evaluation?
The short answer is because of data masking, which is easier to see than describe.
## [1] 6.59375
## Error in eval(expr, envir, enclos): object 'cyl' not found
## [1] 6.59375
## mean(cyl + am)
## 1 6.59375
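The code that produced the output above is not shown. A plausible reconstruction, assuming only base R and dplyr, might look like the sketch below; the exact calls are a guess that happens to match the printed results.

```r
# Base R: columns must be reached through the data frame
mean(mtcars$cyl + mtcars$am)

# Without data masking, R looks for `cyl` in the calling environment and fails
try(mean(cyl + am))

# with() evaluates the expression with the columns of mtcars in scope
with(mtcars, mean(cyl + am))

# dplyr verbs are data-masking functions: columns can be used like variables
dplyr::summarise(mtcars, mean(cyl + am))
```

The last two calls work because the expression is evaluated in a context where the data frame columns are defined, which is the behavior that tidy evaluation lets us reuse inside our own functions.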
4.7 What is tidy evaluation?
Tidy Eval is a system for programming (i.e., writing new functions), as opposed to working interactively with dplyr.
While data-masking makes it easy to program interactively with data frames, it makes it harder to create functions. Passing data-masked arguments to functions requires injection with the embracing operator {{ or, in more complex cases, the injection operator !!. rlang documentation
So, tidy evaluation, operationalized primarily through the rlang package, is shorthand for a set of tools that allows us to more easily use data masking in the functions that we write. It also allows us to use other functions that use data masking (e.g., dplyr functions) inside the functions that we write.
4.8 Vocabulary
The old vocabulary was heavily centered around quasiquotation. It appears as though the rlang team is moving towards using the terms defusing, embracing, and injecting.
Injection (also known as quasiquotation) is a metaprogramming feature that allows you to modify parts of a program. This is needed because under the hood data-masking works by defusing R code to prevent its immediate evaluation. The defused code is resumed later on in a context where data frame columns are defined. rlang documentation
One purpose for defusing evaluation of an expression is to interface with data-masking functions by injecting the expression back into another function with !!. This is the defuse-and-inject pattern. rlang documentation
The defuse-and-inject pattern
my_summarise <- function(data, arg) {
# Defuse the user expression in `arg`
arg <- enquo(arg)
# Inject the expression contained in `arg`
# inside a `summarise()` argument
data |> dplyr::summarise(mean = mean(!!arg, na.rm = TRUE))
}
Defuse-and-inject is usually performed in a single step with the embrace operator {{.
my_summarise <- function(data, arg) {
# Defuse and inject in a single step with the embracing operator
data |> dplyr::summarise(mean = mean({{ arg }}, na.rm = TRUE))
}
Using enquo() and !! separately is useful in more complex cases where you need access to the defused expression instead of just passing it on. rlang documentation
Defused arguments and quosures
If you inspect the return values of expr() and enquo(), you’ll notice that the latter doesn’t return a raw expression like the former. Instead it returns a quosure, a wrapper containing an expression and an environment. rlang documentation
## 1 + 1
## <quosure>
## expr: ^1 + 1
## env: global
R needs information about the environment to properly evaluate argument expressions because they come from a different context than the current function. rlang documentation
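The calls behind the two printed objects above are not shown. As a minimal sketch, assuming rlang is installed, quo() can stand in for enquo() when working interactively outside of a function:

```r
library(rlang)

# expr() defuses your own expression and returns it as a bare call
e <- expr(1 + 1)
e

# quo() also defuses, but bundles the expression with its environment
q <- quo(1 + 1)
q

# The two components of a quosure can be pulled apart again
quo_get_expr(q)
quo_get_env(q)
```

Inside a function, enquo() plays the same role as quo() but captures the expression supplied by the caller instead of one you type yourself.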
4.9 Key Functions
4.9.1 The qq_show function
The qq_show() function helps you examine injected expressions inside a function. This is useful for learning about injection and for debugging injection code.
my_mean <- function(data, var) {
rlang::qq_show(data %>% dplyr::summarise(mean({{ var }})))
}
mtcars %>% my_mean(cyl)
## data %>% dplyr::summarise(mean(^cyl))
4.9.2 The quo function
The quo() function creates an object of class quosure, which is a special type of formula.
Use quo() to capture expressions when programming outside of user-defined functions.
## <quosure>
## expr: ^species
## env: global
# Basic usage of quo() in a function
freq_table <- function(df, x, ...) {
  df %>%            # No quoting or unquoting is necessary for the tibble
    count(!!x) %>%  # Unquote (!!) wherever you want the quosure evaluated
    top_n(3, n)     # Return the top 3 results
}
freq_table(df = starwars, x = quo(species))
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Droid 6
## 2 Human 35
## 3 <NA> 4
4.9.3 The enquo function
If you want the user of your function to be able to pass the variable name as an argument without wrapping it in quo(), that’s where enquo() comes in.
# Basic usage of enquo() in a function
freq_table <- function(df, x, ...) {
  x <- enquo(x) # Capture the function argument as a quosure
  df %>%
    count(!!x) %>%
    top_n(3, n)
}
freq_table(df = starwars, x = species) # Notice we no longer need to wrap species with quo()
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Droid 6
## 2 Human 35
## 3 <NA> 4
4.9.4 The embrace operator
As mentioned above in the discussion of the defuse-and-inject pattern, the embrace operator {{ can often be used to defuse-and-inject in a single step.
freq_table <- function(df, x, ...) {
df %>%
count({{ x }}) %>%
top_n(3, n)
}
freq_table(df = starwars, x = species)
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Droid 6
## 2 Human 35
## 3 <NA> 4
Where the embrace operator can get you in trouble is with nested functions (see below) and unquote-splicing (see below).
4.9.5 The quos function
Use quos() with ... when you want to pass multiple variables / arguments / expressions into your function. You must unquote-splice them with !!! inside your function for them to be evaluated.
# What does quos() return?
quos(species, name) # Where species and name are variables in the Starwars tibble
## <list_of<quosure>>
##
## [[1]]
## <quosure>
## expr: ^species
## env: global
##
## [[2]]
## <quosure>
## expr: ^name
## env: global
You can iterate over the list of quosures returned by quos().
## <quosure>
## expr: ^species
## env: global
## <quosure>
## expr: ^name
## env: global
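The loop that printed the two quosures above is not shown. One way to write it, as a sketch that assumes rlang is loaded, is an ordinary for loop over the list positions:

```r
library(rlang)

vars <- quos(species, name)

# Each element of the list returned by quos() is itself a quosure
for (i in seq_along(vars)) {
  print(vars[[i]])
}
```

Indexing with [[ returns the individual quosure, which is what !! expects when you later inject it into a data-masking function.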
4.9.6 The enquos function
Typically you will use enquos() instead of quos(), and use it with the dot-dot-dot argument to a function. When you do, don’t forget to unquote-splice with !!!.
grouped_mean <- function(df, x, ...) {
mean_var <- enquo(x)
group_vars <- enquos(...)
df %>%
group_by(!!!group_vars) %>%
summarise(mean = mean(!!mean_var), .groups = "drop")
}
grouped_mean(mtcars, disp, cyl, am)
## # A tibble: 6 × 3
## cyl am mean
## <dbl> <dbl> <dbl>
## 1 4 0 136.
## 2 4 1 93.6
## 3 6 0 205.
## 4 6 1 155
## 5 8 0 358.
## 6 8 1 326
Or
freq_table <- function(df, ...) { # Notice we dropped the "x" argument
  x <- enquos(...) # Capture the function arguments as a list of quosures
  df %>%
    count(!!!x) %>% # Must use unquote-splice (!!!) in this case
    slice(1:5)
}
freq_table(df = starwars, species, hair_color)
## # A tibble: 5 × 3
## species hair_color n
## <chr> <chr> <int>
## 1 Aleena none 1
## 2 Besalisk none 1
## 3 Cerean white 1
## 4 Chagrian none 1
## 5 Clawdite blonde 1
Note that the embrace operator cannot be used to unquote-splice the ... argument.
freq_table <- function(df, ...) {
  df %>%
    count({{ ... }}) %>% # Fails: {{ }} cannot embrace `...`
    slice(1:5)
}
freq_table(df = starwars, species, hair_color)
## Error in eval(expr, envir, enclos): object 'hair_color' not found
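When the dots only need to be forwarded, a simpler option than enquos() plus !!! is to pass ... straight through to the data-masking function. A minimal sketch, assuming dplyr is loaded:

```r
library(dplyr)

freq_table <- function(df, ...) {
  df %>%
    count(...) %>%  # `...` is forwarded as-is; no defusing or splicing needed
    slice(1:5)
}

freq_table(df = starwars, species, hair_color)
```

Defusing with enquos() is still the better choice when the function needs to inspect or modify the captured expressions, for example to turn them into names.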
4.9.7 The as_label and as_name functions
rlang 0.3.1 replaced quo_name() with as_label() and as_name().
Sometimes we want to convert the argument to a string for use in our function output. For example, we may want to dynamically create variable names inside the function.
# What do as_label() and as_name() return?
# Input must be a string or a quosure
list(
  as_label_quotes = rlang::as_label("height"),
  as_label_quosure = rlang::as_label(quo(height)),
  as_name_quotes = rlang::as_name("height"),
  as_name_quosure = rlang::as_name(quo(height))
)
## $as_label_quotes
## [1] "\"height\""
##
## $as_label_quosure
## [1] "height"
##
## $as_name_quotes
## [1] "height"
##
## $as_name_quosure
## [1] "height"
I still don’t fully understand when to use one versus the other, but so far, they have been most useful for converting symbols/quosures to character strings.
continuous_table <- function(df, x) {
x <- enquo(x) # Must enquo first
mean_name <- paste0("mean_", rlang::as_name(x))
sum_name <- paste0("sum_", rlang::as_name(x))
df %>%
summarise(
!!mean_name := mean(!!x, na.rm = TRUE), # Must use !! and := to set the variable names
!!sum_name := sum(!!x, na.rm = TRUE)
)
}
continuous_table(starwars, height)
## # A tibble: 1 × 2
## mean_height sum_height
## <dbl> <int>
## 1 175. 14143
Alternatively, use the embrace operator with glue syntax to make the variable names (supported as of rlang 0.4.3).
continuous_table <- function(df, x) {
df %>%
summarise(
"mean_{{ x }}" := mean({{ x }}, na.rm = TRUE), # Must use := to set the variable names
"sum_{{ x }}" := sum({{ x }}, na.rm = TRUE)
)
}
continuous_table(starwars, height)
## # A tibble: 1 × 2
## mean_height sum_height
## <dbl> <int>
## 1 175. 14143
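On the question raised earlier in this section of when to use as_label() versus as_name(): as far as I can tell from the rlang documentation, as_name() is meant for inputs that are a single symbol or string, while as_label() produces a readable label for any expression. The sketch below, which assumes rlang is loaded, shows where the two diverge.

```r
library(rlang)

# as_label() accepts any expression and returns a human-readable label
as_label(quo(height * 2))

# as_name() only accepts symbols or strings, so a call is an error
try(as_name(quo(height * 2)))

# For a plain symbol the two agree
as_label(quo(height))
as_name(quo(height))
```

This suggests as_name() is the safer choice when the result will become a column name, and as_label() when the result is only used for display.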
4.9.8 The sym function
The sym() function takes a string as input and turns it into a symbol.
## starwars %>% summarize(mean(my_col))
Doesn’t work because R will look for a variable named “my_col” in the data frame “starwars”.
## starwars %>% summarize(mean("height"))
Doesn’t work because R will try to calculate the mean of the character string “height”.
## starwars %>% summarize(mean(height))
This looks like what we would type manually.
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 175.
And it works as expected
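The calls that produced the three qq_show() lines and the final result in this section are not shown. A sketch of what they may have looked like, assuming dplyr and rlang are installed and that my_col holds the string "height":

```r
library(dplyr)

my_col <- "height"

# 1. `my_col` is passed through as the literal symbol my_col
rlang::qq_show(starwars %>% summarize(mean(my_col)))

# 2. Injecting the string inserts the character value "height"
rlang::qq_show(starwars %>% summarize(mean(!!my_col)))

# 3. Converting the string to a symbol first yields a column reference
rlang::qq_show(starwars %>% summarize(mean(!!rlang::sym(my_col))))

# Running the third version for real
res <- starwars %>%
  summarize(mean = mean(!!rlang::sym(my_col), na.rm = TRUE))
res
```

When the column name is already a string, the .data pronoun (.data[[my_col]]) is an alternative that avoids creating a symbol at all.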
4.9.9 The syms function
Like sym(), but can convert multiple strings into a list of symbols.
my_cols <- rlang::syms(c("height", "mass"))
rlang::qq_show(
starwars %>%
summarize(
mean(!!my_cols)
)
)
## starwars %>% summarize(mean(<list: height, mass>))
Notice that unquoting with !! returns a list of symbols. To unlist them, we must use the splice operator.
my_cols <- rlang::syms(c("height", "mass"))
rlang::qq_show(
starwars %>%
summarize(
mean(!!!my_cols)
)
)
## starwars %>% summarize(mean(height, mass))
Of course, to make this meaningful we need to map the summary over height and mass separately.
my_cols <- rlang::syms(c("height", "mass"))
summarise_avg <- function(data, col) {
col <- enquo(col)
data %>%
summarise(avg = mean(!!col, na.rm = TRUE))
}
results <- purrr::map_df(my_cols, summarise_avg, data = starwars)
results
## # A tibble: 2 × 1
## avg
## <dbl>
## 1 175.
## 2 97.3
4.9.10 The rlang pronouns
The rlang package includes two (as of this writing) pronouns: .data and .env. I’m still slightly confused about what these pronouns are (see SO post here), but I’m getting more comfortable with how they are used.
I found a nice example on this blog post.
Because of data masking, filter looks for cyl and carb in mtcars and it returns rows where the value of cyl matches the value of carb.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
In this example, because num_cyl doesn’t exist in mtcars, filter will automatically look to the global environment and return rows where the value of cyl matches the value of num_cyl (a constant 6).
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Now, we create an object in the global environment that shares its name with a column in mtcars: carb. Because of data masking (and scoping rules), filter still looks for cyl and carb in mtcars first. Because carb exists in mtcars, filter returns rows where the value of cyl matches the value of mtcars$carb, not the carb object in the global environment.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
But, we can be more explicit (i.e., safer) about using mtcars$carb with the .data pronoun.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
OR
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
Similarly, we can use the .env pronoun to explicitly instruct filter to compare cyl to the carb object in the global environment.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
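The filter() calls behind the outputs in this section are not shown. The sketch below is a reconstruction that assumes dplyr is loaded and that num_cyl and carb were both set to 6, which is consistent with the printed rows.

```r
library(dplyr)

# Column vs. column: both names are found in mtcars
mtcars %>% filter(cyl == carb)

# num_cyl is not a column, so it is found in the calling environment
num_cyl <- 6
mtcars %>% filter(cyl == num_cyl)

# An environment object that shares a name with a column is masked by the column
carb <- 6
mtcars %>% filter(cyl == carb)

# .data makes the column lookup explicit (two equivalent spellings)
mtcars %>% filter(cyl == .data$carb)
mtcars %>% filter(cyl == .data[["carb"]])

# .env makes the environment lookup explicit
mtcars %>% filter(cyl == .env$carb)
```

Inside functions, the pronouns remove the ambiguity entirely: .data$carb can only ever mean the column and .env$carb can only ever mean the object.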
4.10 Ellipsis
In technical language, the three dots argument in R is called an ellipsis. And it means that the function is designed to take any number of named or unnamed arguments. The interesting question is: How do you write functions that make use of ellipsis? The answer is very simple: you simply convert the … to a list, like so:
## [1] "Hello" "World" "!"
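The function that produced this output is not shown. A minimal sketch that gives the same result, assuming a hypothetical f() that simply collects and flattens its arguments:

```r
f <- function(...) {
  dots <- list(...)  # Collect every argument passed through `...` into a list
  unlist(dots)       # Flatten the list to a vector for printing
}

f("Hello", "World", "!")
```

Note that list(...) evaluates each argument immediately, which is fine for ordinary values but is exactly what has to be avoided for unquoted column names.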
So, when should I use the ellipsis argument? That, again, is very simple: there are essentially two situations when you can use the three dots:
1. When it makes sense to call the function with a variable number of arguments. See the f function immediately above. Another very prominent example is the paste() function.
2. When, within your function, you call other functions, and these functions can have a variable number of arguments, either because (a) the called function is generic like print() or (b) the called function can be passed into the function as an argument, as for example with the FUN argument in apply() (apply <- function (X, MARGIN, FUN, ...)).
4.11 Dynamic dots
In addition to the base ellipsis syntax, rlang supports something it calls dynamic dots. Programming with dynamic dots (…) presents some opportunities and also some challenges.
- You can splice arguments saved in a list with the splice operator !!!.
- You can inject names with glue syntax on the left-hand side of :=.
- Trailing commas are ignored, making it easier to copy and paste lines of arguments.
If your function takes dots, adding support for dynamic features is as easy as collecting the dots with list2() instead of list(). See also dots_list(), which offers more control over the collection.
In general, passing ... to a function that supports dynamic dots causes your function to inherit the dynamic behavior.
In packages, document dynamic dots with this standard tag:
@param ... <[`dynamic-dots`][rlang::dyn-dots]> What these dots do.
f <- function(...) {
out <- rlang::list2(...)
rev(out)
}
# Trailing commas are ignored
f(this = "that", )
## $this
## [1] "that"
# Splice lists of arguments with `!!!`
x <- list(alpha = "first", omega = "last")
f(!!!x)
## $omega
## [1] "last"
##
## $alpha
## [1] "first"
# Inject a name using glue syntax
if (rlang::is_installed("glue")) {
nm <- "key"
f("{nm}" := "value")
f("prefix_{nm}" := "value")
}
## $prefix_key
## [1] "value"
Defuse and inject unquoted column names
## Error: object 'cyl' not found
## Error: object 'cyl' not found
## <list_of<quosure>>
##
## [[1]]
## <quosure>
## expr: ^cyl
## env: 0x103d7ce30
##
## [[2]]
## <quosure>
## expr: ^am
## env: 0x103d7ce30
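The code behind the errors and the list of quosures above is not shown. The sketch below is one way to reproduce the pattern, assuming dplyr and rlang are installed; the function bodies are guesses, and only one of the two failing calls is reconstructed.

```r
library(dplyr)

# Forwarding `...` to an ordinary function evaluates the arguments immediately,
# so R goes looking for an object called `cyl` and fails
f <- function(.data, ...) {
  list(...)
}
try(mtcars %>% f(cyl, am))

# Defusing with enquos() captures the expressions without evaluating them
f <- function(.data, ...) {
  rlang::enquos(...)
}
mtcars %>% f(cyl, am)
```

The environment recorded in each quosure is the one the expressions were written in, which is why the printed address differs from run to run.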
Now you can inject them into tidyverse functions with the splice operator:
# Must defuse first
f <- function(.data, ...) {
dot_vars <- enquos(...)
.data %>% count(!!!dot_vars)
}
mtcars %>% f(cyl, am)
## cyl am n
## 1 4 0 3
## 2 4 1 8
## 3 6 0 4
## 4 6 1 3
## 5 8 0 12
## 6 8 1 2
4.11.1 Convert quosures to strings
It took me a while to figure this out. The answer eventually came from: https://adv-r.hadley.nz/quasiquotation.html#quasi-motivation.
Start by using ensyms() instead of enquos() to return naked expressions instead of quosures (https://rlang.r-lib.org/reference/defusing-advanced.html).
## [[1]]
## cyl
##
## [[2]]
## am
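The function that printed the two symbols above is not shown. A minimal sketch, assuming rlang is installed and using a hypothetical f() that only defuses its dots:

```r
f <- function(.data, ...) {
  rlang::ensyms(...)  # Returns bare symbols rather than quosures
}

f(mtcars, cyl, am)
```

Unlike enquos(), ensyms() fails if an argument is anything other than a symbol or a string, which is a useful check when the arguments are supposed to be column names.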
Then use purrr::map() and rlang::as_string() or rlang::as_name() to convert symbols to character strings.
f <- function(.data, ...) {
dot_syms <- rlang::ensyms(...)
purrr::map(dot_syms, rlang::as_name)
}
mtcars %>% f(cyl, am)
## [[1]]
## [1] "cyl"
##
## [[2]]
## [1] "am"
4.11.2 Example: Multiple n-way tables
This example comes from when I was working on the freqtables package. I was trying to create a wrapper around a simple purrr iteration and wanted to use the dot arguments as names for the output list. In other words, take unquoted variable names, defuse and inject them for analysis, then turn them into character strings.
# Multiple n-way tables
freq_table2 <- function(.data, .freq_var, drop = FALSE) {
.data <- dplyr::count(.data, {{ .freq_var }}, .drop = drop)
.data
}
# For testing
# mtcars %>%
# group_by(am) %>%
# freq_table2(cyl)
# And if you want more than one table
# purrr::map(
# .x = quos(cyl, vs),
# .f = ~ mtcars %>% group_by(am) %>% freq_table2({{ .x }})
# )
# Make a wrapper
freq_tables <- function(.data, ...) {
# Defuse the user expression in `...` for calculations
dot_vars <- enquos(...)
# Make syms and then strings for naming the list
dot_syms <- rlang::ensyms(...)
dot_names <- purrr::map(dot_syms, rlang::as_name)
# Perform the calculations
purrr::map(
.x = dot_vars, # Could also use enquos(...) directly here
.f = ~ .data %>% freq_table2({{ .x }}) # Must use !! or {{
) %>%
rlang::set_names(dot_names)
}
mtcars %>%
freq_tables(cyl, vs)
## $cyl
## cyl n
## 1 4 11
## 2 6 7
## 3 8 14
##
## $vs
## vs n
## 1 0 18
## 2 1 14
4.12 Example: for Loop
In this example, I’m creating a table of summary statistics using the Starwars data. The table will compare some simple characteristics of the characters by species.
First, I’m going to reclassify every character as Human or Not Human
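The reclassification code is not shown. A sketch that is consistent with the group sizes printed below, assuming dplyr is loaded and that the new column is called human with values "Yes" and "No":

```r
library(dplyr)

# Rows with a missing species stay NA because the comparison itself returns NA
starwars <- starwars %>%
  mutate(human = if_else(species == "Human", "Yes", "No"))

count(starwars, human)
```

Keeping the missing values as NA is what allows the later steps to drop them with filter(!is.na(human)).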
Now I’m going to create the table shell
vars = 3 # Number of vars
rows = vars + 1 # Additional row for group sample size
table <- tibble(
Variable = vector(mode = "character", length = rows),
Human = vector(mode = "character", length = rows),
`Not Human` = vector(mode = "character", length = rows)
)
# N for Human
table[1, 2] <- paste0(
"(N = ",
filter(starwars, human == "Yes") %>% nrow() %>% format(big.mark = ","),
")"
)
# N for Not Human
table[1, 3] <- paste0(
"(N = ",
filter(starwars, human == "No") %>% nrow() %>% format(big.mark = ","),
")"
)
## # A tibble: 4 × 3
## Variable Human `Not Human`
## <chr> <chr> <chr>
## 1 "" "(N = 35)" "(N = 48)"
## 2 "" "" ""
## 3 "" "" ""
## 4 "" "" ""
Finally, I’ll fill in the table using a for loop. In this case, I just want to compare the mean height, mass, and birth year of humans and non-humans.
vars <- quos(height, mass, birth_year) # Create a list of quosures for the variables of interest
for(i in seq_along(vars)) {
table[i + 1, ] <- starwars %>% # Row of table to receive loop output
filter(!is.na(human)) %>%
group_by(human) %>%
summarise(Mean = mean(!!vars[[i]], na.rm = TRUE)) %>% # Use !! with vars[[i]]
mutate(Mean = round(Mean, 1) %>% format(nsmall = 1)) %>%
tidyr::pivot_wider(
names_from = human,
values_from = Mean
) %>%
mutate(Variable = rlang::as_name(vars[[i]])) %>% # Use as_name to get variable name for first column
select(Variable, Yes, No)
}
## # A tibble: 4 × 3
## Variable Human `Not Human`
## <chr> <chr> <chr>
## 1 "" "(N = 35)" (N = 48)
## 2 "height" "178.0" 172.4
## 3 "mass" " 81.3" 107.6
## 4 "birth_year" " 53.7" 139.3
4.13 Example: function
In this example, I’m creating a table of summary statistics using the Starwars data. The table will compare some simple characteristics of the characters by species.
First, I’m going to reclassify every character as Human or Not Human
Now I’m going to create the table shell
vars = 3 # Number of vars
rows = vars + 1 # Additional row for group sample size
table <- tibble(
Variable = vector(mode = "character", length = rows),
Human = vector(mode = "character", length = rows),
`Not Human` = vector(mode = "character", length = rows)
)
# N for Human
table[1, 2] <- paste0(
"(N = ",
filter(starwars, human == "Yes") %>% nrow() %>% format(big.mark = ","),
")"
)
# N for Not Human
table[1, 3] <- paste0(
"(N = ",
filter(starwars, human == "No") %>% nrow() %>% format(big.mark = ","),
")"
)
## # A tibble: 4 × 3
## Variable Human `Not Human`
## <chr> <chr> <chr>
## 1 "" "(N = 35)" "(N = 48)"
## 2 "" "" ""
## 3 "" "" ""
## 4 "" "" ""
Finally, I’ll fill in the table using a user-defined function. In this case, I just want to compare the mean height, mass, and birth year of humans and non-humans.
my_stats <- function(df, vars) {
df %>%
filter(!is.na(human)) %>%
group_by(human) %>%
# Calculate means
summarise(
across(
.cols = {{ vars }},
.fns = mean, na.rm = TRUE
)
) %>%
# Format the results
mutate(
across(
.cols = where(is.numeric),
.fns = ~ round(.x, 1) %>% format(nsmall = 1)
)
) %>%
# Restructure results to match the summary table
tidyr::pivot_wider(
names_from = human,
values_from = {{ vars }},
names_sep = "-"
) %>%
tidyr::pivot_longer(
cols = everything(),
names_to = c("Variable", ".value"),
names_sep = "-"
) %>%
# Reorder and rename the columns to match the output table
select(Variable, Human = Yes, `Not Human` = No)
}
# For testing
# my_stats(starwars, c(height, mass, birth_year))
# Or if you prefer to use ...
my_stats <- function(df, ...) {
vars <- enquos(...)
df %>%
filter(!is.na(human)) %>%
group_by(human) %>%
# Calculate means
summarise(
across(
.cols = c(!!!vars),
.fns = mean, na.rm = TRUE
)
)
}
# For testing
# my_stats(starwars, height, mass, birth_year)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(.cols = c(height, mass, birth_year), .fns = mean, na.rm =
## TRUE)`.
## ℹ In group 1: `human = "No"`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 7 × 3
## Variable Human `Not Human`
## <chr> <chr> <chr>
## 1 "" "(N = 35)" "(N = 48)"
## 2 "" "" ""
## 3 "" "" ""
## 4 "" "" ""
## 5 "height" "178.0" "172.4"
## 6 "mass" " 81.3" "107.6"
## 7 "birth_year" " 53.7" "139.3"
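The deprecation warning above recommends supplying extra arguments through an anonymous function rather than through the ... of across(). A sketch of the dots version rewritten that way, assuming dplyr 1.1.0 or later and R 4.1 or later (for the \(x) shorthand); my_stats_v2 and starwars_h are hypothetical names used only here:

```r
library(dplyr)

# Recreate the `human` column ("Yes"/"No"/NA) assumed by this example
starwars_h <- dplyr::starwars %>%
  mutate(human = if_else(species == "Human", "Yes", "No"))

my_stats_v2 <- function(df, ...) {
  vars <- enquos(...)
  df %>%
    filter(!is.na(human)) %>%
    group_by(human) %>%
    summarise(
      across(
        .cols = c(!!!vars),
        .fns  = \(x) mean(x, na.rm = TRUE)  # na.rm now lives inside the lambda
      )
    )
}

my_stats_v2(starwars_h, height, mass, birth_year)
```

The same change applies to the embrace version: replace `.fns = mean, na.rm = TRUE` with the anonymous function and leave `.cols = {{ vars }}` as it is.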
4.14 Nesting functions with data masking
One place where the embrace operator can get you in trouble is with nested functions. I ran into this problem when writing the codebook package. In the example below, notice that the name of the column we want to analyze (i.e., height) is passed to the x argument of the cb_add_summary_stats() function as a string (i.e., “height”), and then to the x argument of the cb_summary_stats_numeric() function, and then to the mean() function inside of the summarise() function. Along the way, the association between x and the height column is lost: embracing x captures the symbol x, which evaluates to the string “height” rather than to the column, so mean() receives a character value.
codebook <- function(df) {
x <- "height"
cb_add_summary_stats(df, x)
}
cb_add_summary_stats <- function(df, x) {
cb_summary_stats_numeric(df, x)
}
cb_summary_stats_numeric <- function(df, x) {
summary <- df %>%
summarise(mean = mean({{ x }}, na.rm = TRUE))
summary
}
codebook(study)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean = mean(x, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 NA
Let’s take a look at what R sees in cb_summary_stats_numeric().
cb_summary_stats_numeric <- function(df, x) {
rlang::qq_show(
summary <- df %>%
summarise(mean = mean({{ x }}, na.rm = TRUE))
)
}
codebook(study)
## summary <- df %>% summarise(mean = mean(^x, na.rm = TRUE))
But, what we want R to see is height instead of x. The simplest fix is to use the .data pronoun.
codebook <- function(df) {
x <- "height"
cb_add_summary_stats(df, x)
}
cb_add_summary_stats <- function(df, x) {
cb_summary_stats_numeric(df, x)
}
cb_summary_stats_numeric <- function(df, x) {
summary <- df %>%
summarise(mean = mean(.data[[x]], na.rm = TRUE))
summary
}
codebook(study)
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 72.3
4.15 Using purrr
I have a situation that I’m a little confused about. I think if I could understand the contrived situation below, then I could figure out my issue.
Essentially, when I use map(x, function(x) { !!x }) with tidyeval I get the result that I expect. When I try to use the purrr shortcut, i.e., map(x, ~ { !!. }) I get an error. I don’t understand why. If anybody has insight that they care to share, I would really appreciate it!
A reprex is below:
This works as expected:
vars <- quos(gender, species)
map_df(vars, function(x){
starwars %>%
group_by(!! x) %>%
summarise(mean(height, na.rm = TRUE))
})
## # A tibble: 41 × 3
## gender `mean(height, na.rm = TRUE)` species
## <chr> <dbl> <chr>
## 1 feminine 167. <NA>
## 2 masculine 177. <NA>
## 3 <NA> 175 <NA>
## 4 <NA> 79 Aleena
## 5 <NA> 198 Besalisk
## 6 <NA> 198 Cerean
## 7 <NA> 196 Chagrian
## 8 <NA> 168 Clawdite
## 9 <NA> 131. Droid
## 10 <NA> 112 Dug
## # ℹ 31 more rows
I’m not sure why this doesn’t work:
# vars <- quos(gender, species)
#
# map_df(vars, ~ {
# starwars %>%
# group_by(!! .) %>%
# summarise(mean(height, na.rm = TRUE))
# })
The answer is to use “.x” instead of “.”:
vars <- quos(gender, species)
map_df(vars, ~ {
starwars %>%
group_by(!! .x) %>%
summarise(mean(height, na.rm = TRUE))
})
## # A tibble: 41 × 3
## gender `mean(height, na.rm = TRUE)` species
## <chr> <dbl> <chr>
## 1 feminine 167. <NA>
## 2 masculine 177. <NA>
## 3 <NA> 175 <NA>
## 4 <NA> 79 Aleena
## 5 <NA> 198 Besalisk
## 6 <NA> 198 Cerean
## 7 <NA> 196 Chagrian
## 8 <NA> 168 Clawdite
## 9 <NA> 131. Droid
## 10 <NA> 112 Dug
## # ℹ 31 more rows
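However, I’m not entirely sure why. A likely explanation is that the magrittr pipe rebinds . to its left-hand side inside the pipeline, so !! . injects the starwars data frame rather than the current quosure, whereas .x is not touched by the pipe. When I have more time, I’d like to confirm this.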
## [[1]]
## <quosure>
## expr: ^gender
## env: global
##
## [[2]]
## <quosure>
## expr: ^species
## env: global
4.16 Other Quirks and Lessons Learned
4.16.1 When !! doesn’t work
I’ve noticed that using !! doesn’t always work. At this point, I’m not exactly sure of the rules governing when it works and when it doesn’t, but I do want to write down some examples and fixes.
Sometimes it’s my fault:
example <- function(df, var, ...) {
x <- enquo(var)
print(!!x) # This doesn't work - print() is not a quasiquotation context, so !! can't unquote the quosure here
}
starwars %>% example(hair_color)
## Error in `print()`:
## ! Quosures can only be unquoted within a quasiquotation context.
##
## # Bad: list(!!myquosure)
##
## # Good: dplyr::mutate(data, !!myquosure)
Fix:
example <- function(df, var, ...) {
x <- enquo(var)
df %>% select(!!x) %>% print()
}
starwars %>% example(hair_color)
## # A tibble: 87 × 1
## hair_color
## <chr>
## 1 blond
## 2 <NA>
## 3 <NA>
## 4 none
## 5 brown
## 6 brown, grey
## 7 brown
## 8 <NA>
## 9 black
## 10 auburn, white
## # ℹ 77 more rows
4.16.2 Unquoting inside non-dplyr functions
I’ve noticed some weirdness when trying to unquote quosures inside functions that are nested inside dplyr functions. For example, if_else() inside of mutate().
# This didn't use to work, but it does now
example <- function(df, var) {
x <- enquo(var)
df %>%
mutate(hair_color = if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## # A tibble: 87 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blonde fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, human <chr>
4.16.3 Using a quosure to create variable name in mutate
Additionally, sometimes there is some trickiness to naming (or overwriting) a variable name inside of mutate.
example <- function(df, var) {
x <- enquo(var)
df %>%
mutate(!!x = if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## Error: <text>:5:16: unexpected '='
## 4: df %>%
## 5: mutate(!!x =
## ^
Fix:
# This didn't use to work, but it does now
example <- function(df, var) {
x <- enquo(var)
df %>%
mutate(!!x := if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## # A tibble: 87 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blonde fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, human <chr>
Fix 2:
Must have !! in front of as_name(). Must use := instead of =.
example <- function(df, var) {
x <- enquo(var)
df %>%
mutate(!!rlang::as_name(x) := if_else(!!x == "blond", "blonde", !!x))
}
starwars %>% example(hair_color)
## # A tibble: 87 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blonde fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, human <chr>
Fix 3:
Alternatively, use the embrace operator with glue syntax to make the variable names (supported as of rlang 0.4.3).
example <- function(df, var) {
df %>%
mutate(
"{{ var }}" := if_else({{ var }} == "blond", "blonde", {{ var }}),
# Even create a new variable
"new_{{ var }}" := if_else({{ var }} == "blond", "blonde", {{ var }})
)
}
starwars %>% example(hair_color)
## # A tibble: 87 × 16
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blonde fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, human <chr>, new_hair_color <chr>
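When the column name arrives as a string rather than as a bare name, the same glue syntax works with single braces. The following is a sketch only, assuming a toy tibble and a hypothetical helper called `recode_blond()`; `.data[[var]]` is used to look up the column by its string name.

```r
library(dplyr)

# Toy data, invented for this sketch
toy <- tibble(hair_color = c("blond", "brown", NA))

# Hypothetical helper: `var` is a string, so use "{var}" (single braces)
# for the name and .data[[var]] for the column values
recode_blond <- function(df, var) {
  df %>%
    mutate(
      "{var}_fixed" := if_else(.data[[var]] == "blond", "blonde", .data[[var]])
    )
}

out <- recode_blond(toy, "hair_color")
out
```

Double braces are for function arguments that hold bare column names. Single braces are for ordinary variables that hold strings.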
4.16.4 Using a quosure to turn a variable name into a constant value
When I’m looping over many variables, I often want to create a variable in my output called “characteristic” or “variable” that captures the current variable name as a value.
example <- function(df, var) {
x <- enquo(var) # Make sure to use enquo here
df %>%
summarise(
Mean = mean(!!x, na.rm = TRUE)
) %>%
mutate(Characteristic = !!rlang::as_name(x)) %>% # Make sure to use !!as_name()
select(Characteristic, Mean)
}
starwars %>% example(height)
## # A tibble: 1 × 2
## Characteristic Mean
## <chr> <dbl>
## 1 height 175.
Alternatively:
example <- function(df, var) {
df %>%
summarise(
Mean = mean({{ var }}, na.rm = TRUE)
) %>%
mutate(Characteristic = !!rlang::as_name(enquo(var))) %>%
select(Characteristic, Mean)
}
starwars %>% example(height)
## # A tibble: 1 × 2
## Characteristic Mean
## <chr> <dbl>
## 1 height 175.
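The looping use case mentioned above might look something like the sketch below. It assumes a toy tibble and a hypothetical function named `mean_of()` that mirrors the function above. The variable names are converted to symbols with `rlang::syms()` and injected into each call with `!!`.

```r
library(dplyr)

# Toy data, invented for this sketch
toy <- tibble(height = c(150, 170, NA), mass = c(50, 70, 90))

# Hypothetical helper that mirrors the function above
mean_of <- function(df, var) {
  df %>%
    summarise(Mean = mean({{ var }}, na.rm = TRUE)) %>%
    mutate(Characteristic = !!rlang::as_name(rlang::enquo(var))) %>%
    select(Characteristic, Mean)
}

# Turn the strings into symbols, then inject each one into the call
vars <- rlang::syms(c("height", "mass"))
results <- lapply(vars, function(v) mean_of(toy, !!v)) %>% bind_rows()
results
```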
4.16.5 Convert a string to a quosure
Here are some useful websites:
https://github.com/tidyverse/rlang/issues/116
https://stackoverflow.com/questions/27975124/pass-arguments-to-dplyr-functions/44594223#44594223
Sometimes, I want to pass a variable name as a string to a function. It then needs to be converted to a symbol or quosure before it can be evaluated as a column.
4.16.5.1 Simple example - Now this works
my_col <- names(starwars)[3] # A variable name stored as a string ("mass")
starwars %>% select(!!my_col) # Now this works
## # A tibble: 87 × 1
## mass
## <dbl>
## 1 77
## 2 75
## 3 32
## 4 136
## 5 49
## 6 120
## 7 75
## 8 32
## 9 84
## 10 77
## # ℹ 77 more rows
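For selection specifically, the tidyselect helpers `all_of()` and `any_of()` accept a character vector of column names directly. This is a small sketch on a toy tibble, not a replacement for the approach above.

```r
library(dplyr)

# Toy data, invented for this sketch
toy <- tibble(name = c("a", "b"), mass = c(77, 75))

my_col <- "mass"

# all_of() errors if a named column is missing
selected <- toy %>% select(all_of(my_col))

# any_of() silently skips names that are not in the data
lenient <- toy %>% select(any_of(c("mass", "not_a_column")))

names(selected)
names(lenient)
```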
4.16.5.2 When the string is created inside the function
example <- function(df, var) {
x <- enquo(var) # First, turn var without the suffix into a quosure - must be first
squared <- paste(rlang::as_name(x), "squared", sep = "_") # Must use as_name(); the result is still just a string
df %>%
summarise(
Mean = mean(!!squared, na.rm = TRUE)
)
}
starwars %>% example(height)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `Mean = mean("height_squared", na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## # A tibble: 1 × 1
## Mean
## <dbl>
## 1 NA
Fix (Method preferred by Hadley and Lionel):
example <- function(df, var) {
x <- enquo(var) # First, turn var without the suffix into a quosure - must be first
squared <- paste(rlang::as_name(x), "squared", sep = "_") # Must use as_name()
squared <- rlang::sym(squared) # Wrap with sym()
df %>%
summarise(
Mean = mean(!!squared, na.rm = TRUE)
)
}
starwars %>% example(height)
## # A tibble: 1 × 1
## Mean
## <dbl>
## 1 31681.
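Another option, sketched here under the assumption that the suffixed column already exists in the data, is to skip `sym()` and look up the column by its string name with the `.data` pronoun. The toy tibble and the function name `mean_squared()` are made up for illustration.

```r
library(dplyr)

# Toy data with a pre-computed squared column, invented for this sketch
toy <- tibble(height = c(1, 2, 3)) %>%
  mutate(height_squared = height^2)

# Hypothetical helper: build the column name as a string,
# then subset the .data pronoun with it
mean_squared <- function(df, var) {
  name <- paste(rlang::as_name(rlang::enquo(var)), "squared", sep = "_")
  df %>%
    summarise(Mean = mean(.data[[name]], na.rm = TRUE))
}

out <- mean_squared(toy, height)
out
```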
4.16.5.3 Grouping by all columns in the data frame
I ran into this situation while checking for duplicate rows in APS data (DETECT pilot test).
starwars_2 <- starwars %>% select(-films, -vehicles, -starships) # Remove list columns
starwars_2 %>%
group_by(names(starwars)) %>%
filter(n() > 1) %>%
count() %>%
ungroup() %>%
select(n)
## Error in `group_by()`:
## ℹ In argument: `names(starwars)`.
## Caused by error:
## ! `names(starwars)` must be size 87 or 1, not 16.
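Fix: In this case, we could have used the built-in group_by_all(). Note that the scoped verbs, including group_by_all(), were superseded by across() in dplyr 1.0.0.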
starwars_2 <- starwars %>% select(-films, -vehicles, -starships) # Remove list columns
starwars_2 %>%
group_by_all() %>%
filter(n() > 1) %>%
count() %>%
ungroup() %>%
select(n)
## # A tibble: 0 × 1
## # ℹ 1 variable: n <int>
And, there are no duplicates.
Another, more general solution for using all column names is:
starwars_2 <- starwars %>% select(-films, -vehicles, -starships) # Remove list columns
my_cols <- starwars_2 %>% names() %>% rlang::syms()
starwars_2 %>%
group_by(!!!my_cols) %>% # Remember to use splice '!!!'
filter(n() > 1) %>%
count() %>%
ungroup() %>%
select(n)
## # A tibble: 0 × 1
## # ℹ 1 variable: n <int>
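In current dplyr, the same idea can be written with `across(everything())`. The sketch below uses a toy tibble that deliberately contains one duplicated row, so the result is not empty.

```r
library(dplyr)

# Toy data with one duplicated row, invented for this sketch
toy <- tibble(
  id  = c(1, 2, 2, 3),
  sex = c("F", "M", "M", "F")
)

# Group by every column, then keep groups that occur more than once
dups <- toy %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  count() %>%
  ungroup()

dups
```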
4.16.6 Setting a default function parameter value to NULL
I was trying to create a function that would produce histograms while working on L2C. Sometimes I wanted the histograms faceted by group and sometimes I didn’t. So, I wanted to set the facet variable to NULL by default. But I kept getting an error: “Error: object ‘variable name’ not found”.
Here is a reproducible example. It isn’t a histogram. Instead, it’s much simpler code, but it produces the same error and has the same solution.
# Won't work
test_null <- function(df, x = NULL) {
if (is.null(x)) {
dplyr::select(df, name)
} else {
dplyr::select(df, {{ x }})
}
}
# Produces an error
test_null(starwars, mass)
## Error in eval(expr, envir, enclos): object 'mass' not found
I used this SO post to find a solution.
# This works
test_null <- function(df, x = NULL) {
# First, enquo x
x_enquo <- rlang::enquo(x)
# Use rlang::quo_is_null to check for a null value
if (rlang::quo_is_null(x_enquo)) {
select(df, name)
} else {
# Inject the defused x with !!
select(df, !! x_enquo)
}
}
# Both calls now work
test_null(starwars)
test_null(starwars, mass)
## # A tibble: 87 × 1
## name
## <chr>
## 1 Luke Skywalker
## 2 C-3PO
## 3 R2-D2
## 4 Darth Vader
## 5 Leia Organa
## 6 Owen Lars
## 7 Beru Whitesun Lars
## 8 R5-D4
## 9 Biggs Darklighter
## 10 Obi-Wan Kenobi
## # ℹ 77 more rows
## # A tibble: 87 × 1
## mass
## <dbl>
## 1 77
## 2 75
## 3 32
## 4 136
## 5 49
## 6 120
## 7 75
## 8 32
## 9 84
## 10 77
## # ℹ 77 more rows
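The first version fails because `is.null(x)` forces R to evaluate `x`, and there is no object called `mass` outside of the data frame. Defusing the argument first avoids that evaluation. The same pattern carries over to an optional grouping variable, which is closer to the faceting use case. This is only a sketch: the toy tibble and the function name `mean_height()` are assumptions made for illustration.

```r
library(dplyr)

# Toy data, invented for this sketch
toy <- tibble(
  sex    = c("F", "M", "M"),
  height = c(160, 170, 180)
)

# Hypothetical helper with an optional grouping variable
mean_height <- function(df, by = NULL) {
  # Defuse first so that `by` is never evaluated as an ordinary object
  by_quo <- rlang::enquo(by)
  if (rlang::quo_is_null(by_quo)) {
    summarise(df, Mean = mean(height))
  } else {
    df %>%
      group_by(!!by_quo) %>%
      summarise(Mean = mean(height))
  }
}

overall <- mean_height(toy)
by_sex  <- mean_height(toy, sex)
```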
4.17 Example I created for Steph Yap
Need to clean this up, but I don’t have time now.
Here is a worked example using some toy data:
aps_cleaned <- tibble(
case_num = 1:3,
valid_physical_neglect = c(0, 1, 0),
valid_sexual_abuse = 0
)
discrepancies_valid_physical_neglect <- tibble(
case_num = 1,
valid_physical_neglect = 1
)
discrepancies_valid_sexual_abuse <- tibble(
case_num = 3,
valid_sexual_abuse = 1
)
Create the function
Here is a reference to help with the tidy evaluation stuff: https://dplyr.tidyverse.org/articles/programming.html
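join_aps <- function(.data = aps_cleaned, join_df, valid_col) {
# Build the suffixed column names that left_join() creates (e.g., "valid_sexual_abuse.x")
# as_name() replaces the older quo_name()
col_name <- rlang::as_name(enquo(valid_col))
col_x <- sym(paste0(col_name, ".x"))
col_y <- sym(paste0(col_name, ".y"))
.data %>%
left_join(join_df, by = "case_num") %>%
mutate(
# Use the value from join_df (.y) when there is one; otherwise keep the original (.x)
# col_x and col_y are symbols we created, not function arguments, so inject them with !!
"{{valid_col}}_cleaned" := if_else(
is.na(!!col_y), !!col_x, !!col_y
)
)
}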
Test function