35 Column-wise Operations in dplyr

Throughout the chapters in this book we have learned to do a really vast array of useful data transformations and statistical analyses with the help of the dplyr package.

So far, however, we’ve always done these transformations and statistical analyses on one column of our data frame at a time. There isn’t anything inherently “wrong” with this approach, but, for reasons we’ve already discussed, there are often advantages to telling R what you want to do one time, and then asking R to do that thing repeatedly across all, or a subset of, the columns in your data frame. That is exactly what dplyr’s across() function allows us to do.

There are so many ways we might want to use the across() function in our R programs. We can’t begin to cover, or even imagine, them all. Instead, the goal of this chapter is just to provide you with an overview of the across() function and show you some examples of using it with filter(), mutate(), and summarise() to get you thinking about how you might want to use it in your R programs.

Before we discuss further, let’s take a look at a quick example. The first thing we will need to do is load dplyr.

library(dplyr, warn.conflicts = FALSE)

Then, we will simulate some data. In this case, we are creating a data frame that contains three columns of 10 random numbers:

set.seed(123)
df_xyz <- tibble(
  row = 1:10,
  x   = rnorm(10),
  y   = rnorm(10),
  z   = rnorm(10)
) %>% 
  print()
## # A tibble: 10 × 4
##      row       x      y      z
##    <int>   <dbl>  <dbl>  <dbl>
##  1     1 -0.560   1.22  -1.07 
##  2     2 -0.230   0.360 -0.218
##  3     3  1.56    0.401 -1.03 
##  4     4  0.0705  0.111 -0.729
##  5     5  0.129  -0.556 -0.625
##  6     6  1.72    1.79  -1.69 
##  7     7  0.461   0.498  0.838
##  8     8 -1.27   -1.97   0.153
##  9     9 -0.687   0.701 -1.14 
## 10    10 -0.446  -0.473  1.25

Up to this point, if we wanted to find the mean of each column, we would probably have written code like this:

df_xyz %>% 
  summarise(
    x_mean = mean(x),
    y_mean = mean(y),
    z_mean = mean(y)
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1 0.0746  0.209  0.209

With the help of the across() function, we can now get the mean of each column like this:

df_xyz %>%
  summarise(
    across(
      .cols  = c(x:z),
      .fns   = mean,
      .names = "{col}_mean"
    )
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1 0.0746  0.209 -0.425

Now, you might ask why this is a better approach. Fair question.

In this case, using across() doesn’t actually reduce the number of lines of code we wrote. In fact, we wrote two additional lines when we used the across() function. However, imagine if we added 20 additional columns to our data frame. Using the first approach, we would have to write 20 additional lines of code inside the summarise() function. Using the across() approach, we wouldn’t have to add any additional code at all. We would simply update the value we pass to the .cols argument.

Perhaps more importantly, did you notice that we “accidentally” forgot to replace y with z when we copied and pasted z_mean = mean(y) in the code chunk for the first approach? If not, go back and take a look. That mistake is fairly easy to catch and fix in this very simple example. But, in real-world projects, mistakes like this are easy to make, and not always so easy to catch. We are much less likely to make similar mistakes when we use across().

35.1 The across() function

The across() function is part of the dplyr package. We will always use across() inside of one of the dplyr verbs we’ve been learning about. Specifically, mutate(), and summarise(). We will not use across() outside of the dplyr verbs. Additionally, we will always use across() within the context of a data frame (as opposed to a vector, matrix, or some other data structure).

To view the help documentation for across(), you can copy and paste ?dplyr::across into your R console. If you do, you will see that across() has four arguments. They are:

1️⃣.cols. The value we pass to this argument should be columns of the data frame we want to operate on. We can once again use tidy-select argument modifiers here. In the example above, we used c(x:z) to tell R that we wanted to operate on columns x through z (inclusive). If we had also wanted the mean of the row column for some reason, we could have used the everything() tidy-select modifier to tell R that we wanted to operate on all of the columns in the data frame.

2️⃣.fns. This is where you tell across() what function, or functions, you want to apply to the columns you selected in .cols. In the example above, we passed the mean function to the .fns argument. Notice that we typed mean without the parentheses (i.e., mean()).

3️⃣.... In this case, the ... argument is where we pass any additional arguments to the function we passed to the .fns argument. For example, we passed the mean function to the .fns argument above. In the data frame above, none of the columns had any missing values. Let’s go ahead and add some missing values so that we can take a look at how ... works in across().

df_xyz$x[2] <- NA_real_
df_xyz$y[4] <- NA_real_
df_xyz$z[6] <- NA_real_
df_xyz
## # A tibble: 10 × 4
##      row       x      y      z
##    <int>   <dbl>  <dbl>  <dbl>
##  1     1 -0.560   1.22  -1.07 
##  2     2 NA       0.360 -0.218
##  3     3  1.56    0.401 -1.03 
##  4     4  0.0705 NA     -0.729
##  5     5  0.129  -0.556 -0.625
##  6     6  1.72    1.79  NA    
##  7     7  0.461   0.498  0.838
##  8     8 -1.27   -1.97   0.153
##  9     9 -0.687   0.701 -1.14 
## 10    10 -0.446  -0.473  1.25

As we’ve already seen many times, R won’t drop the missing values and carry out a complete case analysis by default:

df_xyz %>% 
  summarise(
    x_mean = mean(x),
    y_mean = mean(y),
    z_mean = mean(y)
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1     NA     NA     NA

Instead, we have to explicitly tell R to carry out a complete case analysis. We can do so by filtering our rows with missing data (more on this later) or by changing the value of the mean() function’s na.rm argument from FALSE (the default) to TRUE:

df_xyz %>% 
  summarise(
    x_mean = mean(x, na.rm = TRUE),
    y_mean = mean(y, na.rm = TRUE),
    z_mean = mean(z, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1  0.108  0.220 -0.284

When we use across(), we will need to pass the na.rm = TRUE to the mean() function in across()’s ... argument like this:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean,
      na.rm  = TRUE, # Passing na.rm = TRUE to the ... argument
      .names = "{col}_mean"
    )
  )
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(.cols = everything(), .fns = mean, na.rm = TRUE, .names = "{col}_mean")`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 1 × 4
##   row_mean x_mean y_mean z_mean
##      <dbl>  <dbl>  <dbl>  <dbl>
## 1      5.5  0.108  0.220 -0.284

Notice that we do not actually type out ... = or anything like that.

4️⃣.names. You can use this argument to adjust the column names that will result from the operation you pass to .fns. In the example above, we used the special {cols} keyword to use each of the column names that were passed to the .cols argument as the first part of each of the new columns’ names. Then, we asked R to add a literal underscore and the word “mean” because these are all mean values. That resulted in the new column names you see above. The default value for .names is just {cols}. So, if we hadn’t modified the value passed to the .names argument, our results would have looked like this:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean,
      na.rm  = TRUE
    )
  )
## # A tibble: 1 × 4
##     row     x     y      z
##   <dbl> <dbl> <dbl>  <dbl>
## 1   5.5 0.108 0.220 -0.284

There is also a special {fn} keyword that we can use to pass the name of each of the functions we used in .fns as part of the new column names. However, in order to get {fn} to work the way we want it to, we have to pass a list of name-function pairs to the .fns argument. Let me show you what we mean.

First, we will keep the code exactly as it was, but replace “mean” with “{fn}” in the .names argument:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean, 
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 4
##   row_1   x_1   y_1    z_1
##   <dbl> <dbl> <dbl>  <dbl>
## 1   5.5 0.108 0.220 -0.284

This is not the result we wanted. Because, we didn’t name the function that we passed to .fns, across() essentially used “function number 1” as its name. In order to get the result we want, we need to pass a list of name-function pairs to the .fns argument like this:

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(mean = mean),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 4
##   row_mean x_mean y_mean z_mean
##      <dbl>  <dbl>  <dbl>  <dbl>
## 1      5.5  0.108  0.220 -0.284

Although it may not be self-evident from just looking at the code above, the first mean in the list(mean = mean) name-function pair is a name that we are choosing to be passed to the new column names. Theoretically, we could have picked any name. For example:

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(r4epi = mean),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 4
##   row_r4epi x_r4epi y_r4epi z_r4epi
##       <dbl>   <dbl>   <dbl>   <dbl>
## 1       5.5   0.108   0.220  -0.284

The second mean in the list(mean = mean) name-function pair is the name of the actual function we want to apply to the columns in .cols. This part of the name-function pair must be the name of the function that we actually want to apply to the columns in .cols. Otherwise, we will get an error:

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(mean = r4epi),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## Error in `summarise()`:
## ℹ In argument: `across(...)`.
## Caused by error in `is_missing()`:
## ! object 'r4epi' not found

An additional advantage of passing a list of name-function pairs to the .fns argument is that we can pass multiple functions at once. For example, let’s say that we want the minimum and maximum value of each column in our data frame. Without across() we might do that analysis like this:

df_xyz %>% 
  summarise(
    x_min = min(x, na.rm = TRUE),
    x_max = max(x, na.rm = TRUE),
    y_min = min(y, na.rm = TRUE),
    y_max = max(y, na.rm = TRUE),
    z_min = min(z, na.rm = TRUE),
    z_max = max(z, na.rm = TRUE)
  )
## # A tibble: 1 × 6
##   x_min x_max y_min y_max z_min z_max
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.27  1.72 -1.97  1.79 -1.14  1.25

But, we can simply pass min and max as a list of name-function pairs if we use across():

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(min = min, max = max),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 8
##   row_min row_max x_min x_max y_min y_max z_min z_max
##     <int>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       1      10 -1.27  1.72 -1.97  1.79 -1.14  1.25

How great is that?!?

So, we’ve seen how to pass an individual function to the .fns argument and we’ve seen how to pass a list containing multiple functions to the .fns argument. There is actually a third syntax for passing functions to the .fns argument. The across() documentation calls it “a purrr-style lambda”. This can be a little bit confusing, so I’m going to show you an example, and then walk through it step by step.

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = ~ mean(.x, na.rm = TRUE),
      .names = "{col}_mean"
    )
  )
## # A tibble: 1 × 4
##   row_mean x_mean y_mean z_mean
##      <dbl>  <dbl>  <dbl>  <dbl>
## 1      5.5  0.108  0.220 -0.284

The purrr-style lambda always begins with the tilde symbol (~). Then we type out a function call behind the tilde symbol. We place the special .x symbol inside the function call where we would normally want to type the name of the column we want the function to operate on. The across() function will then substitute each column name we passed to the .cols argument for .x sequentially. In the example above, there isn’t really any good reason to use this syntax. However, this syntax can be useful at times. We will see some examples below.

35.2 Across with mutate

We’ve already seen a number of examples of manipulating columns of our data frames using the mutate() function. In this section, we are going to take a look at two examples where using the across() function inside mutate() will allow us to apply the same manipulation to multiple columns in our data frame at once.

Let’s go ahead and simulate the same demographics data frame we simulated for the recoding missing section of the conditional operations chapter. Let’s also add two new columns: a four-category education column and a six-category income column. For all columns except id and age, a value of 7 represents “Don’t know” and a value of 9 represents “refused.”

set.seed(123)
demographics <- tibble(
  id       = 1:10,
  age      = c(sample(1:30, 9, TRUE), NA),
  race     = c(1, 2, 1, 4, 7, 1, 2, 9, 1, 3),
  hispanic = c(7, 0, 1, 0, 1, 0, 1, 9, 0, 1),
  edu_4cat = c(4, 2, 9, 1, 2, 3, 4, 9, 3, 3),
  inc_6cat = c(1, 4, 1, 1, 5, 3, 2, 2, 7, 9)
) %>% 
  print()
## # A tibble: 10 × 6
##       id   age  race hispanic edu_4cat inc_6cat
##    <int> <int> <dbl>    <dbl>    <dbl>    <dbl>
##  1     1    15     1        7        4        1
##  2     2    19     2        0        2        4
##  3     3    14     1        1        9        1
##  4     4     3     4        0        1        1
##  5     5    10     7        1        2        5
##  6     6    18     1        0        3        3
##  7     7    22     2        1        4        2
##  8     8    11     9        9        9        2
##  9     9     5     1        0        3        7
## 10    10    NA     3        1        3        9

When working with data like this, it’s common to want to recode all the 7’s and 9’s to NA’s. We saw how to do that one column at a time already:

demographics %>% 
  mutate(
    race     = if_else(race == 7 | race == 9, NA_real_, race),
    hispanic = if_else(race == 7 | hispanic == 9, NA_real_, hispanic),
    edu_4cat = if_else(edu_4cat == 7 | edu_4cat == 9, NA_real_, edu_4cat)
  )
## # A tibble: 10 × 6
##       id   age  race hispanic edu_4cat inc_6cat
##    <int> <int> <dbl>    <dbl>    <dbl>    <dbl>
##  1     1    15     1        7        4        1
##  2     2    19     2        0        2        4
##  3     3    14     1        1       NA        1
##  4     4     3     4        0        1        1
##  5     5    10    NA       NA        2        5
##  6     6    18     1        0        3        3
##  7     7    22     2        1        4        2
##  8     8    11    NA       NA       NA        2
##  9     9     5     1        0        3        7
## 10    10    NA     3        1        3        9

🚩In the code chunk above, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code.

Also, did you notice that we forgot to replace race with hispanic in hispanic = if_else(race == 7 | hispanic == 9, NA_real_, hispanic)? This time, we didn’t write “forgot” in quotes because we really did forget and only noticed it later. In this case, the error caused a value of 1 to be recoded to NA in the hispanic column. These typos we’ve been talking about really do happen – even to me!

Here’s how we can use across() in this situation:

demographics %>% 
  mutate(
    across(
      .cols = c(-id, -age),
      .fns  = ~ if_else(.x == 7 | .x == 9, NA_real_, .x)
    )
  )
## # A tibble: 10 × 6
##       id   age  race hispanic edu_4cat inc_6cat
##    <int> <int> <dbl>    <dbl>    <dbl>    <dbl>
##  1     1    15     1       NA        4        1
##  2     2    19     2        0        2        4
##  3     3    14     1        1       NA        1
##  4     4     3     4        0        1        1
##  5     5    10    NA        1        2        5
##  6     6    18     1        0        3        3
##  7     7    22     2        1        4        2
##  8     8    11    NA       NA       NA        2
##  9     9     5     1        0        3       NA
## 10    10    NA     3        1        3       NA

👆Here’s what we did above:

  • We used a purrr-style lambda to replace 7’s and 9’s in all columns in our data frame, except id and age, with NA.

  • Remember, the special .x symbol is just shorthand for each column passed to the .cols argument.

As another example, let’s say that we are once again working with data from a drug trial that includes a list of side effects for each person:

set.seed(123)
drug_trial <- tibble(
  id           = 1:10,
  se_headache  = sample(0:1, 10, TRUE),
  se_diarrhea  = sample(0:1, 10, TRUE),
  se_dry_mouth = sample(0:1, 10, TRUE),
  se_nausea    = sample(0:1, 10, TRUE)
) %>% 
 print()
## # A tibble: 10 × 5
##       id se_headache se_diarrhea se_dry_mouth se_nausea
##    <int>       <int>       <int>        <int>     <int>
##  1     1           0           1            0         0
##  2     2           0           1            1         1
##  3     3           0           1            0         0
##  4     4           1           0            0         1
##  5     5           0           1            0         1
##  6     6           1           0            0         0
##  7     7           1           1            1         0
##  8     8           1           0            1         0
##  9     9           0           0            0         0
## 10    10           0           0            1         1

Now, we want to create a factor version of each of the side effect columns. We’ve already learned how to do so one column at a time:

drug_trial %>% 
  mutate(
    se_headache_f  = factor(se_headache, 0:1, c("No", "Yes")),
    se_diarrhea_f  = factor(se_diarrhea, 0:1, c("No", "Yes")),
    se_dry_mouth_f = factor(se_dry_mouth, 0:1, c("No", "Yes"))
  )
## # A tibble: 10 × 8
##       id se_headache se_diarrhea se_dry_mouth se_nausea se_headache_f se_diarrhea_f se_dry_mouth_f
##    <int>       <int>       <int>        <int>     <int> <fct>         <fct>         <fct>         
##  1     1           0           1            0         0 No            Yes           No            
##  2     2           0           1            1         1 No            Yes           Yes           
##  3     3           0           1            0         0 No            Yes           No            
##  4     4           1           0            0         1 Yes           No            No            
##  5     5           0           1            0         1 No            Yes           No            
##  6     6           1           0            0         0 Yes           No            No            
##  7     7           1           1            1         0 Yes           Yes           Yes           
##  8     8           1           0            1         0 Yes           No            Yes           
##  9     9           0           0            0         0 No            No            No            
## 10    10           0           0            1         1 No            No            Yes

🚩Once again, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code. Here’s how we can use across() to do so:

drug_trial %>% 
  mutate(
    across(
      .cols  = starts_with("se"),
      .fns   = ~ factor(.x, 0:1, c("No", "Yes")),
      .names = "{col}_f"
    )
  )
## # A tibble: 10 × 9
##       id se_headache se_diarrhea se_dry_mouth se_nausea se_headache_f se_diarrhea_f se_dry_mouth_f se_nausea_f
##    <int>       <int>       <int>        <int>     <int> <fct>         <fct>         <fct>          <fct>      
##  1     1           0           1            0         0 No            Yes           No             No         
##  2     2           0           1            1         1 No            Yes           Yes            Yes        
##  3     3           0           1            0         0 No            Yes           No             No         
##  4     4           1           0            0         1 Yes           No            No             Yes        
##  5     5           0           1            0         1 No            Yes           No             Yes        
##  6     6           1           0            0         0 Yes           No            No             No         
##  7     7           1           1            1         0 Yes           Yes           Yes            No         
##  8     8           1           0            1         0 Yes           No            Yes            No         
##  9     9           0           0            0         0 No            No            No             No         
## 10    10           0           0            1         1 No            No            Yes            Yes

👆Here’s what we did above:

  • We used a purrr-style lambda to create a factor version of all the side effect columns in our data frame.

  • We used the .names argument to add an “_f” to the end of the new column names.

35.3 Across with summarise

Let’s return to the ehr data frame we used in the chapter on working with character strings for our first example of using across() inside of summarise.

You may click here to download this file to your computer.

# We will need readr and stringr in the examples below
library(readr)
library(stringr)
# Read in the data
ehr <- read_rds("data/ehr.Rds")

For this example, the only column we will concern ourselves with is the symptoms column:

symptoms <- ehr %>% 
  select(symptoms) %>% 
  print()
## # A tibble: 15 × 1
##    symptoms                            
##    <chr>                               
##  1 "\"Pain\", \"Headache\", \"Nausea\""
##  2 "Pain"                              
##  3 "Pain"                              
##  4 "\"Nausea\", \"Headache\""          
##  5 "\"Pain\", \"Headache\""            
##  6 "\"Nausea\", \"Headache\""          
##  7 "Pain"                              
##  8  <NA>                               
##  9 "Pain"                              
## 10  <NA>                               
## 11 "\"Nausea\", \"Headache\""          
## 12 "\"Headache\", \"Pain\", \"Nausea\""
## 13 "Headache"                          
## 14 "\"Headache\", \"Pain\", \"Nausea\""
## 15  <NA>

You may recall that we created dummy variables for each symptom like this:

symptoms <- symptoms %>% 
  mutate(
    pain     = str_detect(symptoms, "Pain"),
    headache = str_detect(symptoms, "Headache"),
    nausea   = str_detect(symptoms, "Nausea")
  ) %>% 
  print()
## # A tibble: 15 × 4
##    symptoms                             pain  headache nausea
##    <chr>                                <lgl> <lgl>    <lgl> 
##  1 "\"Pain\", \"Headache\", \"Nausea\"" TRUE  TRUE     TRUE  
##  2 "Pain"                               TRUE  FALSE    FALSE 
##  3 "Pain"                               TRUE  FALSE    FALSE 
##  4 "\"Nausea\", \"Headache\""           FALSE TRUE     TRUE  
##  5 "\"Pain\", \"Headache\""             TRUE  TRUE     FALSE 
##  6 "\"Nausea\", \"Headache\""           FALSE TRUE     TRUE  
##  7 "Pain"                               TRUE  FALSE    FALSE 
##  8  <NA>                                NA    NA       NA    
##  9 "Pain"                               TRUE  FALSE    FALSE 
## 10  <NA>                                NA    NA       NA    
## 11 "\"Nausea\", \"Headache\""           FALSE TRUE     TRUE  
## 12 "\"Headache\", \"Pain\", \"Nausea\"" TRUE  TRUE     TRUE  
## 13 "Headache"                           FALSE TRUE     FALSE 
## 14 "\"Headache\", \"Pain\", \"Nausea\"" TRUE  TRUE     TRUE  
## 15  <NA>                                NA    NA       NA

🗒Side Note: Some of you may have noticed that we repeated ourselves more than twice in the code chunk above and thought about using across() to remove it. Unfortunately, across() won’t solve our problem in this situation. We will need some of the tools that we learn about in later chapters if we want to remove this repetition.

And finally, we used the table() function to get a count of how many people reported having a headache:

table(symptoms$headache)
## 
## FALSE  TRUE 
##     4     8

This is where the example stopped in the chapter on working with character strings. However, what if we wanted to know how many people reported the other symptoms as well? Well, we could repeatedly call the table() function:

table(symptoms$pain)
## 
## FALSE  TRUE 
##     4     8
table(symptoms$nausea)
## 
## FALSE  TRUE 
##     6     6

But, that would cause us to copy and paste repeatedly. Additionally, wouldn’t it be nice to view these counts in a way that makes them easier to compare? One solution would be to use summarise() like this:

symptoms %>% 
  summarise(
    had_headache = sum(headache, na.rm = TRUE),
    had_pain     = sum(pain, na.rm = TRUE),
    had_nausea   = sum(nausea, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   had_headache had_pain had_nausea
##          <int>    <int>      <int>
## 1            8        8          6

This works, but we can do better with across():

symptoms %>% 
  summarise(
    across(
      .cols  = c(headache, pain, nausea),
      .fns   = ~ sum(.x, na.rm = TRUE)
    )
  )
## # A tibble: 1 × 3
##   headache  pain nausea
##      <int> <int>  <int>
## 1        8     8      6

Great! But, wouldn’t it be nice to know the proportion of people with each symptom as well? You may recall that R treats TRUE and FALSE as 1 and 0 when used in a mathematical operation. Additionally, you may already be aware that the mean of a set of 1’s and 0’s is equal to the proportion of 1’s in the set. For example, there are three ones and three zeros in the set (1, 1, 1, 0, 0, 0). The proportion of 1’s in the set is 3 out of 6, which is 0.5. Equivalently, the mean value of the set is (1 + 1 + 1 + 0 + 0 + 0) / 6, which equals 3 / 6, which is 0.5. So, when we have dummy variables like headache, pain, and nausea above, passing them to the mean() function returns the proportion of TRUE values. In this case, the proportion of people who had each symptom. We know we can do that calculation like this:

symptoms %>% 
  summarise(
    had_headache = mean(headache, na.rm = TRUE),
    had_pain     = mean(pain, na.rm = TRUE),
    had_nausea   = mean(nausea, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   had_headache had_pain had_nausea
##          <dbl>    <dbl>      <dbl>
## 1        0.667    0.667        0.5

As before, we can do better with the across() function like this:

symptoms %>% 
  summarise(
    across(
      .cols = c(pain, headache, nausea),
      .fns  = ~ mean(.x, na.rm = TRUE)
    )
  )
## # A tibble: 1 × 3
##    pain headache nausea
##   <dbl>    <dbl>  <dbl>
## 1 0.667    0.667    0.5

Now, at this point, we might think, “wouldn’t it be nice to see the count and the proportion in the same result?” Well, we can do that by supplying our purrr-style lambdas as functions in a list of name-function pairs like this:

symptom_summary <- symptoms %>% 
  summarise(
    across(
      .cols = c(pain, headache, nausea),
      .fns  = list(
        count = ~ sum(.x, na.rm = TRUE),
        prop  = ~ mean(.x, na.rm = TRUE)
      )
    )
  ) %>% 
  print()
## # A tibble: 1 × 6
##   pain_count pain_prop headache_count headache_prop nausea_count nausea_prop
##        <int>     <dbl>          <int>         <dbl>        <int>       <dbl>
## 1          8     0.667              8         0.667            6         0.5

In this case, it’s probably fine to stop here. But, what if we had 20 or 30 symptoms that we were analyzing? It would be really difficult to read and compare them arranged horizontally like this, wouldn’t it?

Do you recall us discussing restructuring our results in the chapter on restructuring data frames? This is a circumstance where we might want to use pivot_longer() to make our results easier to read and interpret:

symptom_summary %>% 
  tidyr::pivot_longer(
    cols      = everything(),
    names_to  = c("symptom", ".value"),
    names_sep = "_"
  )
## # A tibble: 3 × 3
##   symptom  count  prop
##   <chr>    <int> <dbl>
## 1 pain         8 0.667
## 2 headache     8 0.667
## 3 nausea       6 0.5

There! Isn’t that result much easier to read?

For our final example of this section, let’s return the first example from the writing functions chapter. We started with some simulated study data:

study <- tibble(
  age       = c(32, 30, 32, 29, 24, 38, 25, 24, 48, 29, 22, 29, 24, 28, 24, 25, 
                25, 22, 25, 24, 25, 24, 23, 24, 31, 24, 29, 24, 22, 23, 26, 23, 
                24, 25, 24, 33, 27, 25, 26, 26, 26, 26, 26, 27, 24, 43, 25, 24, 
                27, 28, 29, 24, 26, 28, 25, 24, 26, 24, 26, 31, 24, 26, 31, 34, 
                26, 25, 27, NA),
  age_group = c(2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 
                1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 
                2, 1, 1, 1, NA),
  gender    = c(2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
                1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 
                1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 
                1, 1, 2, 1, NA),
  ht_in     = c(70, 63, 62, 67, 67, 58, 64, 69, 65, 68, 63, 68, 69, 66, 67, 65, 
                64, 75, 67, 63, 60, 67, 64, 73, 62, 69, 67, 62, 68, 66, 66, 62, 
                64, 68, NA, 68, 70, 68, 68, 66, 71, 61, 62, 64, 64, 63, 67, 66, 
                69, 76, NA, 63, 64, 65, 65, 71, 66, 65, 65, 71, 64, 71, 60, 62, 
                61, 69, 66, NA),
  wt_lbs    = c(216, 106, 145, 195, 143, 125, 138, 140, 158, 167, 145, 297, 146, 
                125, 111, 125, 130, 182, 170, 121, 98, 150, 132, 250, 137, 124, 
                186, 148, 134, 155, 122, 142, 110, 132, 188, 176, 188, 166, 136, 
                147, 178, 125, 102, 140, 139, 60, 147, 147, 141, 232, 186, 212, 
                110, 110, 115, 154, 140, 150, 130, NA, 171, 156, 92, 122, 102, 
                163, 141, NA),
  bmi       = c(30.99, 18.78, 26.52, 30.54, 22.39, 26.12, 23.69, 20.67, 26.29, 
                25.39, 25.68, 45.15, 21.56, 20.17, 17.38, 20.8, 22.31, 22.75, 
                26.62, 21.43, 19.14, 23.49, 22.66, 32.98, 25.05, 18.31, 29.13, 
                27.07, 20.37, 25.01, 19.69, 25.97, 18.88, 20.07, NA, 26.76, 
                26.97, 25.24, 20.68, 23.72, 24.82, 23.62, 18.65, 24.03, 23.86, 
                10.63, 23.02, 23.72, 20.82, 28.24, NA, 37.55, 18.88, 18.3, 
                19.13, 21.48, 22.59, 24.96, 21.63, NA, 29.35, 21.76, 17.97, 
                22.31, 19.27, 24.07, 22.76, NA),
  bmi_3cat  = c(3, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 
                1, 1, 3, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, NA, 2, 2, 2, 1, 1, 1, 1, 
                1, 1, 1, 1, 1, 1, 1, 2, NA, 3, 1, 1, 1, 1, 1, 1, 1, NA, 2, 1, 
                1, 1, 1, 1, 1, NA)
) %>% 
  mutate(
    age_group = factor(age_group, labels = c("Younger than 30", "30 and Older")),
    gender    = factor(gender, labels = c("Female", "Male")),
    bmi_3cat  = factor(bmi_3cat, labels = c("Normal", "Overweight", "Obese"))
  ) %>% 
  print()
## # A tibble: 68 × 7
##      age age_group       gender ht_in wt_lbs   bmi bmi_3cat  
##    <dbl> <fct>           <fct>  <dbl>  <dbl> <dbl> <fct>     
##  1    32 30 and Older    Male      70    216  31.0 Obese     
##  2    30 30 and Older    Female    63    106  18.8 Normal    
##  3    32 30 and Older    Female    62    145  26.5 Overweight
##  4    29 Younger than 30 Male      67    195  30.5 Obese     
##  5    24 Younger than 30 Female    67    143  22.4 Normal    
##  6    38 30 and Older    Female    58    125  26.1 Overweight
##  7    25 Younger than 30 Female    64    138  23.7 Normal    
##  8    24 Younger than 30 Male      69    140  20.7 Normal    
##  9    48 30 and Older    Male      65    158  26.3 Overweight
## 10    29 Younger than 30 Male      68    167  25.4 Overweight
## # ℹ 58 more rows

And wrote our own function to calculate the number of missing values, mean, median, min, and max for all of the continuous variables:

continuous_stats <- function(var) {
  study %>% 
    summarise(
      n_miss = sum(is.na({{ var }})),
      mean   = mean({{ var }}, na.rm = TRUE),
      median = median({{ var }}, na.rm = TRUE),
      min    = min({{ var }}, na.rm = TRUE),
      max    = max({{ var }}, na.rm = TRUE)
    )
}

We then used that function to calculate our statistics of interest for each continuous variable:

continuous_stats(age)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      1  26.9     26    22    48
continuous_stats(ht_in)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      3  66.0     66    58    76
continuous_stats(wt_lbs)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      2  148.   142.    60   297
continuous_stats(bmi)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      4  23.6   22.9  10.6  45.2

This is definitely an improvement over all the copying and pasting we were doing before we wrote our own function. However, there is still some unnecessary repetition above. One way we can remove this repetition is to use across() like this:

summary_stats <- study %>% 
  summarise(
    across(
      .cols = c(age, ht_in, wt_lbs, bmi),
      .fns  = list(
        n_miss = ~ sum(is.na(.x)),
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        min    = ~ min(.x, na.rm = TRUE),
        max    = ~ max(.x, na.rm = TRUE)
      )
    ) 
  ) %>% 
  print()
## # A tibble: 1 × 20
##   age_n_miss age_mean age_median age_min age_max ht_in_n_miss ht_in_mean ht_in_median ht_in_min ht_in_max wt_lbs_n_miss wt_lbs_mean wt_lbs_median wt_lbs_min
##        <int>    <dbl>      <dbl>   <dbl>   <dbl>        <int>      <dbl>        <dbl>     <dbl>     <dbl>         <int>       <dbl>         <dbl>      <dbl>
## 1          1     26.9         26      22      48            3       66.0           66        58        76             2        148.          142.         60
## # ℹ 6 more variables: wt_lbs_max <dbl>, bmi_n_miss <int>, bmi_mean <dbl>, bmi_median <dbl>, bmi_min <dbl>, bmi_max <dbl>

This method works, but it has the same problem that our symptom summaries had above. Our results are hard to read and interpret because they are arranged horizontally. We can once again pivot this data longer, but it won’t be quite as easy as it was before. Our first attempt might look like this:

summary_stats %>% 
  tidyr::pivot_longer(
    cols      = everything(),
    names_to  = c("characteristic", ".value"),
    names_sep = "_"
  )
## Warning: Expected 2 pieces. Additional pieces discarded in 12 rows [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
## # A tibble: 12 × 8
##    characteristic     n  mean median   min   max  `in`   lbs
##    <chr>          <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 age                1  26.9   26    22    48    NA     NA 
##  2 ht                NA  NA     NA    NA    NA     3     NA 
##  3 ht                NA  NA     NA    NA    NA    66.0   NA 
##  4 ht                NA  NA     NA    NA    NA    66     NA 
##  5 ht                NA  NA     NA    NA    NA    58     NA 
##  6 ht                NA  NA     NA    NA    NA    76     NA 
##  7 wt                NA  NA     NA    NA    NA    NA      2 
##  8 wt                NA  NA     NA    NA    NA    NA    148.
##  9 wt                NA  NA     NA    NA    NA    NA    142.
## 10 wt                NA  NA     NA    NA    NA    NA     60 
## 11 wt                NA  NA     NA    NA    NA    NA    297 
## 12 bmi                4  23.6   22.9  10.6  45.2  NA     NA

What do you think the problem is here?

Well, we passed an underscore to the names_sep argument. This tells pivot_longer() that that character string on the left side of the underscore should make up the values of the new characteristic column and each unique character string on the right side of the underscore should be used to create a new column name. In the symptoms data, this worked fine because all of the column names followed this pattern (e.g., pain_count and pain_prop). But, do the column names in summary_stats always follow this pattern? What about age_n_miss and ht_in_n_miss? All the extra underscores in the column names makes this pattern ineffective.

There are probably many ways we could address this problem. We think the most straightforward way is probably to go back to the code we used to create summary_stats and use the .names argument to separate the column name and statistic name with a character other than an underscore. Maybe a hyphen instead:

summary_stats <- study %>% 
  summarise(
    across(
      .cols  = c(age, ht_in, wt_lbs, bmi),
      .fns   = list(
        n_miss = ~ sum(is.na(.x)),
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        min    = ~ min(.x, na.rm = TRUE),
        max    = ~ max(.x, na.rm = TRUE)
      ),
      .names = "{col}-{fn}" # This is the new part of the code
    ) 
  ) %>% 
  print()
## # A tibble: 1 × 20
##   `age-n_miss` `age-mean` `age-median` `age-min` `age-max` `ht_in-n_miss` `ht_in-mean` `ht_in-median` `ht_in-min` `ht_in-max` `wt_lbs-n_miss` `wt_lbs-mean`
##          <int>      <dbl>        <dbl>     <dbl>     <dbl>          <int>        <dbl>          <dbl>       <dbl>       <dbl>           <int>         <dbl>
## 1            1       26.9           26        22        48              3         66.0             66          58          76               2          148.
## # ℹ 8 more variables: `wt_lbs-median` <dbl>, `wt_lbs-min` <dbl>, `wt_lbs-max` <dbl>, `bmi-n_miss` <int>, `bmi-mean` <dbl>, `bmi-median` <dbl>,
## #   `bmi-min` <dbl>, `bmi-max` <dbl>

Now, we can simply pass a hyphen to the names_sep argument to pivot_longer():

summary_stats %>% 
  tidyr::pivot_longer(
    cols      = everything(),
    names_to  = c("characteristic", ".value"),
    names_sep = "-"
  )
## # A tibble: 4 × 6
##   characteristic n_miss  mean median   min   max
##   <chr>           <int> <dbl>  <dbl> <dbl> <dbl>
## 1 age                 1  26.9   26    22    48  
## 2 ht_in               3  66.0   66    58    76  
## 3 wt_lbs              2 148.   142.   60   297  
## 4 bmi                 4  23.6   22.9  10.6  45.2

Look at how much easier those results are to read!

rm(study, summary_stats, continuous_stats)

35.4 Across with filter

We’ve already discussed complete case analyses multiple times in this book. That is, including only the rows from our data frame that don’t have any missing values in our analysis. Additionally, we’ve already seen how we can use the filter() function to remove the rows of a single column where the data are missing. For example:

df_xyz %>% 
  filter(!is.na(x))
## # A tibble: 9 × 4
##     row       x      y      z
##   <int>   <dbl>  <dbl>  <dbl>
## 1     1 -0.560   1.22  -1.07 
## 2     3  1.56    0.401 -1.03 
## 3     4  0.0705 NA     -0.729
## 4     5  0.129  -0.556 -0.625
## 5     6  1.72    1.79  NA    
## 6     7  0.461   0.498  0.838
## 7     8 -1.27   -1.97   0.153
## 8     9 -0.687   0.701 -1.14 
## 9    10 -0.446  -0.473  1.25

Notice that row 2 – the row that had a missing value for x – is no longer in the data frame, and we can now easily calculate the mean value of x.

df_xyz %>% 
  filter(!is.na(x)) %>% 
  summarise(mean = mean(x))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.108

However, we want to remove the rows that have a missing value in any column – not just x. We could get this result using multiple sequential filter() functions like this:

df_xyz %>% 
  filter(!is.na(x)) %>% 
  filter(!is.na(y)) %>% 
  filter(!is.na(z))
## # A tibble: 7 × 4
##     row      x      y      z
##   <int>  <dbl>  <dbl>  <dbl>
## 1     1 -0.560  1.22  -1.07 
## 2     3  1.56   0.401 -1.03 
## 3     5  0.129 -0.556 -0.625
## 4     7  0.461  0.498  0.838
## 5     8 -1.27  -1.97   0.153
## 6     9 -0.687  0.701 -1.14 
## 7    10 -0.446 -0.473  1.25

As you can see, rows 2, 4, and 6 – the rows with a missing value for x, y, and z – were dropped.

🚩Of course, in the code chunk above, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code.

At this point in the book, our first thought might be to use the across() function, inside the filter() function, to remove all of the rows rows with missing values from our data frame. However, as of dplyr version 1.0.4, using the across() function inside of filter() is deprecated. That means we shouldn’t use it anymore. Instead, we should use the if_any() or if_all() functions, which take the exact same arguments as across(). In the code chunk below, we will show you how to solve this problem, then we will dissect the solution below.

df_xyz %>% 
  filter(
    if_all(
      .cols = c(x:z),
      .fns  = ~ !is.na(.x)
    )
  )
## # A tibble: 7 × 4
##     row      x      y      z
##   <int>  <dbl>  <dbl>  <dbl>
## 1     1 -0.560  1.22  -1.07 
## 2     3  1.56   0.401 -1.03 
## 3     5  0.129 -0.556 -0.625
## 4     7  0.461  0.498  0.838
## 5     8 -1.27  -1.97   0.153
## 6     9 -0.687  0.701 -1.14 
## 7    10 -0.446 -0.473  1.25

👆Here’s what we did above:

  • You can type ?dplyr::if_any or ?dplyr::if_all into your R console to view the help documentation for this function and follow along with the explanation below.

  • We used the if_all() function inside of the filter() function to keep only the rows in our data frame that had nonmissing values for all of the columns x, y, and z.

  • We passed the value c(x:z) to the .cols argument. This told R to apply the function passed to the .fns argument to the columns x through z inclusive.

  • We used a purrr-style lambda to test whether or not each value of each of the columns passed to .cols is NOT missing.

  • Remember, the special .x symbol is just shorthand for each column passed to the .cols argument.

So, how does this work? Well, first let’s remember that the is.na() function returns TRUE when the value of the vector passed to it is missing and FALSE when it is not missing. For example:

is.na(df_xyz$x)
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

We can then use the ! operator to “flip” those results. In other words, to return TRUE when the value of the vector passed to it is not missing and FALSE when it is missing. For example:

!is.na(df_xyz$x)
##  [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

The filter() function then returns the rows from the data frame where the values returned by !is.na() are TRUE and drops the rows where they are FALSE. For example, we can copy and paste the TRUE/FALSE values above to keep only the rows with nonmissing values for x:

df_xyz %>% 
  filter(c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))
## # A tibble: 9 × 4
##     row       x      y      z
##   <int>   <dbl>  <dbl>  <dbl>
## 1     1 -0.560   1.22  -1.07 
## 2     3  1.56    0.401 -1.03 
## 3     4  0.0705 NA     -0.729
## 4     5  0.129  -0.556 -0.625
## 5     6  1.72    1.79  NA    
## 6     7  0.461   0.498  0.838
## 7     8 -1.27   -1.97   0.153
## 8     9 -0.687   0.701 -1.14 
## 9    10 -0.446  -0.473  1.25

Now, let’s repeat this process for the columns y and z as well.

!is.na(df_xyz$y)
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
!is.na(df_xyz$z)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

Next, let’s stack these results next to each other to make them even easier to view.

not_missing <- tibble(
  row = 1:10,
  x   = !is.na(df_xyz$x),
  y   = !is.na(df_xyz$y),
  z   = !is.na(df_xyz$z)
) %>% 
  print()
## # A tibble: 10 × 4
##      row x     y     z    
##    <int> <lgl> <lgl> <lgl>
##  1     1 TRUE  TRUE  TRUE 
##  2     2 FALSE TRUE  TRUE 
##  3     3 TRUE  TRUE  TRUE 
##  4     4 TRUE  FALSE TRUE 
##  5     5 TRUE  TRUE  TRUE 
##  6     6 TRUE  TRUE  FALSE
##  7     7 TRUE  TRUE  TRUE 
##  8     8 TRUE  TRUE  TRUE 
##  9     9 TRUE  TRUE  TRUE 
## 10    10 TRUE  TRUE  TRUE

👆Here’s what we did above:

  • We created a data frame that contains the value TRUE in each position where df_xyz has a nonmissing value and FALSE in each position where df_xyz has a missing value. We wouldn’t typically create this for our data analysis. We just created it here for teaching purposes.

You can think of the data frame of TRUE and FALSE values above as an intermediate product that if_any() and if_all() uses “under the hood” to decide which rows to keep. We think using this data frame as a conceptual model makes it a little easier to understand how if_any() and if_all() differ.

if_any() will keep the rows where any value of x, y, or z are TRUE. In this case, there is at least one TRUE value in every row. Therefore, we would expect if_any() to return all rows in our data frame. And, that’s exactly what happens.

df_xyz %>% 
  filter(
    if_any(
      .cols = c(x:z),
      .fns  = ~ !is.na(.x)
    )
  )
## # A tibble: 10 × 4
##      row       x      y      z
##    <int>   <dbl>  <dbl>  <dbl>
##  1     1 -0.560   1.22  -1.07 
##  2     2 NA       0.360 -0.218
##  3     3  1.56    0.401 -1.03 
##  4     4  0.0705 NA     -0.729
##  5     5  0.129  -0.556 -0.625
##  6     6  1.72    1.79  NA    
##  7     7  0.461   0.498  0.838
##  8     8 -1.27   -1.97   0.153
##  9     9 -0.687   0.701 -1.14 
## 10    10 -0.446  -0.473  1.25

On the other hand, if_all() will the keep the rows where all value of x, y, and z are TRUE. In this case, there is at least one FALSE value in rows 2, 4, and 6. Therefore, we would expect if_all() to return all rows in our data frame except rows 2, 4, and 6. That’s exactly what happens, and it’s exaclty the result we want.

df_xyz %>% 
  filter(
    if_all(
      .cols = c(x:z),
      .fns  = ~ !is.na(.x)
    )
  )
## # A tibble: 7 × 4
##     row      x      y      z
##   <int>  <dbl>  <dbl>  <dbl>
## 1     1 -0.560  1.22  -1.07 
## 2     3  1.56   0.401 -1.03 
## 3     5  0.129 -0.556 -0.625
## 4     7  0.461  0.498  0.838
## 5     8 -1.27  -1.97   0.153
## 6     9 -0.687  0.701 -1.14 
## 7    10 -0.446 -0.473  1.25

Because this is a small, simple example, using if_all() doesn’t actually reduce the number of lines of code we wrote. But again, try to imagine if we added 20 additional columns to our data frame. We would only need to update the value we pass to the .cols argument. This makes our code more concise, easier to maintain, and less error-prone.

35.5 Summary

We are big fans of using across(), if_any(), and if_all() in conjunction with the dplyr verbs. They allows us to remove a lot of the unnecessary repetition from our code in a way that integrates pretty seamlessly with the tools we are already using. Perhaps you will see value in using these functions as well. In the next chapter, we will learn about using for loops to remove unnecessary repetition from our code.