7 Column Names: Prepping Many for Merge

7.1 ⭐️Overview

The inspiration for this chapter came about when I was trying to merge the QDS visit 1 through visit 5 files that were part of the Link2Care project. All of the QDS files (plus some other files) needed to be merged together into a final analysis data set. However, not all questions were asked at all visits and column names were inconsistently used across visits. Even when the names were almost the same, a visit number was attached to the name (e.g., TODAY_V1, TODAY_V2, TODAY_V3, etc.). Finally, each data set contained 800+ columns, which made renaming them all manually tedious and error prone. So, I’m trying to come up with a better way.

One method I came up with was reading in the variable names from the Word codebooks and then modifying them. Using this method required making some custom styles in Word and applying those styles to the column names. This seemed to work pretty well at first, but then Joe and James made some changes to QDS, which rendered the codebooks I’d made all the changes to obsolete. This left me with two options:

Have some of the variable name adjustments pulled from the codebook and some of the variable name adjustments done manually in the R code (e.g., T26, visit type).
Keep all of the adjustments in the codebook, which means converting the new codebooks to word documents, and going back through them to format all of the old column names again.

Neither of these options seems very robust or future-proof. While it might be a lot of work upfront, I’m starting to think that just doing all of the variable renaming in the R code might be the way to go. Below, I’m going to play around with how to automate as much of this process as possible.

7.2 🌎Useful websites

y## 📦Load packages

library(dplyr)
library(stringr)
library(testthat)
library(tidyr)
library(purrr)
library(googlesheets4)

7.3 🔢Simulate data

Simulate 3 QDS data sets. Don’t worry about the values much. They really aren’t the issue for this example.

SUBJECT: No change in column name across visits.
TODAY_V and VISIT_V: The _V1, _V2, _V3 part changes.
SQ_2: Only exists in v1, and the name is altered.
GROUP: Only exists in v3.
WEIGHT and WEIGHT3: Doesn’t exist in v2 and doesn’t follow any naming convention.
DEMO16G and DEM14GV3: The variable was named incorrectly in v1.
DS10A_V2: Only asked in v2 and doesn’t follow typical naming convention.
DEM15V1: Only asked in v1 and I want to change the name completely.
T26A_V1: Only exists in v1 and v3. Also, it needs to be renamed.
T26A_V1A and T26A_V1B: Only exists in v1 and v3. Needs to be renamed, and it requires using the dummy variable naming convention.
T31_V1B, T31_V3B, T31B_V1, and T31B_V3: Added these later. In v1, v3, v4, and v5 there is a column named T31B_V1 and a column named T31_V1B. The standardize_col_names() function converts both of these to t_31b, which causes rename_with() to throw an error. Further, I can’t just manually change it to t33 (what it will ultimately end up being named) before I run standardize_col_names() because then in conflicts with T33_V1. I need to find a solution inside of the standardize_col_names() function.

Additionally, we want to convert all column names to lower case and follow these naming conventions:

The typical way that variables are named in these data sets is:
- [optionally:dummy_variable_lettered]
  - For example: BPM1_V1, LSN3_V1
- _
  [optionally:dummy_variable_lettered]
  - For example: Screening Question 2 is named: SQ_2
  - OR: The dummy variable for the “EMAIL” Response to Screening Question 18 is: SQ_18A
- Many questions also have the visit number embedded in the name.
  - For example: Demographics question 1 is: DEM1V1.
- However, there are some exceptions to this rule - intentional or not.
  - For example, the fifth MMSE question is: MMS4V
In order to improve consistency in the variable names, we will adopt the following convention:
- _
  [optionally:dummy_variable_lettered]
  - dummy variables will be numbered a-z
  - For example, sq_12, t_14a
- We will drop visit number from the variable name an use an explicit visit variable.
- Exceptions: Demographic variables that almost anyone should be reasonably able to understand without the codebook will be given intuitive, descriptive names (e.g., age, race, gender, marital_status, etc.).
- I’m also going to change “SUBJECT” to “id”.

v1 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V1 = as.Date(c("2018-04-07", "2018-05-03")),
  VISIT_V1 = 1,
  SQ_2     = c(1, 0),
  WEIGHT   = c(179, 194),
  DEMO16G  = c(0, 1),
  DEM15V1  = c(1, 0),
  T26A_V1  = c(2, 4),
  T26A_V1A = c(0, 0),
  T26A_V1B = c(1, 0),
  T31_V1B  = c(0, 0), 
  T31B_V1  = c(0, 0)
)

v2 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V2 = as.Date(c("2018-04-30", "2018-05-07")),
  VISIT_V2 = 2
)

v3 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V3 = as.Date(c("2018-05-31", "2018-06-08")),
  VISIT_V3 = 3,
  GROUP    = c(1, 2),
  WEIGHT3  = c(179, 194),
  DEM14GV3 = c(2, 3),
  T26A_V3  = c(2, 4),
  T26A_V3A = c(0, 0),
  T26A_V3B = c(1, 0),
  T31_V3B  = c(0, 0), 
  T31B_V3  = c(0, 0)
)

7.4 Get the existing column names

v1_old_names <- names(v1)
v1_old_names

##  [1] "SUBJECT"  "TODAY_V1" "VISIT_V1" "SQ_2"     "WEIGHT"   "DEMO16G" 
##  [7] "DEM15V1"  "T26A_V1"  "T26A_V1A" "T26A_V1B" "T31_V1B"  "T31B_V1"

7.5 Standardize new column names

Remove _v{1, 2, 3} and v{1, 2, 3} from column names
Replace spaces with underscores
Convert to lower case
Add underscore in-between the abbreviated tool name and question number

This won’t be perfect, but it will drastically reduce the number of manual changes we have to make.

# Remove _v1 from column name
# Remove v1 at end of column name (e.g., DEM1v3)
v1_new_names <- stringr::str_replace(v1_old_names, "_V1|V1", "")
# Replace spaces with underscores
v1_new_names <- stringr::str_replace_all(v1_new_names, " ", "_")
# Convert to lower case
v1_new_names <- stringr::str_to_lower(v1_new_names)
# Add underscore in-between the abbreviated tool name and question number
v1_new_names <- stringr::str_replace(v1_new_names, "([a-z])(\\d)", "\\1_\\2")
v1_new_names

##  [1] "subject"  "today"    "visit"    "sq_2"     "weight"   "demo_16g"
##  [7] "dem_15"   "t_26a"    "t_26aa"   "t_26ab"   "t_31b"    "t_31b"

7.6 Turn this into a function

standardize_col_names <- function(.data) {
  # Get the existing column names
  old_names <- names(.data)
  # Create a string that contains _V1 through _V5 and V1 through V5 to pass to
  # the regular expression below
  v_num <- paste0("_V", 1:5, collapse = "|")
  vnum <- paste0("V", 1:5, collapse = "|")
  v_pattern <- paste(paste(v_num, vnum, sep = "|"), collapse = "|")
  # Remove _v{1, 2, 3} from column name
  # Remove v{1, 2, 3} at end of column name (e.g., DEM1v3)
  new_names <- stringr::str_replace(old_names, v_pattern, "")
  # Replace spaces with underscores
  new_names <- stringr::str_replace_all(new_names, " ", "_")
  # Convert to lower case
  new_names <- stringr::str_to_lower(new_names)
  # Add underscore in-between the abbreviated tool name and question number
  new_names <- stringr::str_replace(new_names, "([a-z])(\\d)", "\\1_\\2")
  # Return character vector of standardized column names
  new_names
}

# For testing
# standardize_col_names(v1)

7.6.1 Make a unit test for the function

test_that("standardize_col_names produces the expected column names.", {
  v1_test <- tibble(
    SUBJECT  = c(2001, 2002),
    TODAY_V1 = as.Date(c("2018-04-07", "2018-05-03")),
    VISIT_V1 = 1,
    SQ_2     = c(1, 0),
    WEIGHT   = c(179, 194),
    DEMO16G  = c(0, 1),
    DEM15V1  = c(1, 0),
    T26A_V1  = c(2, 4),
    T26A_V1A = c(0, 0),
    T26A_V1B = c(1, 0),
    TEST_V8 = 0
  )
  new_names <- standardize_col_names(v1_test)
  expceted_new_names <- c(
    "subject", "today", "visit", "sq_2", "weight", "demo_16g", "dem_15", 
    "t_26a", "t_26aa", "t_26ab", "test_v_8"
  )
  expect_equal(new_names, expceted_new_names)
})

## Test passed 🥇

new_names <- standardize_col_names(v1)

7.7 Manual changes to column names

The step above used to standardize column names fixes most of them, but some manual changes to variable names still need to be made.

new_names[new_names == "sq_2"] <- "sq_hispanic"
# demo_16g was incorrectly numbered when the data was created
new_names[new_names == "demo_16g"] <- "dem_14g"
new_names[new_names == "dem_15"] <- "dem_veteran"
# Take the extra "a" out of all the T26 col names
new_names <- str_replace(new_names, "t_26a", "t_26")
# Change the second t_31b to t_33
new_names[new_names == "t_31b"][2] <- "t_33"

7.8 Have R write out the rename code

So that you don’t have to type it out by hand. Ideally, you only have to run this code once, copy and paste into the rename code chunk below, and then just make small updates to the rename code chunk if changes are made to QDS.

# If needed, run the code below. Copy and paste the output into the rename() 
# code chunk. Make manual changes as needed.
new_old_name_pairs <- paste(new_names, names(v1), sep = " = ")
new_old_name_pairs <- paste(new_old_name_pairs, collapse = ", \n")
cat(new_old_name_pairs)

## subject = SUBJECT, 
## today = TODAY_V1, 
## visit = VISIT_V1, 
## sq_hispanic = SQ_2, 
## weight = WEIGHT, 
## dem_14g = DEMO16G, 
## dem_veteran = DEM15V1, 
## t_26 = T26A_V1, 
## t_26a = T26A_V1A, 
## t_26b = T26A_V1B, 
## t_31b = T31_V1B, 
## t_33 = T31B_V1

7.8.1 Make it a function

write_rename <- function(.data, .new_names) {
  new_old_name_pairs <- paste(.new_names, names(.data), sep = " = ")
  new_old_name_pairs <- paste(new_old_name_pairs, collapse = ", \n")
  cat(new_old_name_pairs)
}

# For testing
write_rename(v1, new_names)

## subject = SUBJECT, 
## today = TODAY_V1, 
## visit = VISIT_V1, 
## sq_hispanic = SQ_2, 
## weight = WEIGHT, 
## dem_14g = DEMO16G, 
## dem_veteran = DEM15V1, 
## t_26 = T26A_V1, 
## t_26a = T26A_V1A, 
## t_26b = T26A_V1B, 
## t_31b = T31_V1B, 
## t_33 = T31B_V1

7.9 Rename the columns

v1 <- v1 %>% 
  rename(
    subject = SUBJECT, 
    today = TODAY_V1, 
    visit = VISIT_V1, 
    sq_hispanic = SQ_2, 
    weight = WEIGHT, 
    dem_14g = DEMO16G, 
    dem_veteran = DEM15V1, 
    t_26 = T26A_V1, 
    t_26a = T26A_V1A, 
    t_26b = T26A_V1B, 
    t_31b = T31_V1B, 
    t_33 = T31B_V1
  )

7.10 Repeat the process for V2

new_names <- standardize_col_names(v2)
write_rename(v2, new_names)

## subject = SUBJECT, 
## today = TODAY_V2, 
## visit = VISIT_V2

v2 <- v2 %>% 
  rename(
    subject = SUBJECT, 
    today = TODAY_V2, 
    visit = VISIT_V2
  )

7.11 Repeat the process for V3

new_names <- standardize_col_names(v3)

new_names[new_names == "weight_3"] <- "weight"
new_names[new_names == "dem_15"] <- "dem_veteran"
# Take the extra "a" out of all the T26 col names
new_names <- str_replace(new_names, "t_26a", "t_26")
# Change the second t_31b to t_33
new_names[new_names == "t_31b"][2] <- "t_33"

write_rename(v3, new_names)

## subject = SUBJECT, 
## today = TODAY_V3, 
## visit = VISIT_V3, 
## group = GROUP, 
## weight = WEIGHT3, 
## dem_14g = DEM14GV3, 
## t_26 = T26A_V3, 
## t_26a = T26A_V3A, 
## t_26b = T26A_V3B, 
## t_31b = T31_V3B, 
## t_33 = T31B_V3

v3 <- v3 %>% 
  rename(
    subject = SUBJECT, 
    today = TODAY_V3, 
    visit = VISIT_V3, 
    group = GROUP, 
    weight = WEIGHT3, 
    dem_14g = DEM14GV3, 
    t_26 = T26A_V3, 
    t_26a = T26A_V3A, 
    t_26b = T26A_V3B, 
    t_31b = T31_V3B, 
    t_33 = T31B_V3
  )

7.12 Merge all 3 visits

Merge all 3 visits into a single data frame

qds_all_visits <- v1 %>% 
  bind_rows(v2) %>% 
  bind_rows(v3) %>% 
  arrange(subject, visit)

qds_all_visits

## # A tibble: 6 × 13
##   subject today      visit sq_hispanic weight dem_14g dem_veteran  t_26 t_26a
##     <dbl> <date>     <dbl>       <dbl>  <dbl>   <dbl>       <dbl> <dbl> <dbl>
## 1    2001 2018-04-07     1           1    179       0           1     2     0
## 2    2001 2018-04-30     2          NA     NA      NA          NA    NA    NA
## 3    2001 2018-05-31     3          NA    179       2          NA     2     0
## 4    2002 2018-05-03     1           0    194       1           0     4     0
## 5    2002 2018-05-07     2          NA     NA      NA          NA    NA    NA
## 6    2002 2018-06-08     3          NA    194       3          NA     4     0
## # ℹ 4 more variables: t_26b <dbl>, t_31b <dbl>, t_33 <dbl>, group <dbl>

7.13 Final data cleaning

Make a couple more column name changes
Carry forward time-invariant variables

qds_all_visits %>% 
  rename(id = subject, date_visit = today) %>% 
  # Carry forward time-invariant variables
  group_by(id) %>% 
  fill(sq_hispanic, dem_veteran, group, .direction = "downup") %>% 
  ungroup %>% 
  DT::datatable(options = list(scrollX = TRUE))

7.14 🟣The dplyr way

Try getting the same result as above using dplyr (rename_with) functions instead of creating and manipulating separate column name vectors.

# Clean up environment first
rm(list = ls())

v1 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V1 = as.Date(c("2018-04-07", "2018-05-03")),
  VISIT_V1 = 1,
  SQ_2     = c(1, 0),
  WEIGHT   = c(179, 194),
  DEMO16G  = c(0, 1),
  DEM15V1  = c(1, 0),
  T26A_V1  = c(2, 4),
  T26A_V1A = c(0, 0),
  T26A_V1B = c(1, 0),
  T31_V1B  = c(0, 0), 
  T31B_V1  = c(0, 0),
  # Making these up to test standardize_col_names when more than one duplicate 
  # column name would be created.
  T32_V1B  = c(1, 1),
  T32B_V1  = c(1, 1)
)

v2 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V2 = as.Date(c("2018-04-30", "2018-05-07")),
  VISIT_V2 = 2
)

v3 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V3 = as.Date(c("2018-05-31", "2018-06-08")),
  VISIT_V3 = 3,
  GROUP    = c(1, 2),
  WEIGHT3  = c(179, 194),
  DEM14GV3 = c(2, 3),
  T26A_V3  = c(2, 4),
  T26A_V3A = c(0, 0),
  T26A_V3B = c(1, 0),
  T31_V3B  = c(0, 0), 
  T31B_V3  = c(0, 0),
  # Making these up to test standardize_col_names when more than one duplicate 
  T32_V3B  = c(1, 1), 
  T32B_V3  = c(1, 1)
)

7.14.1 Standardize new column names

Remove _v{1, 2, 3} and v{1, 2, 3} from column names
Replace spaces with underscores
Convert to lower case
Add underscore in-between the abbreviated tool name and question number

This won’t be perfect, but it will drastically reduce the number of manual changes we have to make.

standardize_col_names <- function(.data) {
  # Get the existing column names
  # When used inside of rename_with(), the column names are automatically
  # passed into the function with/as .data. There is no need for names(.data).
  old_names <- .data
  # Create a string that contains _V1 through _V5 and V1 through V5 to pass to
  # the regular expression below
  v_num <- paste0("_V", 1:5, collapse = "|")
  vnum <- paste0("V", 1:5, collapse = "|")
  v_pattern <- paste(paste(v_num, vnum, sep = "|"), collapse = "|")
  # Remove _v{1, 2, 3} from column name
  # Remove v{1, 2, 3} at end of column name (e.g., DEM1v3)
  new_names <- stringr::str_replace(old_names, v_pattern, "")
  # Replace spaces with underscores
  new_names <- stringr::str_replace_all(new_names, " ", "_")
  # Convert to lower case
  new_names <- stringr::str_to_lower(new_names)
  # Add underscore in-between the abbreviated tool name and question number
  new_names <- stringr::str_replace(new_names, "([a-z])(\\d)", "\\1_\\2")
  # Check for duplicate names
  if (!length(new_names) == length(unique(new_names))) {
    name_counts <- table(new_names)
    dup_names <- name_counts[name_counts > 1]
    dup_names <- names(dup_names)
    # Set the second instance of the duplicate name back to its original name
    for (i in seq_along(dup_names)) {
      index <- which(new_names == dup_names[[i]])
      new_names[index][2] <- old_names[index][2]
      # Print a message letting the user know what happened
      warning(
        "Warning: ", old_names[index][2], 
        " Could not be standardized because it would have created a duplicate column name. \n"
      )
    }
  }
  # Return character vector of standardized column names
  new_names
}

# For testing
# standardize_col_names(names(v1))
# rename_with(v1, standardize_col_names)

Standardize all 3 data frames in one step

purrr::walk(
  .x = paste0("v", 1:3),
  .f = function(df_name) {
    df <- get(df_name, envir = .GlobalEnv)
    df <- rename_with(df, standardize_col_names)
    assign(df_name, df, envir = .GlobalEnv)
  }
)

## Warning in .fn(sel, ...): Warning: T31B_V1 Could not be standardized because it would have created a duplicate column name.

## Warning in .fn(sel, ...): Warning: T32B_V1 Could not be standardized because it would have created a duplicate column name.

## Warning in .fn(sel, ...): Warning: T31B_V3 Could not be standardized because it would have created a duplicate column name.

## Warning in .fn(sel, ...): Warning: T32B_V3 Could not be standardized because it would have created a duplicate column name.

7.14.2 Manual changes to column names 1

The step above used to standardize column names fixes most of them, but some manual changes to variable names still need to be made.

First, only make the changes to the individual data frames that will affect the merge (e.g., weight and weight_3).

After the merge, we will make changes to column names that need to be changed in multiple data frames (e.g., t_26a). By doing this after the merge, it reduces code repetition and reduces the risk of typos in the column names.

# demo_16g was incorrectly numbered when the data was created
v1 <- rename(v1, dem_14g = demo_16g)

# weight_3 was missing a "v" when it was created in the data frame
v3 <- rename(v3, weight = weight_3)

7.14.3 Merge all 3 visits

Merge all 3 visits into a single data frame

qds_all_visits <- v1 %>% 
  bind_rows(v2) %>% 
  bind_rows(v3) %>% 
  arrange(subject, visit)

DT::datatable(qds_all_visits, options = list(scrollX = TRUE))

7.14.4 Manual changes to column names 2

We already made changes to the individual data frames that would have affected the merge (e.g., weight and weight_3).

Now, we will make changes to column names that need to be changed in multiple data frames (e.g., t_26a).

qds_all_visits <- qds_all_visits %>%
  rename(
    id = subject, 
    date_visit = today,
    sq_hispanic = sq_2,
    dem_veteran = dem_15
  ) %>% 
  # Take the extra "a" out of all the T26 col names
  rename_with(
    ~ str_replace(.x, "t_26a", "t_26")
  ) %>% 
  # Combine T31B_V1 and T31B_V3 as t_33
  # Combine T32B_V1 and T32B_V3 as t_34
  mutate(
    t_33 = if_else(is.na(T31B_V1), T31B_V3, T31B_V1),
    t_34 = if_else(is.na(T32B_V1), T32B_V3, T32B_V1)
  ) %>% 
  select(-T31B_V1, -T31B_V3, -T32B_V1, -T32B_V3)

DT::datatable(qds_all_visits, options = list(scrollX = TRUE))

7.14.5 Final data cleaning

Carry forward time-invariant variables

qds_all_visits %>% 
  # Carry forward time-invariant variables
  group_by(id) %>% 
  fill(sq_hispanic, dem_veteran, group, .direction = "downup") %>% 
  ungroup %>% 
  DT::datatable(options = list(scrollX = TRUE))

7.15 🟢 Making a key in Google Sheets

Why might I want to do this in the first place?

Pros:

Everyone can use Excel. Not everyone is comfortable using RStudio. So, people other than me may be intimidated if they need to look up the changes to the column names by opening an Rmd file in RStudio.
If I need to look up changes to column names, it may be faster to do so in Excel.

Cons:

Creating the Excel sheet may be an additional layer of work.
This process creates additional documents (i.e., other than the Rmd files) to keep track of and update. However, if ALL of the updates to column names go in this spreadsheet, then it should really be the only one that requires tracking and updating.
Hard to use versioning for Excel.

Why use Google Sheets instead of Excel?

It has all of the pros above, plus versioning. Also, Google Sheets makes it easy to use regular expressions, which I’ve found useful in the past.

# Clean up environment first
rm(list = ls())

v1 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V1 = as.Date(c("2018-04-07", "2018-05-03")),
  VISIT_V1 = 1,
  SQ_2     = c(1, 0),
  WEIGHT   = c(179, 194),
  DEMO16G  = c(0, 1),
  DEM15V1  = c(1, 0),
  T26A_V1  = c(2, 4),
  T26A_V1A = c(0, 0),
  T26A_V1B = c(1, 0),
  T31_V1B  = c(0, 0), 
  T31B_V1  = c(0, 0),
  # Making these up to test standardize_col_names when more than one duplicate 
  # column name would be created.
  T32_V1B  = c(1, 1),
  T32B_V1  = c(1, 1)
)

v2 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V2 = as.Date(c("2018-04-30", "2018-05-07")),
  VISIT_V2 = 2
)

v3 <- tibble(
  SUBJECT  = c(2001, 2002),
  TODAY_V3 = as.Date(c("2018-05-31", "2018-06-08")),
  VISIT_V3 = 3,
  GROUP    = c(1, 2),
  WEIGHT3  = c(179, 194),
  DEM14GV3 = c(2, 3),
  T26A_V3  = c(2, 4),
  T26A_V3A = c(0, 0),
  T26A_V3B = c(1, 0),
  T31_V3B  = c(0, 0), 
  T31B_V3  = c(0, 0),
  # Making these up to test standardize_col_names when more than one duplicate 
  T32_V3B  = c(1, 1), 
  T32B_V3  = c(1, 1)
)

7.15.1 Create the Google Sheet column name key

NOTE: You should only have to do this once. After you create the initial spreadsheet, all future changes to column names should occur directly in the spreadsheet.

Create a new Google sheet.
- Name it.
- Rename the first sheet “visit1”.
- Name column A old_name and column B new_name.
Create a character string of existing column names that you can copy and paste into the old_name column of the Google Sheet.

names(v1) %>% 
  paste(collapse = "\n") %>% 
  cat()

## SUBJECT
## TODAY_V1
## VISIT_V1
## SQ_2
## WEIGHT
## DEMO16G
## DEM15V1
## T26A_V1
## T26A_V1A
## T26A_V1B
## T31_V1B
## T31B_V1
## T32_V1B
## T32B_V1

You can directly type new names into the new_name column. Or, you can use regular expressions in steps similar to what the standardize_col_names() function we created above does:
- Move new_name to column E
- Name column B lower and type the following function into B2: =LOWER(A2). Paste the formula down.
- Name column C remove_v1 and type the following function into C2: =REGEXREPLACE(B2, "v1|_v1", ""). Paste the formula down.
- Name column D add_underscore and type the following function into D2: =REGEXREPLACE(C2,"([a-z])(\d)", "$1_$2"). Paste the formula down. Notice that Google Sheets uses $1 instead of \\1 for replacing with a capture group.
- Make final edits in the new_name column.

7.15.2 Import the Google Sheet column name key

The first time you attempt to read-in a Google Sheet you will need to authenticate your Google user name and password. Make sure to click the check box next to “Grant Tidyverse API Packages permission to See, edit, create, delete, spreadsheets in Google Drive”.

gs_col_name_key <- read_sheet(
  "https://docs.google.com/spreadsheets/d/1cnoOvVdblnKarVf_WQIyWtJDIOumI9VSP1WK2gTy0gU/edit?usp=sharing",
  sheet = "visit1"
)

7.15.3 Reorder and rename the current data frame

You might be tempted to just rename the columns by doing something like this:

names(v1) <- gs_col_name_key$new_name

That will work as long as there are the exact same number of columns in v1 as there are in are in gs_col_name_key$new_name, and the order of the columns in v1 perfectly matches the order of gs_col_name_key$new_name. A slightly more complex, but safer, way to do this is with rename(). Using new_name = old_name inside the rename function make it less likely to accidentally assign the wrong new_name to a column.

7.15.3.1 Create a named vector

Create a named vector of old names and new names that you can pass to select() or rename().

rename_vector <- gs_col_name_key$old_name
names(rename_vector) <- gs_col_name_key$new_name
rename_vector

7.15.3.2 Reorder and rename the current data frame

Reorder and rename the current data frame using rename_vector.

v1 %>% 
  select(!!! rename_vector) %>% 
  DT::datatable(options = list(scrollX = TRUE))

7.15.3.3 More columns in df than Sheet

What if there are columns in the data frame that haven’t yet been added to the Google Sheet?

v1 %>% 
  rowwise() %>% 
  mutate(T_Total = sum(c_across(T26A_V1:T32B_V1))) %>% 
  ungroup() %>% 
  rename(!!! rename_vector) %>% 
  DT::datatable(options = list(scrollX = TRUE))

It just leaves the extra column’s name unchanged, which is exactly what we want.

7.15.3.4 More columns in Sheet and df

What if there are columns in the Google Sheet that aren’t in the data frame?

rename_vector_2 <- c(rename_vector, "mms_1" = "MMS1")

v1 %>% 
  rename(!!! rename_vector_2) %>% 
  DT::datatable(options = list(scrollX = TRUE))

## Error: object 'rename_vector_2' not found

R gives us a warning that we can work out.