10 Coding Best Practices

At this point in the book, we’ve talked a little bit about what R is. We’ve also talked about the RStudio IDE and took a quick tour around its four main panes. Finally, we wrote our first little R program, which simulated and analyzed some data about a hypothetical class. Writing and executing this R program officially made you an R programmer. 🏆

However, you should know that not all R code is equally “good” – even when it’s equally valid. What do I mean by that? Well, we already discussed the R interpreter and R syntax in the chapter on speaking R’s language. Any code that uses R syntax that the R interpreter can understand is valid R code. But, is the R interpreter the only one reading your R code? No way! In epidemiology, we collaborate with others all the time! That collaboration is going to be much more efficient and enjoyable when there is good communication – including R code that is easy to read and understand. Further, you will often need to read and/or reuse code you wrote weeks, months, or years after you wrote it. You may be amazed at how quickly you forget what you did and/or why you did it that way. Therefore, in addition to writing valid R code, this chapter is about writing “good” R code – code that easily and efficiently communicates ideas to humans.

Of course, “good code” is inevitably somewhat subjective. Reasonable people can have a difference of opinion about the best way to write code that is easy to read and understand. Additionally, reasonable people can have a difference of opinion about when code is “good enough.” For these reasons, I’m going to offer several “suggestions” about writing good R code below, but only two general principles, which I believe most R programmers would agree with.

10.1 General principles

  1. Comment your code. Whether you intend to share your code with other people or not, make sure to write lots of comments about what you are trying to accomplish in each section of your code and why.

  2. Use a style consistently. I’m going to suggest several guidelines for styling your R code below, but you may find that you prefer to style your R code in a different way. Whether you adopt my suggested style or not, please find or create a style that works for you and your collaborators and use it consistently.

10.2 Code comments

There isn’t a lot of specific advice that I can give here because comments are so idiosyncratic to the task at hand. So, I think the best I can do at this point is to offer a few examples for you to think about.

10.2.1 Defining key variables

As we will discuss below, variables should have names that are concise, yet informative. However, the data you receive in the real world will not always include informative variable names. Even when someone has given the variables informative names, there may still be contextual information about the variables that is important to understand for data management and analysis. Some data sets will come with something called a codebook or data dictionary. These are text files that contain information about the data set that are intended to provide you with some of that more detailed information. For example, the survey questions that were used to capture the values in each variable or what category each value in a categorical variable represents. However, real data sets don’t always come with a data dictionary, and even when they do, it can be convenient to have some of that contextual information close at hand, right next to your code. Therefore, I will sometimes comment my code with information about variables that are important for the analysis at hand. Here is an example from an administrative data set I was using for an analysis:

* **Case number definition**

    - Case / investigation number.

* **Intake stage definition**

    - An ID number assigned to the Intake. Each Intake (Report) has its 
      own number. A case may have more than one intake. For example, case # 12345 
      has two intakes associated with it, 9 days apart, each with their own ID 
      number. Each of the two intakes associated with this case have multiple 
      allegations.

* **Intake start definition**

    - An intake is the submission or receipt of a report - a phone call or 
      web-based. The Intake Start Date refers to the date the staff member 
      opens a new record to begin recording the report.

10.2.2 What this code is trying to accomplish

Sometimes, it is obvious what a section of code literally does. but not so obvious why you’re doing it. I often try to write some comments around my code about what it’s trying to ultimately accomplish and why. For example:

## Standardize character strings

# Because we will merge this data with other data sets in the future based on 
# character strings (e.g., name), we need to go ahead and standardize their 
# formats here. This will prevent mismatches during the merges. Specifically, 
# we:

# 1. Transform all characters to lower case   
# 2. Remove any special characters (e.g., hyphens, periods)   
# 3. Remove trailing spaces (e.g., "John Smith ")   
# 4. Remove double spaces (e.g., "John  Smith")  

vars <- quos(full_name, first_name, middle_name, last_name, county, address, city)

client_data <- client_data %>% 
  mutate_at(vars(!!! vars), tolower) %>% 
  mutate_at(vars(!!! vars), stringr::str_replace_all, "[^a-zA-Z\\d\\s]", " ") %>%
  mutate_at(vars(!!! vars), stringr::str_replace, "[[:blank:]]$", "") %>% 
  mutate_at(vars(!!! vars), stringr::str_replace_all, "[[:blank:]]{2,}", " ")

rm(vars)

10.2.3 Why I chose this particular strategy

In addition to writing comments about why I did something, I sometimes write comments about why I did it instead of something else. Doing this can save you from having to relearn lessons you’ve already learned through trial and error but forgot. For example:

### Create exact match dummy variables

* We reshape the data from long to wide to create these variables because it significantly decreases computation time compared to doing this as a group_by operation on the long data. 

10.3 Style guidelines

UsInG c_o_n_s_i_s_t_e_n_t STYLE i.s. import-ant!

Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations… Good style is important because while your code only has one author, it’ll usually have multiple readers. This is especially true when you’re writing code with others. In that case, it’s a good idea to agree on a common style up-front. Since no style is strictly better than another, working with others may mean that you’ll need to sacrifice some preferred aspects of your style.5

Below, I outline the style that I and my collaborators typically use when writing R code for a research project. It generally follows the Tidyverse style guide, which I strongly suggest you read. Outside of my class, you don’t have to use my style, but you really should find or create a style that works for you and your collaborators and use it consistently.

10.3.1 Comments

Please put a space in between the pound/hash sign and the rest of your text when writing comments. For example, # here is my comment instead of #here is my comment. It just makes the comment easier to read.

10.3.2 Object (variable) names

In addition to the object naming guidance given in the Tidyverse style guide, I suggest the following object naming conventions.

10.3.3 Use names that are informative

Using names that are informative and easy to remember will make life easier for everyone who uses your data – including you!

# Uninformative names - Don't do this
x1
var1

# Informative names
employed
married
education

10.3.3.1 Use names that are concise

You want names to be informative, but you don’t want them to be overly verbose. Really long names create more work for you and more opportunities for typos. In fact, I recommend using a single word when you can.

# Write out entire name of the study the data comes from - Don't do this
womens_health_initiative

# Write out an acronym for the study the data comes from - assuming everyone 
# will be familiar with this acronym - Do this
whi

10.3.3.2 Use all lowercase letters

Remember, R is case-sensitive, which means that myStudyData and mystudydata are different things to R. Capitalizing letters in your file name just creates additional details to remember and potentially mess up. Just keep it simple and stick with lowercase letters.

# All upper case - so aggressive - Don't use
MYSTUDYDATA

# Camel case - Don't use
myStudyData

# All lowercase - Use
my_study_data

10.3.3.3 Separate multiple words with underscores.

Sometimes you really just need to use multiple words to name your object. In those cases, I suggested separating words with an underscore.

# Multiple words running together - Hard to read - Don't use
mycancerdata

# Camel case - easier to read, but more to remember and mess up - Don't use
myCancerData

# Separate with periods - easier to read, but doesn't translate well to many 
# other languages. For example, SAS won't accept variable names with 
# periods - Don't use
my.cancer.data

# Separate with underscores - Use
my_cancer_data

10.3.3.4 Prefix the names of similar variables

When you have multiple related variables, it’s good practice to start their variable names with the same word. It makes these related variables easier to find and work with in the future if we need to do something with all of them at once. We can sort our variable names alphabetically to easily find find them. Additionally, we can use variable selectors like starts_with("name") to perform some operation on all of them at once.

# Don't use
first_name
last_name
middle_name

# Use
name_first
name_last
name_middle

# Don't use
street
city
state

# Use
address_street
address_city
address_state

10.3.4 File Names

All the variable naming suggestons above also apply to file names. However, I make a few additional suggestions specific to file names below.

10.3.4.1 Managing multiple files in projects

When you are doing data management and analysis for real-world projects you will typically need to break the code up into multiple files. If you don’t, the code often becomes really difficult to read and manage. Having said that, finding the code you are looking for when there are 10, 20, or more separate files isn’t much fun either. Therefore, I suggest the following (or similar) file naming conventions be used in your projects.

  • Separate data cleaning and data analysis into separate files (typically, .R or .Rmd).

    • Data cleaning files should be prefixed with the word “data” and named as follows
      • data_[order number]_[purpose]
# Examples
data_01_import.Rmd
data_02_clean.Rmd
data_03_process_for_regression.Rmd
  • Analysis files that do not directly create a table or figure should be prefixed with the word “analysis” and named as follows
    • analysis_[order number]_[brief summary of content]
# Examples
analysis_01_exploratory.Rmd
analysis_02_regression.Rmd
  • Analysis files that DO directly create a table or figure should be prefixed with the word “table” or “fig” respectively and named as follows
    • table_[brief summary of content] or
    • fig_[brief summary of content]
# Examples
table_network_characteristics.Rmd
fig_reporting_patterns.Rmd

🗒Side Note: I sometimes do data manipulation (create variables, subset data, reshape data) in an analysis file if that analysis (or table or chart) is the only analysis that uses the modified data. Otherwise, I do the modifications in a separate data cleaning file.

  • Images
    • Should typically be exported as png (especially when they are intended for use HTML files).
    • Should typically be saved in a separate “img” folder under the project home directory.
    • Should be given a descriptive name.
      • Example: histogram_heights.png, NOT fig_02.png.
    • I have found that the following image sizes typically work pretty well for my projects.
      • 1920 x 1080 for HTML
      • 770 x 360 for Word
  • Word and PDF output files
    • I typically save them in a separate “docs” folder under the project home directory
    • Whenever possible, I try to set the Word or PDF file name to match the name of the R file that it was created in.
      • Example: first_quarter_report.Rmd creates docs/first_quarter_report.pdf
  • Exported data files (i.e., RDS, RData, CSV, Excel, etc.)
    • I typically save them in a separate “data” folder under the project home directory.
    • Whenever possible, I try to set the Word or PDF file name to match the name of the R file that it was created in.
      • Example: data_03_texas_only.Rmd creates data/data_03_texas_only.csv

References

5.
Wickham H. Style guide. In: Advanced R.; 2019.