10 Coding Best Practices
At this point in the book, we’ve talked a little bit about what R is. We’ve also talked about the RStudio IDE and taken a quick tour around its four main panes. Finally, we wrote our first little R program, which simulated and analyzed some data about a hypothetical class. Writing and executing this R program officially made you an R programmer. 🏆
However, you should know that not all R code is equally “good” – even when it’s equally valid. What do I mean by that? Well, we already discussed the R interpreter and R syntax in the chapter on speaking R’s language. Any code that uses R syntax that the R interpreter can understand is valid R code. But, is the R interpreter the only one reading your R code? No way! In epidemiology, we collaborate with others all the time! That collaboration is going to be much more efficient and enjoyable when there is good communication – including R code that is easy to read and understand. Further, you will often need to read and/or reuse code you wrote weeks, months, or years after you wrote it. You may be amazed at how quickly you forget what you did and/or why you did it that way. Therefore, in addition to writing valid R code, this chapter is about writing “good” R code – code that easily and efficiently communicates ideas to humans.
Of course, “good code” is inevitably somewhat subjective. Reasonable people can have a difference of opinion about the best way to write code that is easy to read and understand. Additionally, reasonable people can have a difference of opinion about when code is “good enough.” For these reasons, I’m going to offer several “suggestions” about writing good R code below, but only two general principles, which I believe most R programmers would agree with.
10.1 General principles
Comment your code. Whether you intend to share your code with other people or not, make sure to write lots of comments about what you are trying to accomplish in each section of your code and why.
Use a style consistently. I’m going to suggest several guidelines for styling your R code below, but you may find that you prefer to style your R code in a different way. Whether you adopt my suggested style or not, please find or create a style that works for you and your collaborators and use it consistently.
10.2 Code comments
There isn’t a lot of specific advice that I can give here because comments are so idiosyncratic to the task at hand. So, I think the best I can do at this point is to offer a few examples for you to think about.
10.2.1 Defining key variables
As we will discuss below, variables should have names that are concise, yet informative. However, the data you receive in the real world will not always include informative variable names. Even when someone has given the variables informative names, there may still be contextual information about the variables that is important to understand for data management and analysis. Some data sets will come with something called a codebook or data dictionary. These are text files intended to provide you with some of that more detailed information about the data set, for example, the survey questions that were used to capture the values in each variable, or what category each value in a categorical variable represents. However, real data sets don’t always come with a data dictionary, and even when they do, it can be convenient to have some of that contextual information close at hand, right next to your code. Therefore, I will sometimes comment my code with information about variables that are important for the analysis at hand. Here is an example from an administrative data set I was using for an analysis:
* **Case number definition**
- Case / investigation number.
* **Intake stage definition**
- An ID number assigned to the Intake. Each Intake (Report) has its
own number. A case may have more than one intake. For example, case # 12345
has two intakes associated with it, 9 days apart, each with their own ID
number. Each of the two intakes associated with this case have multiple
allegations.
* **Intake start definition**
- An intake is the submission or receipt of a report - a phone call or
web-based. The Intake Start Date refers to the date the staff member
opens a new record to begin recording the report.
10.2.2 What this code is trying to accomplish
Sometimes, it is obvious what a section of code literally does, but not so obvious why you’re doing it. I often try to write some comments around my code about what it’s ultimately trying to accomplish and why. For example:
## Standardize character strings
# Because we will merge this data with other data sets in the future based on
# character strings (e.g., name), we need to go ahead and standardize their
# formats here. This will prevent mismatches during the merges. Specifically,
# we:
# 1. Transform all characters to lower case
# 2. Remove any special characters (e.g., hyphens, periods)
# 3. Remove trailing spaces (e.g., "John Smith ")
# 4. Remove double spaces (e.g., "John  Smith")
vars <- quos(full_name, first_name, middle_name, last_name, county, address, city)
client_data <- client_data %>%
  mutate_at(vars(!!! vars), tolower) %>%
  mutate_at(vars(!!! vars), stringr::str_replace_all, "[^a-zA-Z\\d\\s]", " ") %>%
  mutate_at(vars(!!! vars), stringr::str_replace, "[[:blank:]]$", "") %>%
  mutate_at(vars(!!! vars), stringr::str_replace_all, "[[:blank:]]{2,}", " ")
rm(vars)
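To see what the four steps listed in those comments do to individual values, here is a minimal base R sketch applied to a few made-up names. The names_raw vector and the standardize_text() helper are hypothetical and are not part of the original analysis. One deliberate difference: step 3 here strips every trailing blank, not just one.

```r
# Hypothetical example values (not from the original data set)
names_raw <- c("John  Smith ", "Mary-Ann O'Neil", "JOSE  GARCIA.")

# Hypothetical helper that applies the four steps in order
standardize_text <- function(x) {
  x <- tolower(x)                          # 1. Transform to lower case
  x <- gsub("[^a-z0-9[:space:]]", " ", x)  # 2. Replace special characters with a space
  x <- gsub("[[:blank:]]+$", "", x)        # 3. Remove trailing spaces
  x <- gsub("[[:blank:]]{2,}", " ", x)     # 4. Collapse repeated spaces into one
  x
}

names_clean <- standardize_text(names_raw)
names_clean
# "john smith" "mary ann o neil" "jose garcia"
```

After standardization, "John  Smith " and "john smith" would match each other in a merge, which is the whole point of the step.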
10.2.3 Why I chose this particular strategy
In addition to writing comments about why I did something, I sometimes write comments about why I did it instead of something else. Doing this can save you from having to relearn lessons you’ve already learned through trial and error but forgot. For example:
### Create exact match dummy variables
* We reshape the data from long to wide to create these variables because it significantly decreases computation time compared to doing this as a group_by operation on the long data.
10.3 Style guidelines
UsInG c_o_n_s_i_s_t_e_n_t STYLE i.s. import-ant!
Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations… Good style is important because while your code only has one author, it’ll usually have multiple readers. This is especially true when you’re writing code with others. In that case, it’s a good idea to agree on a common style up-front. Since no style is strictly better than another, working with others may mean that you’ll need to sacrifice some preferred aspects of your style.5
Below, I outline the style that I and my collaborators typically use when writing R code for a research project. It generally follows the Tidyverse style guide, which I strongly suggest you read. Outside of my class, you don’t have to use my style, but you really should find or create a style that works for you and your collaborators and use it consistently.
10.3.2 Object (variable) names
In addition to the object naming guidance given in the Tidyverse style guide, I suggest the following object naming conventions.
10.3.3 Use names that are informative
Using names that are informative and easy to remember will make life easier for everyone who uses your data – including you!
10.3.3.1 Use names that are concise
You want names to be informative, but you don’t want them to be overly verbose. Really long names create more work for you and more opportunities for typos. In fact, I recommend using a single word when you can.
10.3.3.2 Use all lowercase letters
Remember, R is case-sensitive, which means that myStudyData and mystudydata are different things to R. Capitalizing letters in your object names just creates additional details to remember and potentially mess up. Just keep it simple and stick with lowercase letters.
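A quick way to convince yourself of this is to create an object and then ask R whether a differently capitalized version of its name exists. The object below is made up purely for illustration, and the second result assumes you haven't already created an object called myStudyData in your session.

```r
# Hypothetical object used only to illustrate case sensitivity
mystudydata <- data.frame(id = 1:3)

found_lower <- exists("mystudydata") # TRUE
found_camel <- exists("myStudyData") # FALSE, R treats this as a different name

found_lower
found_camel
```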
10.3.3.3 Separate multiple words with underscores
Sometimes you really just need to use multiple words to name your object. In those cases, I suggest separating words with an underscore.
# Multiple words running together - Hard to read - Don't use
mycancerdata
# Camel case - easier to read, but more to remember and mess up - Don't use
myCancerData
# Separate with periods - easier to read, but doesn't translate well to many
# other languages. For example, SAS won't accept variable names with
# periods - Don't use
my.cancer.data
# Separate with underscores - Use
my_cancer_data
10.3.3.4 Prefix the names of similar variables
When you have multiple related variables, it’s good practice to start their variable names with the same word. It makes these related variables easier to find and work with in the future if we need to do something with all of them at once. We can sort our variable names alphabetically to easily find them. Additionally, we can use variable selectors like starts_with("name") to perform some operation on all of them at once.
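As a rough sketch of both ideas, the example below builds a small made-up data frame whose name columns share the prefix name_, sorts the column names, and then uses starts_with() from the dplyr package to select and modify those columns together. The data frame and its column names are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical data; the related columns all start with "name"
people <- data.frame(
  name_last  = c("smith", "jones"),
  age        = c(34, 29),
  name_first = c("john", "mary")
)

# Sorting the column names alphabetically puts the related columns together
sorted_names <- sort(names(people))
sorted_names

# Select only the columns whose names start with "name"
name_cols <- select(people, starts_with("name"))

# Convert all of the name columns to upper case in one step
people_upper <- mutate(people, across(starts_with("name"), toupper))
people_upper
```

Because the selection is based on the prefix, any new column added later with a name that starts with "name" would be picked up without changing this code.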
10.3.4 File Names
All the variable naming suggestions above also apply to file names. However, I make a few additional suggestions specific to file names below.
10.3.4.1 Managing multiple files in projects
When you are doing data management and analysis for real-world projects you will typically need to break the code up into multiple files. If you don’t, the code often becomes really difficult to read and manage. Having said that, finding the code you are looking for when there are 10, 20, or more separate files isn’t much fun either. Therefore, I suggest the following (or similar) file naming conventions be used in your projects.
- Separate data cleaning and data analysis into separate files (typically, .R or .Rmd).
  - Data cleaning files should be prefixed with the word “data” and named as follows
    - data_[order number]_[purpose]
  - Analysis files that do not directly create a table or figure should be prefixed with the word “analysis” and named as follows
    - analysis_[order number]_[brief summary of content]
  - Analysis files that DO directly create a table or figure should be prefixed with the word “table” or “fig” respectively and named as follows
    - table_[brief summary of content] or
    - fig_[brief summary of content]
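If you want to generate names that follow these patterns instead of typing them by hand, a small helper along these lines could work. This is only a sketch: make_file_name() is a hypothetical function, and the two-digit zero padding of the order number and the default .Rmd extension are my assumptions based on the file names used as examples in this chapter.

```r
# Hypothetical helper that builds [prefix]_[order number]_[summary].[ext]
make_file_name <- function(prefix, summary, order = NULL, ext = "Rmd") {
  # Lower case, with anything that isn't a letter or digit turned into "_"
  summary <- gsub("[^a-z0-9]+", "_", tolower(summary))
  summary <- gsub("^_|_$", "", summary)
  if (is.null(order)) {
    # Table and figure files have no order number
    sprintf("%s_%s.%s", prefix, summary, ext)
  } else {
    # Zero-pad the order number so that files sort correctly (01, 02, ..., 10)
    sprintf("%s_%02d_%s.%s", prefix, order, summary, ext)
  }
}

make_file_name("data", "Texas only", order = 3)        # "data_03_texas_only.Rmd"
make_file_name("analysis", "Missing data", order = 1)  # "analysis_01_missing_data.Rmd"
make_file_name("table", "Demographics")                # "table_demographics.Rmd"
```

Zero padding matters once a project has ten or more numbered files, because file browsers sort names as text and would otherwise place data_10 before data_2.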
🗒Side Note: I sometimes do data manipulation (create variables, subset data, reshape data) in an analysis file if that analysis (or table or chart) is the only analysis that uses the modified data. Otherwise, I do the modifications in a separate data cleaning file.
- Images
  - Should typically be exported as png (especially when they are intended for use in HTML files).
  - Should typically be saved in a separate “img” folder under the project home directory.
  - Should be given a descriptive name.
    - Example: histogram_heights.png, NOT fig_02.png.
  - I have found that the following image sizes typically work pretty well for my projects.
    - 1920 x 1080 for HTML
    - 770 x 360 for Word
- Word and PDF output files
  - I typically save them in a separate “docs” folder under the project home directory.
  - Whenever possible, I try to set the Word or PDF file name to match the name of the R file that it was created in.
    - Example: first_quarter_report.Rmd creates docs/first_quarter_report.pdf
- Exported data files (i.e., RDS, RData, CSV, Excel, etc.)
  - I typically save them in a separate “data” folder under the project home directory.
  - Whenever possible, I try to set the exported data file name to match the name of the R file that it was created in.
    - Example: data_03_texas_only.Rmd creates data/data_03_texas_only.csv
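The folder and naming conventions above can be sketched in base R as follows. Everything specific here is an assumption made for illustration: the class_data data frame is made up, and a temporary folder stands in for the project home directory so that running the example doesn't write into a real project.

```r
# Stand-in for the project home directory (a temporary folder for this example)
project_dir <- file.path(tempdir(), "my_project")
img_dir  <- file.path(project_dir, "img")
data_dir <- file.path(project_dir, "data")
dir.create(img_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical data
class_data <- data.frame(
  state  = c("texas", "texas", "ohio"),
  height = c(64, 70, 68)
)

# Save a descriptively named png at 1920 x 1080 pixels (the size suggested for HTML)
img_path <- file.path(img_dir, "histogram_heights.png")
png(img_path, width = 1920, height = 1080)
hist(class_data$height, main = "Heights", xlab = "Height")
dev.off()

# Save exported data under a name that matches the file that created it,
# assuming this code lives in data_03_texas_only.Rmd
texas_only <- class_data[class_data$state == "texas", ]
data_path <- file.path(data_dir, "data_03_texas_only.csv")
write.csv(texas_only, data_path, row.names = FALSE)
```

In a real project you would replace project_dir with your actual project home directory, for example the folder that contains your RStudio project file.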
10.3.1 Comments
Please put a space in between the pound/hash sign and the rest of your text when writing comments. For example, write # here is my comment instead of #here is my comment. It just makes the comment easier to read.