R for Epidemiology
9 Standardization
This chapter is under heavy development and may still undergo significant changes.