Appendix A — Glossary

Anchor: A regex metacharacter that pins a match to a position rather than characters: ^ matches the start of a string or line, and $ matches the end.
Arguments: Arguments always live inside the parentheses of R functions and receive information the function needs to generate the result we want.
Bivariate: Involving two variables.
Collapse: To reduce a data set to a smaller, summarised form by grouping and aggregating. In dplyr, summarise() collapses multiple rows per group into a single summary row.
Complete case analysis: A summary containing the count and percent of non-missing values for a categorical variable.
Console: The console is located in RStudio’s bottom-right pane by default. The R console is an interactive programming environment where we can enter and execute R commands. It’s the most basic interface for interacting with R, providing immediate feedback and results from the code we enter. The R console is useful for testing small pieces of code and interactive data exploration. However, we recommend using R scripts or Quarto files for all but the simplest programming or data-analysis tasks.
Conditional Operations: Programming steps that execute different code paths depending on whether a logical test is TRUE or FALSE. In R this includes the if / else family, vectorised helpers such as ifelse(), and higher-level wrappers like case_when().
Data Checks: Procedures used to confirm that the contents, structure, and metadata of a data set meet expected standards before analysis. Examples include type checks (numeric vs. character), range checks (no ages below 0), completeness checks (missing-value rates), and cross-field consistency checks (start ≤ end dates).
Data frame: R’s term for a data set or data table. A data frame is a two-dimensional structure made up of columns (variables) and rows (observations). All columns must have the same length.
For loop: A programming statement that executes a block of code repeatedly while systematically iterating over a predefined range, sequence, or collection of values.
Frequency count: Also called the frequency, the count, or simply n. It is a summary of data by the number of distinct occurrences of an event or observation.
Functions: In R, a function is an object that bundles a block of reusable code, optionally accepts arguments, and returns a result. Functions promote modularity, abstraction, and reproducibility. You create one with the function(...) { } construct or by assigning an anonymous function to a name.
Global environment: The top-level workspace of an R session (formally .GlobalEnv) that stores all objects the user creates interactively or sources from scripts. When you reference a name at the console, R searches the global environment first.
Issue (GitHub): GitHub’s documentation says issues are “items you can create in a repository to plan, discuss and track work. Issues are simple to create and flexible to suit a variety of scenarios. You can use issues to track work, give or receive feedback, collaborate on ideas or tasks, and efficiently communicate with others.”¹
Iteratively: A method of solving a problem by repeatedly executing a set of instructions in a step-by-step manner, often using loops. This approach can improve efficiency and help prevent errors.
Join: In dplyr and SQL, an operation that brings together two tables based on shared key columns. Variants include inner_join(), left_join(), right_join(), and full_join().
Key: One or more variables whose combined values uniquely identify a record (row) within a data set and enable accurate merges/joins with other tables.
Lexical scoping rules: Rules that define an object’s accessibility based on where it is declared in the code’s structure rather than how or when it is called.
List-wise Deletion: A missing-data handling method in which any observation (row) that contains at least one missing value on variables of interest is completely removed from the analysis, leaving only “complete cases.”
Long: A tidy-data format where repeated measures are stacked down rows rather than across columns: one column stores the measurement (value), another records the variable or occasion label, and each row corresponds to one observation at one time point for one entity.
MDL: Minimum Description Length (MDL) principle — an information-theoretic model-selection rule stating that the best explanation of data is the one that yields the shortest total code length required to describe both the model and the data given that model.
Marginal totals: In a contingency table, marginal totals are the sums of observations for each row (found in the last column) and each column (found in the last row). The overall total, found in the bottom-right cell, represents the sum of all observations in the table.
Mean: The arithmetic mean—often denoted $\bar{x}$—is calculated by summing all values in a numeric variable and dividing by the total number of values.
Median: The middle number in an ordered list of values. When the list has an even number of elements, the median is the average of the two middle numbers. Compared with the mean, the median is relatively resistant to extreme values.
Merge: A base-R term (function merge()) for combining two data frames by matching rows on one or more key variables. Rows that do not match can be kept, dropped, or produce missing values depending on the arguments.
Mode: The value that occurs most often in a set of data. A data set may be unimodal (one mode), multimodal (many modes), or have no mode (all values are equally frequent).
Non-standard Evaluation: In R, non-standard evaluation (NSE) is a set of language features that allow functions to capture, inspect, and manipulate the expressions that users write, rather than the values those expressions evaluate to. NSE powers the concise, “bare-variable-name” syntax in packages like dplyr (filter(df, age > 30)) and tidy-evaluation tools ({ }, .data[[ ]]) that let you combine NSE with standard evaluation in a controlled way.
Objects: Any data structure stored in R and bound to a name — including vectors, data frames, lists, functions, environments, models, or S3/S4 objects. Everything you create or load into R is an object.
Outcome variable: The variable whose value we are attempting to predict, estimate, or determine. Also called the dependent variable or response variable.
Pass: In programming lingo, we pass a value to a function argument—for example, in seq(from = 2, to = 100, by = 2) we pass 2 to from, 100 to to, and 2 to by.
Percentage: The relative frequency of an event expressed as a percentage. Calculated by dividing the number of occurrences by the total number of observations and multiplying by 100.
Person-level: A data structure in which each row represents exactly one person, typically containing time-invariant attributes (e.g., date of birth, sex) or pre-aggregated summaries.
Person-period: A panel/longitudinal data layout where each row corresponds to a unique combination of a person and a time period (e.g., person × year), holding variables that can change over time.
Predictor variable: The variable used to explain or help predict the value of the outcome variable. Also called the independent variable or explanatory variable.
Proportion: The relative frequency of an event or observation, often expressed as a fraction or percentage. Calculated by dividing the number of occurrences by the total number of observations.
Quantifier: A regex symbol that specifies how many times the preceding token should occur. Common quantifiers include * (0 +), + (1 +), ? (0 – 1), and {m,n} (between m and n times).
R: “R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.”² R is open source, and you can download it for free from The Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/.
Range: The difference between the maximum and minimum values in a data set.
Regular Expressions: Compact strings that describe search patterns for text. Regular expressions (regexes) are used for tasks such as finding, extracting, replacing, or validating character data (stringr, grepl(), gsub(), etc.).
Repository: “A repository contains all of your code, your files, and each file’s revision history. You can discuss and manage your work within the repository.”³ A repository can exist locally on your computer or remotely on a server such as GitHub.
Return: Instead of saying, “the seq() function gives us a sequence of numbers…,” we could say, “the seq() function returns a sequence of numbers.” In programming lingo, functions return one or more results.
RStudio: An integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor with direct code execution, and tools for plotting, debugging, and workspace management. RStudio is available as open-source desktop software and as server versions.⁴
Split - Apply - Combine: A data-analysis strategy that involves splitting data into smaller components, applying a calculation separately to each component, and then combining the individual results into a single result. dplyr::group_by() uses this strategy.
Standard deviation: A measure of spread equal to the square root of the variance—the average of the squared differences between each value and the mean.
Token: The smallest meaningful unit in a regular expression or in parsed text. Tokens include literal characters (a), metacharacters (\d), or entire character classes ([A-Z]).
Two-way frequency table: A table that summarises the relationship between two categorical variables by displaying the observed frequencies for their combinations. Also called a crosstab or contingency table.
Univariate: Involving a single numeric or a single categorical variable.
Variance: A measure of spread calculated as the average of the squared differences between each value and the mean.
Vectorization: A programming technique in which operations are applied to entire vectors (or matrices/data frames) in a single step rather than iterating element-by-element. Vectorized code in R (x * 2, mean(x)) is clear and fast because the heavy computation occurs in compiled code under the hood.
Wide: A data format where repeated measures or related observations are stored in separate columns (e.g., score_T1, score_T2, score_T3 for test scores across three terms).