5  Let’s Get Programming

In this chapter, we are going to tie together many of the concepts we’ve learned so far, and you are going to create your first basic R program. Specifically, you are going to write a program that simulates some data and analyzes it.

5.1 Simulating data

Data simulation can be really complicated, but it doesn’t have to be. It is simply the process of creating data as opposed to finding data in the wild. This can be really useful in several different ways.

  1. Simulating data is really useful for getting help with a problem you are trying to solve. Often, it isn’t feasible for you to send other people the actual data set you are working on when you encounter a problem you need help with. Sometimes, it may not even be legally allowed (e.g., for privacy reasons). Instead of sending them your entire data set, you can simulate a little data set that recreates the challenge you are trying to address without all the other complexity of the full data set. As a bonus, we have often found that we end up figuring out the solution to the problem we’re trying to solve as we recreate the problem in a simulated data set that we intended to share with others.

  2. Simulated data can also be useful for learning about and testing statistical assumptions. In epidemiology, we use statistics to draw conclusions about populations of people we are interested in based on samples of people drawn from the population. Because we don’t actually have data from all the people in the population, we have to make some assumptions about the population based on what we find in our sample. When we simulate data, we know the truth about our population because we created our population to have that truth. We can then use this simulated population to play “what if” games with our analysis. What if we only sampled half as many people? What if their heights aren’t actually normally distributed? What if we used a probit model instead of a logit model? Going through this process and answering these questions can help us understand how much, and under what circumstances, we can trust the answers we found in the real world.
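To make the “what if” idea a little more concrete, here is a minimal sketch of a simulated population and two samples of different sizes. The functions set.seed(), rnorm(), and sample() are base R functions that we haven’t introduced yet, and the population size, mean, and standard deviation are made-up values chosen only for illustration.

```r
# Make the random numbers reproducible
set.seed(123)

# Simulate a "population" of 10,000 heights (in inches).
# We choose the true mean (68) and standard deviation (3) ourselves.
population <- rnorm(n = 10000, mean = 68, sd = 3)

# Draw a sample of 100 people and a smaller sample of 50 people
sample_100 <- sample(population, size = 100)
sample_50  <- sample(population, size = 50)

# Compare each sample mean with the population mean
mean(population)
mean(sample_100)
mean(sample_50)
```

Because we created the population, we know its mean is close to 68, so we can see how far each sample mean lands from the truth.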

So, let’s go ahead and write a complete R program to simulate and analyze some data. As we said, it doesn’t have to be complicated. In fact, in just a few lines of R code below we simulate and analyze some data about a hypothetical class.

class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)
class
  names heights
1  John      68
2 Sally      63
3  Brad      71
4  Anne      72
mean(class$heights)
[1] 68.5

As you can see, this data frame contains the students’ names and heights. We also use the mean() function to calculate the average height of the class. By the end of this chapter, you will understand all the elements of this R code and how to simulate your own data.

5.2 Vectors

Vectors are the most fundamental data structure in R. Here, data structure means “container for our data.” There are other data structures as well; however, they are all built from vectors. That’s why we say vectors are the most fundamental data structure. Some of these other structures include matrices, lists, and data frames. In this book, we won’t use matrices or lists much at all, so you can forget about them for now. Instead, we will almost exclusively use data frames to hold and manipulate our data. However, because data frames are built from vectors, it can be useful to start by learning a little bit about them. Let’s create our first vector now.

# Create an example vector
names <- c("John", "Sally", "Brad", "Anne")
# Print contents to the screen
names
[1] "John"  "Sally" "Brad"  "Anne" 

👆Here’s what we did above:

  • We created a vector of names with the c() (short for combine) function.

    • The vector contains four values: “John”, “Sally”, “Brad”, and “Anne”.

    • All of the values are character strings (i.e., words). We know this because all of the values are wrapped with quotation marks.

    • We used double quotes above, but we could have also used single quotes. We cannot, however, mix double and single quotes for each character string. For example, c("John', ...) won’t work.

  • We assigned that vector of character strings to the word names using the <- function.

    • R now recognizes names as an object that we can do things with.

    • R programmers may refer to the names object as “the names object”, “the names vector”, or “the names variable”. For our purposes, these all mean the same thing.

  • We printed the contents of the names object to the screen by typing the word “names”.

    • R returns (shows us) the four character values (“John” “Sally” “Brad” “Anne”) on the computer screen.

Try copying and pasting the code above into the RStudio console on your computer. You should notice the names vector appear in your global environment. You may also notice that the global environment pane gives you some additional information about this vector to the right of its name. Specifically, you should see chr [1:4] "John" "Sally" "Brad" "Anne". This is R telling us that names is a character vector (chr), with four values ([1:4]), and the first four values are "John" "Sally" "Brad" "Anne".

5.2.1 Vector types

There are several different vector types, but each vector can have only one type. The type of the vector above was character. We can validate that with the typeof() function like so:

typeof(names)
[1] "character"

The other vector types that we will use in this book are double, integer, and logical. Double vectors hold real numbers and integer vectors hold integers. Collectively, double vectors and integer vectors are known as numeric vectors. Logical vectors can only hold the values TRUE and FALSE. Here are some examples of each:

5.2.2 Double vectors

# A numeric vector
my_numbers <- c(12.5, 13.98765, pi)
my_numbers
[1] 12.500000 13.987650  3.141593
typeof(my_numbers)
[1] "double"

5.2.3 Integer vectors

Creating integer vectors involves a weird little quirk of the R language. For some reason, and we have no idea why, we must type an “L” behind the number to make it an integer.

# An integer vector - first attempt
my_ints_1 <- c(1, 2, 3)
my_ints_1
[1] 1 2 3
typeof(my_ints_1)
[1] "double"
# An integer vector - second attempt
# Must put "L" behind the number to make it an integer. No idea why they chose "L".
my_ints_2 <- c(1L, 2L, 3L)
my_ints_2
[1] 1 2 3
typeof(my_ints_2)
[1] "integer"

5.2.4 Logical vectors

# A logical vector
# Type TRUE and FALSE in all caps
my_logical <- c(TRUE, FALSE, TRUE)
my_logical
[1]  TRUE FALSE  TRUE
typeof(my_logical)
[1] "logical"

Rather than have an abstract discussion about the particulars of each of these vector types right now, we think it’s best to wait and learn more about them when they naturally arise in the context of a real challenge we are trying to solve with data. At this point, just having some vague idea that they exist is good enough.
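As a quick illustration of the rule that each vector can have only one type, the sketch below mixes values of different types inside c() and checks the result with typeof(). The object names (mixed_chr, mixed_dbl, mixed_int) are made up for this example.

```r
# Mixing a number and a character string gives a character vector
mixed_chr <- c(1, "a")
typeof(mixed_chr)
# [1] "character"

# Mixing an integer and a double gives a double vector
mixed_dbl <- c(1L, 2.5)
typeof(mixed_dbl)
# [1] "double"

# Mixing a logical and an integer gives an integer vector
mixed_int <- c(TRUE, 2L)
typeof(mixed_int)
# [1] "integer"
```

In each case, R quietly converts the values to a single type that can hold all of them.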

5.2.5 Factor vectors

Above, we said that we would only work with four vector types in this book: character, double, integer, and logical. Technically, that is true. Factors aren’t technically a vector type (we will explain below) but calling them a vector type is close enough to true for our purposes. We will briefly introduce you to factors here, and then discuss them in more depth later in the chapter on [Numerical Descriptions of Categorical Variables]. We cover them in greater depth there because factors are most useful in the context of working with categorical data – data that is grouped into discrete categories. Some examples of categorical variables commonly seen in public health data are sex, race or ethnicity, and level of educational attainment.

In R, we can represent a categorical variable in multiple different ways. For example, let’s say that we are interested in recording people’s highest level of formal education completed in our data. The discrete categories we are interested in are:

  • 1 = Less than high school

  • 2 = High school graduate

  • 3 = Some college

  • 4 = College graduate

We could then create a numeric vector to record the level of educational attainment for four hypothetical people as shown below.

# A numeric vector of education categories
education_num <- c(3, 1, 4, 1)
education_num
[1] 3 1 4 1

But what is less-than-ideal about storing our categorical data this way? Well, it isn’t obvious what the numbers in education_num mean. For the purposes of this example, we defined them above, but if we didn’t have that information then we would likely have no idea what categories the numbers represent.

We could also create a character vector to record the level of educational attainment for four hypothetical people as shown below.

# A character vector of education categories
education_chr <- c(
  "Some college", "Less than high school", "College graduate", 
  "Less than high school"
)
education_chr
[1] "Some college"          "Less than high school" "College graduate"     
[4] "Less than high school"

But this strategy also has a few limitations that we will discuss in the chapter on [Numerical Descriptions of Categorical Variables]. For now, we just need to quickly learn how to create and identify factor vectors.

Typically, we don’t create factors from scratch. Instead, we typically convert (or “coerce”) an existing numeric or character vector into a factor. For example, we can coerce education_num to a factor like this:

# Coerce education_num to a factor
education_num_f <- factor(
  x      = education_num,
  levels = 1:4,
  labels = c(
    "Less than high school", "High school graduate", "Some college", 
    "College graduate"
  )
)
education_num_f
[1] Some college          Less than high school College graduate     
[4] Less than high school
4 Levels: Less than high school High school graduate ... College graduate

👆 Here’s what we did above:

  • We used the factor() function to create a new factor version of education_num.

    • You can type ?factor into your R console to view the help documentation for this function and follow along with the explanation below.

    • The first argument to the factor() function is the x argument. The value passed to the x argument should be a vector of data. We passed the education_num vector to the x argument.

    • The second argument to the factor() function is the levels argument. This argument tells R the unique values that the new factor variable can take. We used the shorthand 1:4 to tell R that education_num_f can take the unique values 1, 2, 3, or 4.

    • The third argument to the factor() function is the labels argument. The value passed to the labels argument should be a character vector of labels (i.e., descriptive text) for each value in the levels argument. The order of the labels in the character vector we pass to the labels argument should match the order of the values passed to the levels argument. For example, the ordering of levels and labels above tells R that 1 should be labeled with “Less than high school”, 2 should be labeled with “High school graduate”, etc.

  • We used the assignment operator (<-) to save our new factor vector in our global environment as education_num_f.

    • If we had used the name education_num instead, then the previous values in the education_num vector would have been replaced with the new values. That is sometimes what we want to happen. However, when it comes to creating factors, we typically keep the numeric version of the vector and create an additional factor version of the vector. We just often find that it can be useful to have both versions of the variable hanging around during the analysis process.

    • We also use the _f naming convention in our code. That means that when we create a new factor vector, we name it the same thing the original vector was named with the addition of _f (for factor) at the end.

  • We printed the vector to the screen. The values in education_num_f look similar to the character strings displayed in education_chr. Notice, however, that the values no longer have quotes around them and R displays 4 Levels: Less than high school High school graduate ... College graduate below the data values. This is R telling us the possible categorical values that this factor could take on (abbreviated with ... when they don’t all fit on one line). This is a telltale sign that the vector being printed to the screen is a factor.

Interestingly, although R uses labels to make factors look like character vectors, they are still integer vectors under the hood. For example:

typeof(education_num_f)
[1] "integer"

And we can still view them as such.

as.numeric(education_num_f)
[1] 3 1 4 1
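If you want to see the full set of categories a factor can take, rather than the abbreviated Levels line, one option is sketched below. It uses the base R functions levels() and table(), which we haven’t introduced yet, and recreates education_num_f so that the code can be run on its own.

```r
# Recreate the factor from above
education_num <- c(3, 1, 4, 1)
education_num_f <- factor(
  x      = education_num,
  levels = 1:4,
  labels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate"
  )
)

# List every category the factor can take, in order
levels(education_num_f)

# Count the people in each category. Categories that nobody falls into
# (here, "High school graduate") are kept and shown with a count of 0.
table(education_num_f)
```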

It is also possible to coerce character vectors to factors. For example, we can coerce education_chr to a factor like so:

# Coerce education_chr to a factor
education_chr_f <- factor(
  x      = education_chr,
  levels = c(
    "Less than high school", "High school graduate", "Some college", 
    "College graduate"
  )
)
education_chr_f
[1] Some college          Less than high school College graduate     
[4] Less than high school
4 Levels: Less than high school High school graduate ... College graduate

👆 Here’s what we did above:

  • We coerced a character vector (education_chr) to a factor using the factor() function.

  • Because the levels are character strings, there was no need to pass any values to the labels argument this time. Keep in mind, though, that the order of the values passed to the levels argument matters. It will be the order that the factor levels will be displayed in our analyses.
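One thing to watch for when coercing character vectors is spelling. The sketch below, which uses the made-up objects education_levels and education_typo, shows what happens when a value doesn’t exactly match any of the values passed to the levels argument: R turns it into a missing value (NA) instead of giving an error.

```r
# The levels we want the factor to have
education_levels <- c(
  "Less than high school", "High school graduate", "Some college",
  "College graduate"
)

# The second value starts with a lowercase "s", so it matches no level
education_typo <- c("Some college", "some college", "College graduate")
education_typo_f <- factor(x = education_typo, levels = education_levels)

# Print the factor and check which values became missing
education_typo_f
is.na(education_typo_f)
# [1] FALSE  TRUE FALSE
```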

You might reasonably wonder why we would want to convert character vectors to factors, but we will save that discussion for the chapter on [Numerical Descriptions of Categorical Variables].

5.3 Data frames

Vectors are useful for storing a single characteristic where all the data is of the same type. However, in epidemiology, we typically want to store information about many different characteristics of whatever we happen to be studying. For example, we didn’t just want the names of the people in our class, we also wanted the heights. Of course, we can also store the heights in a vector like so:

heights <- c(68, 63, 71, 72)
heights
[1] 68 63 71 72

But this vector, in and of itself, doesn’t tell us which height goes with which person. When we want to create relationships between our vectors, we can use them to build a data frame. For example:

# Create a vector of names
names <- c("John", "Sally", "Brad", "Anne")
# Create a vector of heights
heights <- c(68, 63, 71, 72)
# Combine them into a data frame
class <- data.frame(names, heights)
# Print the data frame to the screen
class
  names heights
1  John      68
2 Sally      63
3  Brad      71
4  Anne      72

👆Here’s what we did above:

  • We created a data frame with the data.frame() function.

    • The first argument we passed to the data.frame() function was a vector of names that we previously created.

    • The second argument we passed to the data.frame() function was a vector of heights that we previously created.

  • We assigned that data frame to the word class using the <- function.

    • R now recognizes class as an object that we can do things with.

    • R programmers may refer to this class object as “the class object” or “the class data frame”. For our purposes, these all mean the same thing. We could also call it a data set, but that term isn’t used much in R circles.

  • We printed the contents of the class object to the screen by typing the word “class”.

    • R returns (shows us) the data frame on the computer screen.

Try copying and pasting the code above into the RStudio console on your computer. You should notice the class data frame appear in your global environment. You may also notice that the global environment pane gives you some additional information about this data frame to the right of its name. Specifically, you should see 4 obs. of 2 variables. This is R telling us that class has four rows or observations (4 obs.) and two columns or variables (2 variables). If you click the little blue arrow to the left of the data frame’s name, you will see information about the individual vectors that make up the data frame.
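The same information shown in the global environment pane can also be requested from the console. The sketch below uses the base R functions nrow(), ncol(), colnames(), and str(), none of which have been introduced yet.

```r
# Recreate the class data frame
names   <- c("John", "Sally", "Brad", "Anne")
heights <- c(68, 63, 71, 72)
class   <- data.frame(names, heights)

# Number of rows (observations)
nrow(class)
# [1] 4

# Number of columns (variables)
ncol(class)
# [1] 2

# Column names
colnames(class)
# [1] "names"   "heights"

# A compact summary of the data frame and the vectors that make it up
str(class)
```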

As a shortcut, instead of creating individual vectors and then combining them into a data frame as we’ve done above, most R programmers will create the vectors (columns) directly inside of the data frame function like this:

# Create the class data frame
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
) # Closing parenthesis down here.

# Print the data frame to the screen
class
  names heights
1  John      68
2 Sally      63
3  Brad      71
4  Anne      72

As you can see, both methods produce the exact same result. The second method, however, requires a little less typing and results in fewer objects cluttering up your global environment. What we mean by that is that the names and heights vectors won’t exist independently in your global environment. Rather, they will only exist as columns of the class data frame.

You may have also noticed that when we created the names and heights vectors (columns) directly inside of the data.frame() function we used the equal sign (=) to assign values instead of the assignment arrow (<-). This is just one of those quirky R exceptions we talked about in the chapter on speaking R’s language. In fact, = and <- can be used interchangeably for ordinary assignment in R. It is by convention that we usually use <- for assigning values to objects, but use = for assigning values to columns inside a function call like data.frame(). We don’t know why this is the convention. If it were up to us, we wouldn’t do this. We would just pick = or <- and use it in all cases where we want to assign values. But, it isn’t up to us and we gave up on trying to fight it a long time ago. Your R programming life will be easier if you just learn to assign values this way – even if it’s dumb. 🤷
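For the curious, here is a small sketch of one practical difference between the two symbols when they are used inside data.frame(). The object and column names (df_equals, df_arrow, heights_a, heights_b) are made up for this example, and the comments assume a fresh R session in which those names don’t already exist.

```r
# Using = inside data.frame() names the column and creates nothing else
df_equals <- data.frame(heights_a = c(68, 63))
names(df_equals)
# [1] "heights_a"
exists("heights_a")
# [1] FALSE

# Using <- inside data.frame() creates a separate heights_b vector in the
# global environment, and the column does not get the clean name "heights_b"
df_arrow <- data.frame(heights_b <- c(68, 63))
exists("heights_b")
# [1] TRUE
names(df_arrow)
```

So, whatever the history behind the convention, using = inside data.frame() is what gives us tidy column names without extra objects cluttering up the global environment.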

Warning

By definition, all columns in a data frame must have the same length (i.e., number of rows). That means that each vector you create when building your data frame must have the same number of values in it. For example, the class data frame above has four names and four heights. If we had only entered three heights, we would have gotten the following error: Error in data.frame(names = c("John", "Sally", "Brad", "Anne"), heights = c(68, : arguments imply differing number of rows: 4, 3
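If you would like to see this error for yourself without stopping your R session, the sketch below wraps the call in tryCatch(), a base R function we haven’t covered, and stores the error message in a made-up object called result.

```r
# Try to build a data frame with four names but only three heights.
# tryCatch() captures the error and returns its message as text.
result <- tryCatch(
  data.frame(
    names   = c("John", "Sally", "Brad", "Anne"),
    heights = c(68, 63, 71)
  ),
  error = function(e) conditionMessage(e)
)

# Print the captured error message
result
# [1] "arguments imply differing number of rows: 4, 3"
```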

5.4 Tibbles

Tibbles are a data structure that comes from another tidyverse package – the tibble package. Tibbles are data frames and serve the same purpose in R that data frames serve; however, they are enhanced in several ways. 💪 You are welcome to look over the tibble documentation or the tibbles chapter in R for Data Science if you are interested in learning about all the differences between tibbles and data frames. For our purposes, there are really only a couple things we want you to know about tibbles right now.

First, tibbles are a part of the tibble package – NOT base R. Therefore, we have to install and load either the tibble package or the dplyr package (which loads the tibble package for us behind the scenes) before we can create tibbles. We typically just load the dplyr package.

# Install the dplyr package. YOU ONLY NEED TO DO THIS ONE TIME.
install.packages("dplyr")
# Load the dplyr package. YOU NEED TO DO THIS EVERY TIME YOU START A NEW R SESSION.
library(dplyr)

Second, we can create tibbles using one of three functions: as_tibble(), tibble(), or tribble(). We’ll show you some examples shortly.

Third, try not to be confused by the terminology. Remember, tibbles are data frames. They are just enhanced data frames.

5.4.1 The as_tibble function

We use the as_tibble() function to turn an already existing basic data frame into a tibble. For example:

# Create a data frame
my_df <- data.frame(
  name = c("john", "alexis", "Steph", "Quiera"),
  age  = c(24, 44, 26, 25)
)

# Print my_df to the screen
my_df
    name age
1   john  24
2 alexis  44
3  Steph  26
4 Quiera  25
# View the class of my_df
class(my_df)
[1] "data.frame"

👆Here’s what we did above:

  • We used the data.frame() function to create a new data frame called my_df.

  • We used the class() function to view my_df’s class (i.e., what kind of object it is).

    • The result returned by the class() function tells us that my_df is a data frame.

# Use as_tibble() to turn my_df into a tibble
my_df <- as_tibble(my_df)

# Print my_df to the screen
my_df
# A tibble: 4 × 2
  name     age
  <chr>  <dbl>
1 john      24
2 alexis    44
3 Steph     26
4 Quiera    25
# View the class of my_df
class(my_df)
[1] "tbl_df"     "tbl"        "data.frame"

👆Here’s what we did above:

  • We used the as_tibble() function to turn my_df into a tibble.

  • We used the class() function to view my_df’s class (i.e., what kind of object it is).

    • The result returned by the class() function tells us that my_df is still a data frame, but it is also a tibble. That’s what “tbl_df” and “tbl” mean.

5.4.2 The tibble function

We can use the tibble() function in place of the data.frame() function when we want to create a tibble from scratch. For example:

# Create a tibble
my_df <- tibble(
  name = c("john", "alexis", "Steph", "Quiera"),
  age  = c(24, 44, 26, 25)
)

# Print my_df to the screen
my_df
# A tibble: 4 × 2
  name     age
  <chr>  <dbl>
1 john      24
2 alexis    44
3 Steph     26
4 Quiera    25
# View the class of my_df
class(my_df)
[1] "tbl_df"     "tbl"        "data.frame"

👆Here’s what we did above:

  • We used the tibble() function to create a new tibble called my_df.

  • We used the class() function to view my_df’s class (i.e., what kind of object it is).

    • The result returned by the class() function tells us that my_df is still a data frame, but it is also a tibble. That’s what “tbl_df” and “tbl” mean.

5.4.3 The tribble function

Alternatively, we can use the tribble() function in place of the data.frame() function when we want to create a tibble from scratch. For example:

# Create a tibble
my_df <- tribble(
  ~name,    ~age,
  "john",   24, 
  "alexis", 44, 
  "Steph",  26,
  "Quiera", 25
)

# Print my_df to the screen
my_df
# A tibble: 4 × 2
  name     age
  <chr>  <dbl>
1 john      24
2 alexis    44
3 Steph     26
4 Quiera    25
# View the class of my_df
class(my_df)
[1] "tbl_df"     "tbl"        "data.frame"

👆Here’s what we did above:

  • We used the tribble() function to create a new tibble called my_df.

  • We used the class() function to view my_df’s class (i.e., what kind of object it is).

    • The result returned by the class() function tells us that my_df is still a data frame, but it is also a tibble. That’s what “tbl_df” and “tbl” mean.

  • There is absolutely no difference between the tibble we created above with the tibble() function and the tibble we created above with the tribble() function. The only difference between the two functions is the syntax we used to pass the column names and data values to each function.

    • When we use the tibble() function, we pass the data values to the function horizontally as vectors. This is the same syntax that the data.frame() function expects us to use.

    • When we use the tribble() function, we pass the data values to the function vertically instead. The only reason this function exists is because it can sometimes be more convenient to type in our data values this way. That’s it.

    • Remember to type a tilde (“~”) in front of your column names when using the tribble() function. For example, type ~name instead of name. That’s how R knows you’re giving it a column name instead of a data value.
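To check for yourself that the two syntaxes produce the same data, you could compare the columns directly. This is a minimal sketch that assumes the dplyr package is installed. The object names (tbl_horizontal, tbl_vertical) are made up, and identical() is a base R function we haven’t introduced yet.

```r
library(dplyr)

# Pass the values "horizontally" as vectors with tibble()
tbl_horizontal <- tibble(
  name = c("john", "alexis"),
  age  = c(24, 44)
)

# Pass the same values "vertically" with tribble()
tbl_vertical <- tribble(
  ~name,    ~age,
  "john",   24,
  "alexis", 44
)

# Compare the two tibbles column by column
identical(tbl_horizontal$name, tbl_vertical$name)
# [1] TRUE
identical(tbl_horizontal$age, tbl_vertical$age)
# [1] TRUE
```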

5.4.4 Why use tibbles

At this point, some students wonder, “If tibbles are just data frames, why use them? Why not just use the data.frame() function?” That’s a fair question. As we have said multiple times already, tibbles are enhanced. However, we don’t believe that going into detail about those enhancements is going to be useful to most of you at this point – and may even be confusing. But, we will show you one quick example that’s pretty self-explanatory.

Let’s say that we are given some data that contains four people’s age in years. We want to create a data frame from that data. However, let’s say that we also want a column in our new data frame that contains those same ages in months. Well, we could do the math ourselves. We could just multiply each age in years by 12 (for the sake of simplicity, assume that everyone’s age in years is gathered on their birthday). But, we’d rather have R do the math for us. We can do so by asking R to multiply each value of the column called age_years by 12. Take a look:

# Create a data frame using the data.frame() function
my_df <- data.frame(
  name       = c("john", "alexis", "Steph", "Quiera"),
  age_years  = c(24, 44, 26, 25),
  age_months = age_years * 12
)
Error in eval(expr, envir, enclos): object 'age_years' not found

Uh, oh! We got an error! This error says that the column age_years can’t be found. How can that be? We are clearly passing the column name age_years to the data.frame() function in the code chunk above. Unfortunately, the data.frame() function doesn’t allow us to create and refer to a column name in the same function call. So, we would need to break this task up into two steps if we wanted to use the data.frame() function. Here’s one way we could do this:

# Create a data frame using the data.frame() function
my_df <- data.frame(
  name       = c("john", "alexis", "Steph", "Quiera"),
  age_years  = c(24, 44, 26, 25)
)

# Add the age in months column to my_df
my_df <- my_df %>% mutate(age_months = age_years * 12)

# Print my_df to the screen
my_df
    name age_years age_months
1   john        24        288
2 alexis        44        528
3  Steph        26        312
4 Quiera        25        300
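Another two-step option, using only base R, is sketched below. It relies on the dollar sign notation, which we describe later in this chapter, to add the new column to an existing data frame.

```r
# Create a data frame using the data.frame() function
my_df <- data.frame(
  name      = c("john", "alexis", "Steph", "Quiera"),
  age_years = c(24, 44, 26, 25)
)

# Add the age in months column to my_df with dollar sign notation
my_df$age_months <- my_df$age_years * 12

# Print my_df to the screen
my_df
```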

Alternatively, we can use the tibble() function to get the result we want in just one step like so:

# Create a data frame using the tibble() function
my_df <- tibble(
  name       = c("john", "alexis", "Steph", "Quiera"),
  age_years  = c(24, 44, 26, 25),
  age_months = age_years * 12
)

# Print my_df to the screen
my_df
# A tibble: 4 × 3
  name   age_years age_months
  <chr>      <dbl>      <dbl>
1 john          24        288
2 alexis        44        528
3 Steph         26        312
4 Quiera        25        300

In summary, tibbles are data frames. For the most part, we will use the terms “tibble” and “data frame” interchangeably for the rest of the book. However, remember that tibbles are enhanced data frames. Therefore, there are some things that we will do with tibbles that we can’t do with basic data frames.

5.5 Missing data

As indicated in the warning box at the end of the data frames section of this chapter, all columns in our data frames have to have the same length. So what do we do when we are truly missing information in some of our observations? For example, how do we create the class data frame if we are missing Anne’s height for some reason?

In R, we represent missing data with an NA. For example:

# Create the class data frame
data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, NA) # Now we are missing Anne's height
)
  names heights
1  John      68
2 Sally      63
3  Brad      71
4  Anne      NA

Warning

Make sure you capitalize NA and don’t use any spaces or quotation marks. Also, make sure you use NA instead of writing "Missing" or something like that.

By default, R considers NA to be a logical-type value (as opposed to character or numeric). For example:

typeof(NA)
[1] "logical"

However, you can tell R to make NA a different type by using one of the more specific forms of NA. For example:

typeof(NA_character_)
[1] "character"
typeof(NA_integer_)
[1] "integer"
typeof(NA_real_)
[1] "double"

Most of the time, you won’t have to worry about doing this because R will take care of converting NA for you. What do we mean by that? Well, remember that every vector can have only one type. So, when you add an NA (logical by default) to a vector with double values as we did above (i.e., c(68, 63, 71, NA)), that would cause you to have three double values and one logical value in the same vector, which is not allowed. Therefore, R will automatically convert the NA to NA_real_ for you behind the scenes.

This is a concept known as “type coercion” and you can read more about it here if you are interested. As we said, most of the time you don’t have to worry about type coercion – it will happen automatically. But, sometimes it doesn’t and it will cause R to give you an error. We mostly encounter this when using the if_else() and case_when() functions, which we will discuss later.
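Here is a short sketch of what that automatic conversion looks like for the heights vector with Anne’s height missing. The is.na() function and the na.rm argument to mean() are base R features we haven’t introduced yet, so treat this as a preview instead of something you need to remember right now.

```r
# A vector of heights with one missing value
heights <- c(68, 63, 71, NA)

# The NA was converted so that the whole vector is still type double
typeof(heights)
# [1] "double"

# Ask R which values are missing
is.na(heights)
# [1] FALSE FALSE FALSE  TRUE

# By default, the mean of a vector that contains NA is also NA
mean(heights)
# [1] NA

# Setting na.rm = TRUE drops the missing value before calculating the mean
mean(heights, na.rm = TRUE)
# [1] 67.33333
```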

5.6 Our first analysis

Congratulations on your new R programming skills. 🎉 You can now create vectors and data frames. This is no small thing. Basically, everything else we do in this book will start with vectors and data frames.

Having said that, just creating data frames may not seem super exciting. So, let’s round out this chapter with a basic descriptive analysis of the data we simulated. Specifically, let’s find the average height of the class.

You will find that in R there are almost always many different ways to accomplish a given task. Sometimes, choosing one over another is simply a matter of preference. Other times, one method is clearly more efficient and/or accurate than another. This is a point that will come up over and over in this book. Let’s use our desire to find the mean height of the class as an example.

5.6.1 Manual calculation of the mean

For starters, we can add up all the heights and divide by the total number of heights to find the mean.

(68 + 63 + 71 + 72) / 4
[1] 68.5

👆Here’s what we did above:

  • We used the addition operator (+) to add up all the heights.

  • We used the division operator (/) to divide the sum of all the heights by 4 - the number of individual heights we added together.

  • We used parentheses to enforce the correct order of operations (i.e., make R do addition before division).

This works, but why might it not be the best approach? Well, for starters, manually typing in the heights is error prone. We can easily accidentally press the wrong key. Luckily, we already have the heights stored as a column in the class data frame. We can access or refer to a single column in a data frame using the dollar sign notation.

5.6.2 Dollar sign notation

class$heights
[1] 68 63 71 72

👆Here’s what we did above:

  • We used the dollar sign notation to access the heights column in the class data frame.

    • Dollar sign notation is just the data frame name, followed by the dollar sign, followed by the column name.
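Dollar sign notation isn’t the only way to pull a column out of a data frame. As a small sketch, base R’s double bracket notation, which we don’t otherwise use in this chapter, returns the same vector when given the column name in quotes.

```r
# Recreate the class data frame
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

# Dollar sign notation
class$heights
# [1] 68 63 71 72

# Double bracket notation with the column name in quotes
class[["heights"]]
# [1] 68 63 71 72
```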

5.6.3 Bracket notation

Further, we can use bracket notation to access each value in a vector. We think it’s easier to demonstrate bracket notation than it is to describe it. For example, we could access the third value in the heights vector like this:

# Create the heights vector
heights <- c(68, 63, 71, 72)

# Bracket notation
# Access the third element in the heights vector with bracket notation
heights[3]
[1] 71
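Bracket notation isn’t limited to one position at a time. As a brief sketch, we can also pass a vector of positions inside the brackets, or use a negative number to drop a position.

```r
# Create the heights vector
heights <- c(68, 63, 71, 72)

# Access the first and third elements at the same time
heights[c(1, 3)]
# [1] 68 71

# Access everything except the first element
heights[-1]
# [1] 63 71 72
```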

Remember that data frame columns are also vectors. So, we can combine the dollar sign notation and bracket notation to access each individual value of the heights column in the class data frame. This will help us get around the problem of typing each individual height value. For example:

# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4

# Second way. Use dollar sign notation and bracket notation so that we don't 
# have to type individual heights
(class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4
[1] 68.5

5.6.4 The sum function

The second method is better in the sense that we no longer have to worry about mistyping the heights. However, who wants to type class$heights[...] over and over? What if we had a hundred numbers? What if we had a thousand numbers? This wouldn’t work. Luckily, there is a function that adds all the numbers contained in a numeric vector – the sum() function. Let’s take a look:

# Create the heights vector
heights <- c(68, 63, 71, 72)

# Add together all the individual heights with the sum function
sum(heights)
[1] 274

Remember that data frame columns are also vectors. So, we can combine the dollar sign notation and the sum() function to add up all the individual heights in the heights column of the class data frame. It looks like this:

# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4

# Second way. Use dollar sign notation and bracket notation so that we don't 
# have to type individual heights
# (class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4

# Third way. Use dollar sign notation and sum function so that we don't have 
# to type as much
sum(class$heights) / 4
[1] 68.5

👆Here’s what we did above:

  • We passed the numeric vector heights from the class data frame to the sum() function using dollar sign notation.

  • The sum() function returned the total value of all the heights added together.

  • We divided the total value of the heights by four – the number of individual heights.
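To make the "what if we had a thousand numbers?" question concrete, here is a small sketch. The vector many_numbers is a hypothetical example (the whole numbers from 1 to 1000), not part of our class data.

```r
# A hypothetical vector of one thousand numbers: 1, 2, ..., 1000
many_numbers <- 1:1000

# sum() adds them all in a single call, no matter how long the vector is
total <- sum(many_numbers)
total
# [1] 500500
```

The call to sum() looks exactly the same whether the vector holds four values or a thousand, which is what makes this approach scale.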

5.6.5 Nesting functions

!! Before we move on, we want to point out something that is actually kind of a big deal. In the third method above, we didn’t manually add up all the individual heights - R did this calculation for us. Further, we didn’t store the sum of the individual heights somewhere and then divide that stored value by 4. Heck, we didn’t even see what the sum of the individual heights was. Instead, the returned value from the sum function (274) was used directly in the next calculation (/ 4) by R without us seeing the result. In other words, (68 + 63 + 71 + 72) / 4, 274 / 4, and sum(class$heights) / 4 are all exactly the same thing to R. However, the third method (sum(class$heights) / 4) is much more scalable (i.e., adding a lot more numbers doesn’t make this any harder to do) and much less error-prone. Just to be clear, the BIG DEAL is that we now know that the values returned by functions can be directly passed to other functions in exactly the same way as if we typed the values ourselves.

This concept, functions passing values to other functions, is known as nesting functions. It’s called nesting functions because we can put functions inside of other functions.

“But, Brad, there’s only one function in the command sum(class$heights) / 4 – the sum() function.” Really? Is there? Remember when we said that operators are also functions in R? Well, the division operator is a function. And, like all functions, it can be written with parentheses like this:

# Writing the division operator as a function with parentheses
`/`(8, 4)
[1] 2

👆Here’s what we did above:

  • We wrote the division operator in its more function-looking form.

    • Because the division operator isn’t a letter, we had to wrap it in backticks (`).

    • On most US-layout keyboards, the backtick key is in the top left corner near the escape key (esc).

    • The first argument we passed to the division function was the dividend (the number we want to divide).

    • The second argument we passed to the division function was the divisor (the number we want to divide by).

So, the following two commands mean exactly the same thing to R:

8 / 4
`/`(8, 4)

And if we use this second form of the division operator, we can clearly see that one function is nested inside another function.

`/`(sum(class$heights), 4)
[1] 68.5

👆Here’s what we did above:

  • We calculated the mean height of the class.

    • The first argument we passed to the division function was the returned value from the sum() function.

    • The second argument we passed to the division function was the divisor (4).
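Here is a short side-by-side sketch of the same calculation written three ways: with the intermediate value stored, with sum() nested in the usual operator form, and with sum() nested in the function-looking form. The object names total, mean_stored, mean_nested, and mean_function_form are hypothetical and used only for this illustration.

```r
# Create the heights vector
heights <- c(68, 63, 71, 72)

# Without nesting: store the intermediate value, then use it
total <- sum(heights)
mean_stored <- total / 4

# With nesting: the value returned by sum() goes straight into the division
mean_nested <- sum(heights) / 4

# The same nesting, with the division operator written as a function
mean_function_form <- `/`(sum(heights), 4)

# All three give the same answer
c(mean_stored, mean_nested, mean_function_form)
# [1] 68.5 68.5 68.5
```

The only difference between the first version and the other two is that the first leaves an extra object (total) behind that we may never need again.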

This is kind of mind-blowing stuff the first time you encounter it. 🤯 We wouldn’t blame you if you are feeling overwhelmed or confused. The main points to take away from this section are:

  1. Everything we do in R, we will do with functions. Even operators are functions, and they can be written in a form that looks function-like; however, we will almost never actually write them in that way.

  2. Functions can be nested. This is huge because it allows us to directly pass returned values to other functions. Nesting functions in this way allows us to do very complex operations in a scalable way and without storing a bunch of unneeded values that are created in the intermediate steps of the operation.

  3. The downside of nesting functions is that it can make our code difficult to read - especially when we nest many functions. Fortunately, we will learn to use the pipe operator (%>%) in the workflow basics part of this book. Once you get used to pipes, they will make nested functions much easier to read.

Now, let’s get back to our analysis…

5.6.6 The length function

We think most of us would agree that the third method we learned for calculating the mean height is preferable to the first two methods for most situations. However, the third method still requires us to know how many individual heights are in the heights column (i.e., 4). Luckily, there is a function that tells us how many individual values are contained in a vector – the length() function. Let’s take a look:

# Create the heights vector
heights <- c(68, 63, 71, 72)

# Return the number of individual values in heights
length(heights)
[1] 4

Remember that data frame columns are also vectors. So, we can combine the dollar sign notation and the length() function to automatically calculate the number of values in the heights column of the class data frame. It looks like this:

# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4

# Second way. Use dollar sign notation and bracket notation so that we don't 
# have to type individual heights
# (class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4

# Third way. Use dollar sign notation and sum function so that we don't have 
# to type as much
# sum(class$heights) / 4

# Fourth way. Use dollar sign notation with the sum function and the length 
# function
sum(class$heights) / length(class$heights)
[1] 68.5

👆Here’s what we did above:

  • We passed the numeric vector heights from the class data frame to the sum() function using dollar sign notation.

  • The sum() function returned the total value of all the heights added together.

  • We passed the numeric vector heights from the class data frame to the length() function using dollar sign notation.

  • The length() function returned the total number of values in the heights column.

  • We divided the total value of the heights by the total number of values in the heights column.
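To see why letting length() do the counting matters, consider a sketch in which the data changes. The fifth height (66) and the object names heights_five, wrong_mean, and right_mean are made up purely for illustration.

```r
# Hypothetical: a fifth student joins, so there are now five heights
heights_five <- c(68, 63, 71, 72, 66)

# Hard-coding the old count of 4 silently gives the wrong answer
wrong_mean <- sum(heights_five) / 4
wrong_mean
# [1] 85

# Using length() keeps the calculation correct as the data changes
right_mean <- sum(heights_five) / length(heights_five)
right_mean
# [1] 68
```

Notice that R doesn’t warn us about the first calculation. It happily divides by 4 because that is what we told it to do.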

5.6.7 The mean function

The fourth method above is definitely the best method yet. However, this need to find the mean value of a numeric vector is so common that someone had the sense to create a function that takes care of all the above steps for us – the mean() function. And as you probably saw coming, we can use the mean function like so:

# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4

# Second way. Use dollar sign notation and bracket notation so that we don't 
# have to type individual heights
# (class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4

# Third way. Use dollar sign notation and sum function so that we don't have 
# to type as much
# sum(class$heights) / 4

# Fourth way. Use dollar sign notation with the sum function and the length 
# function
# sum(class$heights) / length(class$heights)

# Fifth way. Use dollar sign notation with the mean function
mean(class$heights)
[1] 68.5

Congratulations again! You completed your first analysis using R!
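If you’d like to convince yourself that all five ways really do agree, the following sketch runs each of them and compares the results. It uses a hypothetical stand-in data frame called class_demo with the same four heights, and the object names first through fifth are only for this illustration.

```r
# class_demo is a hypothetical stand-in for the class data frame
class_demo <- data.frame(heights = c(68, 63, 71, 72))

# The five ways of calculating the mean height
first  <- (68 + 63 + 71 + 72) / 4
second <- (class_demo$heights[1] + class_demo$heights[2] +
             class_demo$heights[3] + class_demo$heights[4]) / 4
third  <- sum(class_demo$heights) / 4
fourth <- sum(class_demo$heights) / length(class_demo$heights)
fifth  <- mean(class_demo$heights)

# Check that every result equals 68.5
all(c(first, second, third, fourth, fifth) == 68.5)
# [1] TRUE
```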

5.7 Some common errors

Before we move on, we want to briefly discuss a couple of common errors that will frustrate many of you early in your R journey. You may have noticed that we went out of our way to differentiate between the heights vector and the heights column in the class data frame. As annoying as that may have been, we did it for a reason. The heights vector and the heights column in the class data frame are two separate things to the R interpreter, and you have to be very specific about which one you are referring to. To make this more concrete, let’s add a weight column to our class data frame.

class$weight <- c(160, 170, 180, 190)

👆Here’s what we did above:

  • We created a new column in our data frame – weight – using dollar sign notation.

Now, let’s find the mean weight of the students in our class.

mean(weight)
Error in eval(expr, envir, enclos): object 'weight' not found

Uh, oh! What happened? Why is R saying that weight doesn’t exist? We clearly created it above, right? Wrong. We didn’t create an object called weight in the code chunk above. We created a column called weight in the object called class in the code chunk above. Those are different things to R. If we want to get the mean of weight we have to tell R that weight is a column in class like so:

mean(class$weight)
[1] 175

A related issue can arise when you have an object and a column with the same name but different values. For example:

# An object called scores
scores <- c(5, 9, 3)

# A column in the class data frame called scores
class$scores <- c(95, 97, 93, 100)

If you ask R for the mean of scores, R will give you an answer.

mean(scores)
[1] 5.666667

However, if you wanted the mean of the scores column in the class data frame, this won’t be the correct answer. Hopefully, you already know how to get the correct answer, which is:

mean(class$scores)
[1] 96.25

Again, the scores object and the scores column of the class object are different things to R.
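When you aren’t sure whether a name refers to a standalone object or to a column, you can ask R directly. The sketch below is one possible way to check, using the base R functions exists() and names(). The data frame class_demo and the objects weight_is_object and weight_is_column are hypothetical names used only here, and the sketch assumes you are running it in a fresh session where nothing called weight has been created.

```r
# class_demo is a hypothetical stand-in for the class data frame
class_demo <- data.frame(heights = c(68, 63, 71, 72))
class_demo$weight <- c(160, 170, 180, 190)

# Is there a standalone object called weight in the current environment?
weight_is_object <- exists("weight", inherits = FALSE)

# Is there a column called weight in the data frame?
weight_is_column <- "weight" %in% names(class_demo)

weight_is_object  # FALSE, assuming no object named weight was created earlier
weight_is_column  # TRUE

# Because weight only exists as a column, we refer to it through the data frame
mean(class_demo$weight)
# [1] 175
```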

5.8 Summary

Wow! We covered a lot in this first part of the book on getting started with R and RStudio. Don’t feel bad if your head is swimming. It’s a lot to take in. However, you should feel proud of the fact that you can already do some legitimately useful things with R, namely, simulate and analyze data. In the next part of this book, we are going to discuss some tools and best practices that will make it easier and more efficient for you to write and share your R code. After that, we will move on to tackling more advanced programming and data analysis challenges.