class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)
5 Let’s Get Programming
In this chapter, we are going to tie together many of the concepts we’ve learned so far, and you are going to create your first basic R program. Specifically, you are going to write a program that simulates some data and analyzes it.
5.1 Simulating data
Data simulation can be really complicated, but it doesn’t have to be. It is simply the process of creating data as opposed to finding data in the wild. This can be really useful in several different ways.
Simulating data is really useful for getting help with a problem you are trying to solve. Often, it isn’t feasible for you to send other people the actual data set you are working on when you encounter a problem you need help with. Sometimes, it may not even be legally allowed (e.g., for privacy reasons). Instead of sending them your entire data set, you can simulate a little data set that recreates the challenge you are trying to address without all the other complexity of the full data set. As a bonus, we have often found that we end up figuring out the solution to the problem we’re trying to solve as we recreate the problem in a simulated data set that we intended to share with others.
Simulated data can also be useful for learning about and testing statistical assumptions. In epidemiology, we use statistics to draw conclusions about populations of people we are interested in based on samples of people drawn from the population. Because we don’t actually have data from all the people in the population, we have to make some assumptions about the population based on what we find in our sample. When we simulate data, we know the truth about our population because we created our population to have that truth. We can then use this simulated population to play “what if” games with our analysis. What if we only sampled half as many people? What if their heights aren’t actually normally distributed? What if we used a probit model instead of a logit model? Going through this process and answering these questions can help us understand how much, and under what circumstances, we can trust the answers we found in the real world.
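As a rough illustration of the “what if” idea, the sketch below simulates a hypothetical population of heights and then compares samples of two different sizes to the known truth. The object names, the seed, and the chosen mean and standard deviation (68 and 3) are assumptions made up for this example; set.seed(), rnorm(), and sample() are base R functions.

```r
# Hypothetical example: simulate a "population" whose true mean we chose ourselves
set.seed(123) # Make the random numbers reproducible
population <- rnorm(n = 10000, mean = 68, sd = 3)

# Draw two samples of different sizes from the simulated population
sample_100 <- sample(population, size = 100)
sample_50  <- sample(population, size = 50)

# Compare each sample mean to the population mean we built in
mean(population)
mean(sample_100)
mean(sample_50)
```

Because we created the population, we can see how far each sample mean lands from the truth, which is something we can never check directly with real-world data.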
So, let’s go ahead and write a complete R program to simulate and analyze some data. As we said, it doesn’t have to be complicated. In fact, in just a few lines of R code below we simulate and analyze some data about a hypothetical class.
class
names heights
1 John 68
2 Sally 63
3 Brad 71
4 Anne 72
mean(class$heights)
[1] 68.5
As you can see, this data frame contains the students’ names and heights. We also use the mean() function to calculate the average height of the class. By the end of this chapter, you will understand all the elements of this R code and how to simulate your own data.
5.2 Vectors
Vectors are the most fundamental data structure in R. Here, data structure means “container for our data.” There are other data structures as well; however, they are all built from vectors. That’s why we say vectors are the most fundamental data structure. Some of these other structures include matrices, lists, and data frames. In this book, we won’t use matrices or lists much at all, so you can forget about them for now. Instead, we will almost exclusively use data frames to hold and manipulate our data. However, because data frames are built from vectors, it can be useful to start by learning a little bit about them. Let’s create our first vector now.
# Create an example vector
names <- c("John", "Sally", "Brad", "Anne")
# Print contents to the screen
names
[1] "John" "Sally" "Brad" "Anne"
👆Here’s what we did above:
- We created a vector of names with the c() (short for combine) function.
  - The vector contains four values: “John”, “Sally”, “Brad”, and “Anne”.
  - All of the values are character strings (i.e., words). We know this because all of the values are wrapped with quotation marks.
  - We used double quotes above, but we could have also used single quotes. We cannot, however, mix double and single quotes within a single character string. For example, c("John', ...) won’t work.
- We assigned that vector of character strings to the word names using the <- function.
  - R now recognizes names as an object that we can do things with.
  - R programmers may refer to the names object as “the names object”, “the names vector”, or “the names variable”. For our purposes, these all mean the same thing.
- We printed the contents of the names object to the screen by typing the word “names”.
  - R returns (shows us) the four character values (“John” “Sally” “Brad” “Anne”) on the computer screen.
Try copying and pasting the code above into the RStudio console on your computer. You should notice the names vector appear in your global environment. You may also notice that the global environment pane gives you some additional information about this vector to the right of its name. Specifically, you should see chr [1:4] "John" "Sally" "Brad" "Anne". This is R telling us that names is a character vector (chr), with four values ([1:4]), and the first four values are "John" "Sally" "Brad" "Anne".
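To make the point about quotation marks concrete, here is a minimal sketch. The object names names_double and names_single are made up for this example; identical() is a base R function that checks whether two objects are exactly the same.

```r
# The same character vector written with double quotes and with single quotes
names_double <- c("John", "Sally", "Brad", "Anne")
names_single <- c('John', 'Sally', 'Brad', 'Anne')

# R stores both versions the same way
identical(names_double, names_single) # TRUE
```

Either style works, as long as each individual string opens and closes with the same kind of quote.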
5.2.1 Vector types
There are several different vector types, but each vector can have only one type. The type of the vector above was character. We can validate that with the typeof() function like so:
typeof(names)
[1] "character"
The other vector types that we will use in this book are double, integer, and logical. Double vectors hold real numbers and integer vectors hold integers. Collectively, double vectors and integer vectors are known as numeric vectors. Logical vectors can only hold the values TRUE and FALSE. Here are some examples of each:
5.2.2 Double vectors
# A numeric vector
my_numbers <- c(12.5, 13.98765, pi)
my_numbers
[1] 12.500000 13.987650 3.141593
typeof(my_numbers)
[1] "double"
5.2.3 Integer vectors
Creating integer vectors involves a weird little quirk of the R language. For some reason, and we have no idea why, we must type an “L” behind the number to make it an integer.
# An integer vector - first attempt
my_ints_1 <- c(1, 2, 3)
my_ints_1
[1] 1 2 3
typeof(my_ints_1)
[1] "double"
# An integer vector - second attempt
# Must put "L" behind the number to make it an integer. No idea why they chose "L".
my_ints_2 <- c(1L, 2L, 3L)
my_ints_2
[1] 1 2 3
typeof(my_ints_2)
[1] "integer"
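If it helps, here is a short sketch of a few other ways to check whether a vector is an integer vector. The object names are made up for this example; is.integer(), is.numeric(), and the colon operator are all part of base R.

```r
# Without the "L", whole numbers are stored as doubles
whole_dbl <- c(1, 2, 3)
# With the "L", they are stored as integers
whole_int <- c(1L, 2L, 3L)

is.integer(whole_dbl) # FALSE
is.integer(whole_int) # TRUE

# Both are still numeric vectors
is.numeric(whole_dbl) # TRUE
is.numeric(whole_int) # TRUE

# The colon shorthand also creates an integer vector
typeof(1:3)                # "integer"
identical(whole_int, 1:3)  # TRUE
```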
5.2.4 Logical vectors
# A logical vector
# Type TRUE and FALSE in all caps
my_logical <- c(TRUE, FALSE, TRUE)
my_logical
[1] TRUE FALSE TRUE
typeof(my_logical)
[1] "logical"
Rather than have an abstract discussion about the particulars of each of these vector types right now, we think it’s best to wait and learn more about them when they naturally arise in the context of a real challenge we are trying to solve with data. At this point, just having some vague idea that they exist is good enough.
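If you are curious what the “only one type per vector” rule looks like in practice, the sketch below mixes types on purpose. The object names are made up for this example. When values of different types are combined with c(), R converts them all to a single type that can represent every value.

```r
# Mixing a number, a word, and a logical value gives a character vector
mixed_chr <- c(1, "a", TRUE)
typeof(mixed_chr) # "character"

# Mixing an integer and a double gives a double vector
mixed_dbl <- c(1L, 2.5)
typeof(mixed_dbl) # "double"

# Mixing a logical value and an integer gives an integer vector
mixed_int <- c(TRUE, 2L)
typeof(mixed_int) # "integer"
```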
5.2.5 Factor vectors
Above, we said that we would only work with four vector types in this book: character, double, integer, and logical. Technically, that is true. Factors aren’t technically a vector type (we will explain below) but calling them a vector type is close enough to true for our purposes. We will briefly introduce you to factors here, and then discuss them in more depth later in the chapter on [Numerical Descriptions of Categorical Variables]. We cover them in greater depth there because factors are most useful in the context of working with categorical data – data that is grouped into discrete categories. Some examples of categorical variables commonly seen in public health data are sex, race or ethnicity, and level of educational attainment.
In R, we can represent a categorical variable in multiple different ways. For example, let’s say that we are interested in recording people’s highest level of formal education completed in our data. The discrete categories we are interested in are:
1 = Less than high school
2 = High school graduate
3 = Some college
4 = College graduate
We could then create a numeric vector to record the level of educational attainment for four hypothetical people as shown below.
# A numeric vector of education categories
education_num <- c(3, 1, 4, 1)
education_num
[1] 3 1 4 1
But what is less-than-ideal about storing our categorical data this way? Well, it isn’t obvious what the numbers in education_num mean. For the purposes of this example, we defined them above, but if we didn’t have that information then we would likely have no idea what categories the numbers represent.
We could also create a character vector to record the level of educational attainment for four hypothetical people as shown below.
# A character vector of education categories
education_chr <- c(
  "Some college", "Less than high school", "College graduate",
  "Less than high school"
)
education_chr
[1] "Some college" "Less than high school" "College graduate"
[4] "Less than high school"
But this strategy also has a few limitations that we will discuss in the chapter on [Numerical Descriptions of Categorical Variables]. For now, we just need to quickly learn how to create and identify factor vectors.
Typically, we don’t create factors from scratch. Instead, we convert (or “coerce”) an existing numeric or character vector into a factor. For example, we can coerce education_num to a factor like this:
# Coerce education_num to a factor
education_num_f <- factor(
  x      = education_num,
  levels = 1:4,
  labels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate"
  )
)
education_num_f
[1] Some college Less than high school College graduate
[4] Less than high school
4 Levels: Less than high school High school graduate ... College graduate
👆 Here’s what we did above:
- We used the factor() function to create a new factor version of education_num.
  - You can type ?factor into your R console to view the help documentation for this function and follow along with the explanation below.
  - The first argument to the factor() function is the x argument. The value passed to the x argument should be a vector of data. We passed the education_num vector to the x argument.
  - The second argument to the factor() function is the levels argument. This argument tells R the unique values that the new factor variable can take. We used the shorthand 1:4 to tell R that education_num_f can take the unique values 1, 2, 3, or 4.
  - The third argument to the factor() function is the labels argument. The value passed to the labels argument should be a character vector of labels (i.e., descriptive text) for each value in the levels argument. The order of the labels in the character vector we pass to the labels argument should match the order of the values passed to the levels argument. For example, the ordering of levels and labels above tells R that 1 should be labeled with “Less than high school”, 2 should be labeled with “High school graduate”, etc.
- We used the assignment operator (<-) to save our new factor vector in our global environment as education_num_f.
  - If we had used the name education_num instead, then the previous values in the education_num vector would have been replaced with the new values. That is sometimes what we want to happen. However, when it comes to creating factors, we typically keep the numeric version of the vector and create an additional factor version of the vector. We just often find that it can be useful to have both versions of the variable hanging around during the analysis process.
  - We also use the _f naming convention in our code. That means that when we create a new factor vector, we name it the same thing the original vector was named with the addition of _f (for factor) at the end.
- We printed the vector to the screen. The values in education_num_f look similar to the character strings displayed in education_chr. Notice, however, that the values no longer have quotes around them and R displays 4 Levels: Less than high school High school graduate ... College graduate below the data values. This is R telling us the possible categorical values that this factor could take on. This is a telltale sign that the vector being printed to the screen is a factor.
Interestingly, although R uses labels to make factors look like character vectors, they are still integer vectors under the hood. For example:
typeof(education_num_f)
[1] "integer"
And we can still view them as such.
as.numeric(education_num_f)
[1] 3 1 4 1
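The sketch below pulls these pieces together so that you can inspect the two “layers” of a factor side by side. It rebuilds education_num_f from scratch so that it runs on its own; levels(), as.integer(), as.character(), and is.factor() are base R functions.

```r
# Rebuild the factor from the example above
education_num <- c(3, 1, 4, 1)
education_num_f <- factor(
  x      = education_num,
  levels = 1:4,
  labels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate"
  )
)

# Confirm that this is a factor
is.factor(education_num_f)

# The labels R shows us
levels(education_num_f)

# The integer codes R stores under the hood
as.integer(education_num_f)

# The labels for each person, as an ordinary character vector
as.character(education_num_f)
```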
It is also possible to coerce character vectors to factors. For example, we can coerce education_chr to a factor like so:
# Coerce education_chr to a factor
education_chr_f <- factor(
  x      = education_chr,
  levels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate"
  )
)
education_chr_f
[1] Some college Less than high school College graduate
[4] Less than high school
4 Levels: Less than high school High school graduate ... College graduate
👆 Here’s what we did above:
- We coerced a character vector (education_chr) to a factor using the factor() function.
  - Because the levels are character strings, there was no need to pass any values to the labels argument this time. Keep in mind, though, that the order of the values passed to the levels argument matters. It will be the order that the factor levels will be displayed in our analyses.
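Here is a small sketch of two consequences of the levels argument. The names edu_levels and typo_f are made up for this example; table() is a base R function that counts how many times each value occurs.

```r
education_chr <- c(
  "Some college", "Less than high school", "College graduate",
  "Less than high school"
)
edu_levels <- c(
  "Less than high school", "High school graduate", "Some college",
  "College graduate"
)
education_chr_f <- factor(x = education_chr, levels = edu_levels)

# Counts are shown in the order of the levels, and levels that nobody
# selected ("High school graduate") are still listed with a count of 0
table(education_chr_f)

# A value that doesn't exactly match one of the levels becomes NA.
# Here, the second value starts with a lowercase "s".
typo_f <- factor(x = c("Some college", "some college"), levels = edu_levels)
typo_f
```

The second result is a reminder that the strings passed to levels have to match the data exactly, including capitalization.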
You might reasonably wonder why we would want to convert character vectors to factors, but we will save that discussion for the chapter on [Numerical Descriptions of Categorical Variables].
5.3 Data frames
Vectors are useful for storing a single characteristic where all the data is of the same type. However, in epidemiology, we typically want to store information about many different characteristics of whatever we happen to be studying. For example, we didn’t just want the names of the people in our class, we also wanted the heights. Of course, we can also store the heights in a vector like so:
heights <- c(68, 63, 71, 72)
heights
[1] 68 63 71 72
But this vector, in and of itself, doesn’t tell us which height goes with which person. When we want to create relationships between our vectors, we can use them to build a data frame. For example:
# Create a vector of names
names <- c("John", "Sally", "Brad", "Anne")
# Create a vector of heights
heights <- c(68, 63, 71, 72)
# Combine them into a data frame
class <- data.frame(names, heights)
# Print the data frame to the screen
class
names heights
1 John 68
2 Sally 63
3 Brad 71
4 Anne 72
👆Here’s what we did above:
- We created a data frame with the data.frame() function.
  - The first argument we passed to the data.frame() function was a vector of names that we previously created.
  - The second argument we passed to the data.frame() function was a vector of heights that we previously created.
- We assigned that data frame to the word class using the <- function.
  - R now recognizes class as an object that we can do things with.
  - R programmers may refer to this class object as “the class object” or “the class data frame”. For our purposes, these all mean the same thing. We could also call it a data set, but that term isn’t used much in R circles.
- We printed the contents of the class object to the screen by typing the word “class”.
  - R returns (shows us) the data frame on the computer screen.
Try copying and pasting the code above into the RStudio console on your computer. You should notice the class data frame appear in your global environment. You may also notice that the global environment pane gives you some additional information about this data frame to the right of its name. Specifically, you should see 4 obs. of 2 variables. This is R telling us that class has four rows or observations (4 obs.) and two columns or variables (2 variables). If you click the little blue arrow to the left of the data frame’s name, you will see information about the individual vectors that make up the data frame.
As a shortcut, instead of creating individual vectors and then combining them into a data frame as we’ve done above, most R programmers will create the vectors (columns) directly inside of the data frame function like this:
# Create the class data frame
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
# Closing parenthesis down here.
)
# Print the data frame to the screen
class
names heights
1 John 68
2 Sally 63
3 Brad 71
4 Anne 72
As you can see, both methods produce the exact same result. The second method, however, requires a little less typing and results in fewer objects cluttering up your global environment. What we mean by that is that the names and heights vectors won’t exist independently in your global environment. Rather, they will only exist as columns of the class data frame.
You may have also noticed that when we created the names and heights vectors (columns) directly inside of the data.frame() function we used the equal sign (=) instead of the assignment arrow (<-). This is just one of those quirky R exceptions we talked about in the chapter on speaking R’s language. At the top level of our code, = and <- both assign values, and it is only by convention that we usually use <- there. Inside a function call, however, = pairs a value with an argument name, which is how data.frame() knows what to call each column, so = is the one to use there. We don’t love having two symbols for such similar jobs. If it were up to us, we would just pick = or <- and use it in all cases where we want to assign values. But, it isn’t up to us, and we gave up on trying to fight it a long time ago. Your R programming life will be easier if you just learn to assign values this way – even if it’s dumb. 🤷
By definition, all columns in a data frame must have the same length (i.e., number of rows). That means that each vector you create when building your data frame must have the same number of values in it. For example, the class data frame above has four names and four heights. If we had only entered three heights, we would have gotten the following error: Error in data.frame(names = c("John", "Sally", "Brad", "Anne"), heights = c(68, : arguments imply differing number of rows: 4, 3
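If you would like to see this error without stopping your script, one option is to wrap the call in tryCatch(), a base R function that lets us capture an error message as a value. This is only a sketch; the object names result and recycled are made up for this example.

```r
# Try to build a data frame with four names but only three heights,
# and capture the error message instead of halting
result <- tryCatch(
  data.frame(
    names   = c("John", "Sally", "Brad", "Anne"),
    heights = c(68, 63, 71)
  ),
  error = function(e) conditionMessage(e)
)
result

# One caveat: data.frame() will silently repeat ("recycle") a shorter vector
# when its length divides evenly into the longer one, so no error occurs here
recycled <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63)
)
recycled
```

The recycling behavior is easy to miss, so it is worth double-checking that every column has the number of values you intended.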
5.4 Tibbles
Tibbles are a data structure that comes from another tidyverse package – the tibble package. Tibbles are data frames and serve the same purpose in R that data frames serve; however, they are enhanced in several ways. 💪 You are welcome to look over the tibble documentation or the tibbles chapter in R for Data Science if you are interested in learning about all the differences between tibbles and data frames. For our purposes, there are really only a couple things we want you to know about tibbles right now.
First, tibbles are a part of the tibble package – NOT base R. Therefore, we have to install and load either the tibble package or the dplyr package (which loads the tibble package for us behind the scenes) before we can create tibbles. We typically just load the dplyr package.
# Install the dplyr package. YOU ONLY NEED TO DO THIS ONE TIME.
install.packages("dplyr")
# Load the dplyr package. YOU NEED TO DO THIS EVERY TIME YOU START A NEW R SESSION.
library(dplyr)
Second, we can create tibbles using one of three functions: as_tibble(), tibble(), or tribble(). We’ll show you some examples shortly.
Third, try not to be confused by the terminology. Remember, tibbles are data frames. They are just enhanced data frames.
5.4.1 The as_tibble function
We use the as_tibble() function to turn an already existing basic data frame into a tibble. For example:
# Create a data frame
my_df <- data.frame(
  name = c("john", "alexis", "Steph", "Quiera"),
  age  = c(24, 44, 26, 25)
)
# Print my_df to the screen
my_df
name age
1 john 24
2 alexis 44
3 Steph 26
4 Quiera 25
# View the class of my_df
class(my_df)
[1] "data.frame"
👆Here’s what we did above:
- We used the data.frame() function to create a new data frame called my_df.
- We used the class() function to view my_df’s class (i.e., what kind of object it is).
  - The result returned by the class() function tells us that my_df is a data frame.
# Use as_tibble() to turn my_df into a tibble
my_df <- as_tibble(my_df)
# Print my_df to the screen
my_df
# A tibble: 4 × 2
name age
<chr> <dbl>
1 john 24
2 alexis 44
3 Steph 26
4 Quiera 25
# View the class of my_df
class(my_df)
[1] "tbl_df" "tbl" "data.frame"
👆Here’s what we did above:
- We used the as_tibble() function to turn my_df into a tibble.
- We used the class() function to view my_df’s class (i.e., what kind of object it is).
  - The result returned by the class() function tells us that my_df is still a data frame, but it is also a tibble. That’s what “tbl_df” and “tbl” mean.
5.4.2 The tibble function
We can use the tibble() function in place of the data.frame() function when we want to create a tibble from scratch. For example:
# Create a data frame
my_df <- tibble(
  name = c("john", "alexis", "Steph", "Quiera"),
  age  = c(24, 44, 26, 25)
)
# Print my_df to the screen
my_df
# A tibble: 4 × 2
name age
<chr> <dbl>
1 john 24
2 alexis 44
3 Steph 26
4 Quiera 25
# View the class of my_df
class(my_df)
[1] "tbl_df" "tbl" "data.frame"
👆Here’s what we did above:
- We used the tibble() function to create a new tibble called my_df.
- We used the class() function to view my_df’s class (i.e., what kind of object it is).
  - The result returned by the class() function tells us that my_df is still a data frame, but it is also a tibble. That’s what “tbl_df” and “tbl” mean.
5.4.3 The tribble function
Alternatively, we can use the tribble() function in place of the data.frame() function when we want to create a tibble from scratch. For example:
# Create a data frame
my_df <- tribble(
  ~name,    ~age,
  "john",   24,
  "alexis", 44,
  "Steph",  26,
  "Quiera", 25
)
# Print my_df to the screen
my_df
# A tibble: 4 × 2
name age
<chr> <dbl>
1 john 24
2 alexis 44
3 Steph 26
4 Quiera 25
# View the class of my_df
class(my_df)
[1] "tbl_df" "tbl" "data.frame"
👆Here’s what we did above:
- We used the tribble() function to create a new tibble called my_df.
- We used the class() function to view my_df’s class (i.e., what kind of object it is).
  - The result returned by the class() function tells us that my_df is still a data frame, but it is also a tibble. That’s what “tbl_df” and “tbl” mean.
- There is absolutely no difference between the tibble we created above with the tibble() function and the tibble we created above with the tribble() function. The only difference between the two functions is the syntax we used to pass the column names and data values to each function.
  - When we use the tibble() function, we pass the data values to the function horizontally as vectors. This is the same syntax that the data.frame() function expects us to use.
  - When we use the tribble() function, we pass the data values to the function vertically instead. The only reason this function exists is because it can sometimes be more convenient to type in our data values this way. That’s it.
  - Remember to type a tilde (“~”) in front of your column names when using the tribble() function. For example, type ~name instead of name. That’s how R knows you’re giving it a column name instead of a data value.
5.4.4 Why use tibbles
At this point, some students wonder, “If tibbles are just data frames, why use them? Why not just use the data.frame() function?” That’s a fair question. As we have said multiple times already, tibbles are enhanced. However, we don’t believe that going into detail about those enhancements is going to be useful to most of you at this point – and may even be confusing. But, we will show you one quick example that’s pretty self-explanatory.
Let’s say that we are given some data that contains four people’s age in years. We want to create a data frame from that data. However, let’s say that we also want a column in our new data frame that contains those same ages in months. Well, we could do the math ourselves. We could just multiply each age in years by 12 (for the sake of simplicity, assume that everyone’s age in years is gathered on their birthday). But, we’d rather have R do the math for us. We can do so by asking R to multiply each value of the column called age_years by 12. Take a look:
# Create a data frame using the data.frame() function
my_df <- data.frame(
  name       = c("john", "alexis", "Steph", "Quiera"),
  age_years  = c(24, 44, 26, 25),
  age_months = age_years * 12
)
Error in eval(expr, envir, enclos): object 'age_years' not found
Uh, oh! We got an error! This error says that the column age_years can’t be found. How can that be? We are clearly passing the column name age_years to the data.frame() function in the code chunk above. Unfortunately, the data.frame() function doesn’t allow us to create and refer to a column name in the same function call. So, we would need to break this task up into two steps if we wanted to use the data.frame() function. Here’s one way we could do this:
# Create a data frame using the data.frame() function
my_df <- data.frame(
  name      = c("john", "alexis", "Steph", "Quiera"),
  age_years = c(24, 44, 26, 25)
)
# Add the age in months column to my_df
my_df <- my_df %>% mutate(age_months = age_years * 12)
# Print my_df to the screen
my_df
name age_years age_months
1 john 24 288
2 alexis 44 528
3 Steph 26 312
4 Quiera 25 300
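For comparison, here is a sketch of the same two-step approach using only base R, with no packages loaded. It is just one of several possible ways to add a column to an existing data frame.

```r
# Step 1: create the data frame without the age_months column
my_df <- data.frame(
  name      = c("john", "alexis", "Steph", "Quiera"),
  age_years = c(24, 44, 26, 25)
)

# Step 2: add the age_months column. By now, my_df$age_years exists,
# so we can refer to it.
my_df$age_months <- my_df$age_years * 12

# Print my_df to the screen
my_df
```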
Alternatively, we can use the tibble() function to get the result we want in just one step like so:
# Create a data frame using the tibble() function
my_df <- tibble(
  name       = c("john", "alexis", "Steph", "Quiera"),
  age_years  = c(24, 44, 26, 25),
  age_months = age_years * 12
)
# Print my_df to the screen
my_df
# A tibble: 4 × 3
name age_years age_months
<chr> <dbl> <dbl>
1 john 24 288
2 alexis 44 528
3 Steph 26 312
4 Quiera 25 300
In summary, tibbles are data frames. For the most part, we will use the terms “tibble” and “data frame” interchangeably for the rest of the book. However, remember that tibbles are enhanced data frames. Therefore, there are some things that we will do with tibbles that we can’t do with basic data frames.
5.5 Missing data
As indicated in the warning box at the end of the data frames section of this chapter, all columns in our data frames have to have the same length. So what do we do when we are truly missing information in some of our observations? For example, how do we create the class data frame if we are missing Anne’s height for some reason?
In R, we represent missing data with an NA. For example:
# Create the class data frame
data.frame(
names = c("John", "Sally", "Brad", "Anne"),
heights = c(68, 63, 71, NA) # Now we are missing Anne's height
)
names heights
1 John 68
2 Sally 63
3 Brad 71
4 Anne NA
Make sure you capitalize NA and don’t use any spaces or quotation marks. Also, make sure you use NA instead of writing "Missing" or something like that.
By default, R considers NA to be a logical-type value (as opposed to character or numeric). For example:
typeof(NA)
[1] "logical"
However, you can tell R to make NA a different type by using one of the more specific forms of NA. For example:
typeof(NA_character_)
[1] "character"
typeof(NA_integer_)
[1] "integer"
typeof(NA_real_)
[1] "double"
Most of the time, you won’t have to worry about doing this because R will take care of converting NA for you. What do we mean by that? Well, remember that every vector can have only one type. So, when you add an NA (logical by default) to a vector with double values as we did above (i.e., c(68, 63, 71, NA)), that would cause you to have three double values and one logical value in the same vector, which is not allowed. Therefore, R will automatically convert the NA to NA_real_ for you behind the scenes.
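We can check this for ourselves with a quick sketch. The object name heights_na is made up for this example; is.na() is a base R function that returns TRUE for each missing value.

```r
# A double vector with one missing value
heights_na <- c(68, 63, 71, NA)

# The vector is still a double vector, so the NA was converted for us
typeof(heights_na) # "double"

# is.na() tells us which values are missing
is.na(heights_na) # FALSE FALSE FALSE TRUE

# The same thing happens when NA is added to a character vector
typeof(c("John", "Sally", NA)) # "character"
```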
This is a concept known as “type coercion” and you can read more about it here if you are interested. As we said, most of the time you don’t have to worry about type coercion – it will happen automatically. But, sometimes it doesn’t and it will cause R to give you an error. We mostly encounter this when using the if_else() and case_when() functions, which we will discuss later.
5.6 Our first analysis
Congratulations on your new R programming skills. 🎉 You can now create vectors and data frames. This is no small thing. Basically, everything else we do in this book will start with vectors and data frames.
Having said that, just creating data frames may not seem super exciting. So, let’s round out this chapter with a basic descriptive analysis of the data we simulated. Specifically, let’s find the average height of the class.
You will find that in R there are almost always many different ways to accomplish a given task. Sometimes, choosing one over another is simply a matter of preference. Other times, one method is clearly more efficient and/or accurate than another. This is a point that will come up over and over in this book. Let’s use our desire to find the mean height of the class as an example.
5.6.1 Manual calculation of the mean
For starters, we can add up all the heights and divide by the total number of heights to find the mean.
(68 + 63 + 71 + 72) / 4
[1] 68.5
👆Here’s what we did above:
We used the addition operator (+) to add up all the heights.
We used the division operator (/) to divide the sum of all the heights by 4 - the number of individual heights we added together.
We used parentheses to enforce the correct order of operations (i.e., make R do addition before division).
This works, but why might it not be the best approach? Well, for starters, manually typing in the heights is error prone. We can easily accidentally press the wrong key. Luckily, we already have the heights stored as a column in the class data frame. We can access or refer to a single column in a data frame using the dollar sign notation.
5.6.2 Dollar sign notation
class$heights
[1] 68 63 71 72
👆Here’s what we did above:
- We used the dollar sign notation to access the heights column in the class data frame.
  - Dollar sign notation is just the data frame name, followed by the dollar sign, followed by the column name.
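As a side note, base R also lets us pull out a column with double square brackets and the column name in quotes. The sketch below rebuilds the class data frame so that it runs on its own and shows that both forms return the same vector.

```r
# Rebuild the class data frame
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

# Dollar sign notation
class$heights

# Double bracket notation returns the same vector
class[["heights"]]

identical(class$heights, class[["heights"]]) # TRUE
```

We will stick with dollar sign notation in this chapter because it requires less typing.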
5.6.3 Bracket notation
Further, we can use bracket notation to access each value in a vector. We think it’s easier to demonstrate bracket notation than it is to describe it. For example, we could access the third value in the heights vector like this:
# Create the heights vector
heights <- c(68, 63, 71, 72)

# Bracket notation
# Access the third element in the heights vector with bracket notation
heights[3]
[1] 71
Remember that data frame columns are also vectors. So, we can combine the dollar sign notation and bracket notation to access each individual value of the heights column in the class data frame. This will help us get around the problem of typing each individual height value. For example:
# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4
# Second way. Use dollar sign notation and bracket notation so that we don't
# have to type individual heights
(class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4
[1] 68.5
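Bracket notation isn’t limited to one value at a time. The sketch below shows a few variations that are all part of base R. The specific positions we ask for are arbitrary choices for this example.

```r
# Create the heights vector
heights <- c(68, 63, 71, 72)

# Access the first and third values by passing a vector of positions
heights[c(1, 3)] # 68 71

# Access the second through fourth values with the colon shorthand
heights[2:4] # 63 71 72

# A negative position means "everything except this position"
heights[-1] # 63 71 72
```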
5.6.4 The sum function
The second method is better in the sense that we no longer have to worry about mistyping the heights. However, who wants to type class$heights[...] over and over? What if we had a hundred numbers? What if we had a thousand numbers? This wouldn’t work. Luckily, there is a function that adds all the numbers contained in a numeric vector – the sum() function. Let’s take a look:
# Create the heights vector
heights <- c(68, 63, 71, 72)
# Add together all the individual heights with the sum function
sum(heights)
[1] 274
Remember that data frame columns are also vectors. So, we can combine the dollar sign notation and the sum() function to add up all the individual heights in the heights column of the class data frame. It looks like this:
# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4
# Second way. Use dollar sign notation and bracket notation so that we don't
# have to type individual heights
# (class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4
# Third way. Use dollar sign notation and sum function so that we don't have
# to type as much
sum(class$heights) / 4
[1] 68.5
👆Here’s what we did above:
- We passed the numeric vector heights from the class data frame to the sum() function using dollar sign notation.
- The sum() function returned the total value of all the heights added together.
- We divided the total value of the heights by four – the number of individual heights.
5.6.5 Nesting functions
Before we move on, we want to point out something that is actually kind of a big deal. In the third method above, we didn’t manually add up all the individual heights - R did this calculation for us. Further, we didn’t store the sum of the individual heights somewhere and then divide that stored value by 4. Heck, we didn’t even see what the sum of the individual heights was. Instead, the returned value from the sum() function (274) was used directly in the next calculation (/ 4) by R without us seeing the result. In other words, (68 + 63 + 71 + 72) / 4, 274 / 4, and sum(class$heights) / 4 all produce exactly the same result in R. However, the third method (sum(class$heights) / 4) is much more scalable (i.e., adding a lot more numbers doesn’t make this any harder to do) and much less error prone. Just to be clear, the BIG DEAL is that we now know that the values returned by functions can be directly passed to other functions in exactly the same way as if we typed the values ourselves.
This concept, functions passing values to other functions, is known as nesting functions. It’s called nesting functions because we can put functions inside of other functions.
“But, Brad, there’s only one function in the command sum(class$heights) / 4 – the sum() function.” Really? Is there? Remember when we said that operators are also functions in R? Well, the division operator is a function. And, like all functions, it can be written with parentheses like this:
# Writing the division operator as a function with parentheses
`/`(8, 4)
[1] 2
👆Here’s what we did above:
- We wrote the division operator in its more function-looking form.
- Because the division operator isn’t a letter, we had to wrap it in backticks (`).
- The backtick key is on the top left corner of your keyboard near the escape key (esc).
- The first argument we passed to the division function was the dividend (the number we want to divide).
- The second argument we passed to the division function was the divisor (the number we want to divide by).
So, the following two commands mean exactly the same thing to R:
8 / 4
`/`(8, 4)
And if we use this second form of the division operator, we can clearly see that one function is nested inside another function.
`/`(sum(class$heights), 4)
[1] 68.5
👆Here’s what we did above:
- We calculated the mean height of the class.
- The first argument we passed to the division function was the returned value from the sum() function.
- The second argument we passed to the division function was the divisor (4).
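To make the idea of nesting concrete, here is a minimal sketch that compares a two-step calculation, where the sum is stored first, with the nested form. The object names total_height, mean_two_step, and mean_nested are made up for this example, and the class data frame is recreated so the code runs on its own.

```r
# Recreate the class data frame so this example is self-contained
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

# Two-step version: store the sum, then divide the stored value
total_height  <- sum(class$heights)
mean_two_step <- total_height / 4

# Nested version: the value returned by sum() is passed directly to `/`
mean_nested <- `/`(sum(class$heights), 4)

# Both approaches give the same answer
identical(mean_two_step, mean_nested)
```

The nested version never creates total_height, which is the point: the intermediate value is used and then discarded.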
This is kind of mind-blowing stuff the first time you encounter it. 🤯 We wouldn’t blame you if you are feeling overwhelmed or confused. The main points to take away from this section are:
- Everything we do in R, we will do with functions. Even operators are functions, and they can be written in a form that looks function-like; however, we will almost never actually write them in that way.
- Functions can be nested. This is huge because it allows us to directly pass returned values to other functions. Nesting functions in this way allows us to do very complex operations in a scalable way and without storing a bunch of unneeded values that are created in the intermediate steps of the operation.
- The downside of nesting functions is that it can make our code difficult to read - especially when we nest many functions. Fortunately, we will learn to use the pipe operator (%>%) in the workflow basics part of this book. Once you get used to pipes, they will make nested functions much easier to read.
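As a small preview, here is a rough sketch of how the same calculation reads when nested and when piped. It rests on a few assumptions: it uses base R's native pipe (|>, available in R 4.1 and later) rather than %>% so that no package needs to be loaded, and sqrt() and round() are arbitrary functions chosen only to create a few levels of nesting.

```r
# Create the heights vector
heights <- c(68, 63, 71, 72)

# Nested form: read from the inside out (sum, then sqrt, then round)
nested_result <- round(sqrt(sum(heights)), 1)

# Piped form: read from left to right (sum, then sqrt, then round)
# Assumes R >= 4.1 for the native pipe
piped_result <- heights |> sum() |> sqrt() |> round(1)

# Both forms return the same value
identical(nested_result, piped_result)
```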
Now, let’s get back to our analysis…
5.6.6 The length function
We think most of us would agree that the third method we learned for calculating the mean height is preferable to the first two methods for most situations. However, the third method still requires us to know how many individual heights are in the heights column (i.e., 4). Luckily, there is a function that tells us how many individual values are contained in a vector – the length() function. Let’s take a look:
# Create the heights vector
heights <- c(68, 63, 71, 72)
# Return the number of individual values in heights
length(heights)
[1] 4
Remember that data frame columns are also vectors. So, we can combine dollar sign notation and the length() function to automatically calculate the number of values in the heights column of the class data frame. It looks like this:
# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4
# Second way. Use dollar sign notation and bracket notation so that we don't
# have to type individual heights
# (class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4
# Third way. Use dollar sign notation and sum function so that we don't have
# to type as much
# sum(class$heights) / 4
# Fourth way. Use dollar sign notation with the sum function and the length
# function
sum(class$heights) / length(class$heights)
[1] 68.5
👆Here’s what we did above:
- We passed the numeric vector heights from the class data frame to the sum() function using dollar sign notation.
- The sum() function returned the total value of all the heights added together.
- We passed the numeric vector heights from the class data frame to the length() function using dollar sign notation.
- The length() function returned the total number of values in the heights column.
- We divided the total value of the heights by the total number of values in the heights column.
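To see why length() matters, consider a quick sketch of what happens when the data change. The fifth height (66) and the object names below are hypothetical, added only for illustration.

```r
# Original heights plus one hypothetical new student
heights      <- c(68, 63, 71, 72)
heights_five <- c(heights, 66)

# Hard-coding the count gives the wrong answer once a value is added
mean_hard_coded <- sum(heights_five) / 4

# Using length() keeps the divisor in sync with the data
mean_with_length <- sum(heights_five) / length(heights_five)

print(mean_hard_coded)
print(mean_with_length)
```

The hard-coded version silently divides by the old count, while the length() version adjusts on its own.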
5.6.7 The mean function
The fourth method above is definitely the best method yet. However, this need to find the mean value of a numeric vector is so common that someone had the sense to create a function that takes care of all the above steps for us – the mean() function. And as you probably saw coming, we can use the mean() function like so:
# First way to calculate the mean
# (68 + 63 + 71 + 72) / 4
# Second way. Use dollar sign notation and bracket notation so that we don't
# have to type individual heights
# (class$heights[1] + class$heights[2] + class$heights[3] + class$heights[4]) / 4
# Third way. Use dollar sign notation and sum function so that we don't have
# to type as much
# sum(class$heights) / 4
# Fourth way. Use dollar sign notation with the sum function and the length
# function
# sum(class$heights) / length(class$heights)
# Fifth way. Use dollar sign notation with the mean function
mean(class$heights)
[1] 68.5
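If you would like to convince yourself that the fourth and fifth ways agree, a minimal check might look like the following. The object names fourth_way and fifth_way are made up for this example, and the class data frame is recreated so the code runs on its own.

```r
# Recreate the class data frame so this example is self-contained
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

# Fourth way: sum divided by length
fourth_way <- sum(class$heights) / length(class$heights)

# Fifth way: the mean function
fifth_way <- mean(class$heights)

# all.equal() compares numbers while allowing for tiny rounding differences
all.equal(fourth_way, fifth_way)
```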
Congratulations again! You completed your first analysis using R!
5.7 Some common errors
Before we move on, we want to briefly discuss a couple of common errors that will frustrate many of you early in your R journey. You may have noticed that we went out of our way to differentiate between the heights vector and the heights column in the class data frame. As annoying as that may have been, we did it for a reason. The heights vector and the heights column in the class data frame are two separate things to the R interpreter, and you have to be very specific about which one you are referring to. To make this more concrete, let’s add a weight column to our class data frame.
class$weight <- c(160, 170, 180, 190)
👆Here’s what we did above:
- We created a new column in our data frame – weight – using dollar sign notation.
Now, let’s find the mean weight of the students in our class.
mean(weight)
Error in eval(expr, envir, enclos): object 'weight' not found
Uh, oh! What happened? Why is R saying that weight doesn’t exist? We clearly created it above, right? Wrong. We didn’t create an object called weight in the code chunk above. We created a column called weight in the object called class. Those are different things to R. If we want to get the mean of weight, we have to tell R that weight is a column in class like so:
mean(class$weight)
[1] 175
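When you run into this error, one way to check where a name actually lives is sketched below. It uses the base R functions names() and exists(), which are not covered in this chapter, and it assumes a fresh R session in which no standalone weight object has been created. The names weight_is_column and weight_is_object are made up for this example.

```r
# Recreate the class data frame with the weight column
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)
class$weight <- c(160, 170, 180, 190)

# Is there a column called weight in the class data frame?
weight_is_column <- "weight" %in% names(class)

# Is there a standalone object called weight in the current environment?
# (Assumes a fresh session where no such object was created.)
weight_is_object <- exists("weight", inherits = FALSE)

print(weight_is_column)
print(weight_is_object)
```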
A related issue can arise when you have an object and a column with the same name but different values. For example:
# An object called scores
scores <- c(5, 9, 3)
# A column in the class data frame called scores
class$scores <- c(95, 97, 93, 100)
If you ask R for the mean of scores, R will give you an answer.
mean(scores)
[1] 5.666667
However, if you wanted the mean of the scores column in the class data frame, this won’t be the correct answer. Hopefully, you already know how to get the correct answer, which is:
mean(class$scores)
[1] 96.25
Again, the scores object and the scores column of the class object are different things to R.
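Because R will not warn you in this situation, a quick sanity check can help. The sketch below compares the number of values in each scores before trusting the mean; the names mean_object and mean_column are made up for this example, and the class data frame is recreated so the code runs on its own.

```r
# Recreate the class data frame so this example is self-contained
class <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

# A standalone object and a column that share the name scores
scores       <- c(5, 9, 3)
class$scores <- c(95, 97, 93, 100)

# The two hold a different number of values, a hint that they are not the same
length(scores)
length(class$scores)

# Each mean is valid, but they answer different questions
mean_object <- mean(scores)
mean_column <- mean(class$scores)
```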
5.8 Summary
Wow! We covered a lot in this first part of the book on getting started with R and RStudio. Don’t feel bad if your head is swimming. It’s a lot to take in. However, you should feel proud of the fact that you can already do some legitimately useful things with R. Namely, you can simulate and analyze data. In the next part of this book, we are going to discuss some tools and best practices that will make it easier and more efficient for you to write and share your R code. After that, we will move on to tackling more advanced programming and data analysis challenges.