Up until now, we have only ever looked at the overall mean of a continuous variable. For example, the mean height for the entire class. However, we often want to estimate the means within levels, or categories, of another variable. For example, we may want to look at the mean height within gender. Said another way, we want to know the mean height for men and separately the mean height for women.
More generally, in this lesson you will learn to perform bivariate analysis when the outcome is continuous and the predictor is categorical.
Typically in a situation such as this, all we need to do is apply the analytic methods we’ve already learned for a single continuous outcome, but apply them separately within levels of our categorical predictor variable. Below, we’ll walk through doing so with R. To start with, we will again use our previously collected class survey data.
# Load the Tidyverse package
library(tidyverse)
class <- tibble(
age = c(32, 30, 32, 29, 24, 38, 25, 24, 48, 29, 22, 29, 24, 28, 24, 25,
25, 22, 25, 24, 25, 24, 23, 24, 31, 24, 29, 24, 22, 23, 26, 23,
24, 25, 24, 33, 27, 25, 26, 26, 26, 26, 26, 27, 24, 43, 25, 24,
27, 28, 29, 24, 26, 28, 25, 24, 26, 24, 26, 31, 24, 26, 31, 34,
26, 25, 27, NA),
age_group = c(2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2,
2, 1, 1, 1, NA),
gender = c(2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1,
1, 1, 2, 1, NA),
ht_in = c(70, 63, 62, 67, 67, 58, 64, 69, 65, 68, 63, 68, 69, 66, 67, 65,
64, 75, 67, 63, 60, 67, 64, 73, 62, 69, 67, 62, 68, 66, 66, 62,
64, 68, NA, 68, 70, 68, 68, 66, 71, 61, 62, 64, 64, 63, 67, 66,
69, 76, NA, 63, 64, 65, 65, 71, 66, 65, 65, 71, 64, 71, 60, 62,
61, 69, 66, NA),
wt_lbs = c(216, 106, 145, 195, 143, 125, 138, 140, 158, 167, 145, 297, 146,
125, 111, 125, 130, 182, 170, 121, 98, 150, 132, 250, 137, 124,
186, 148, 134, 155, 122, 142, 110, 132, 188, 176, 188, 166, 136,
147, 178, 125, 102, 140, 139, 60, 147, 147, 141, 232, 186, 212,
110, 110, 115, 154, 140, 150, 130, NA, 171, 156, 92, 122, 102,
163, 141, NA),
bmi = c(30.99, 18.78, 26.52, 30.54, 22.39, 26.12, 23.69, 20.67, 26.29,
25.39, 25.68, 45.15, 21.56, 20.17, 17.38, 20.8, 22.31, 22.75,
26.62, 21.43, 19.14, 23.49, 22.66, 32.98, 25.05, 18.31, 29.13,
27.07, 20.37, 25.01, 19.69, 25.97, 18.88, 20.07, NA, 26.76,
26.97, 25.24, 20.68, 23.72, 24.82, 23.62, 18.65, 24.03, 23.86,
10.63, 23.02, 23.72, 20.82, 28.24, NA, 37.55, 18.88, 18.3,
19.13, 21.48, 22.59, 24.96, 21.63, NA, 29.35, 21.76, 17.97,
22.31, 19.27, 24.07, 22.76, NA),
bmi_3cat = c(3, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1,
1, 1, 3, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, NA, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, NA, 3, 1, 1, 1, 1, 1, 1, 1, NA, 2, 1,
1, 1, 1, 1, 1, NA)
) %>%
mutate(
age_group = factor(age_group, labels = c("Younger than 30", "30 and Older")),
gender = factor(gender, labels = c("Female", "Male")),
bmi_3cat = factor(bmi_3cat, labels = c("Normal", "Overweight", "Obese"))
) %>%
print()
We can describe our continuous outcome variables using the same methods we learned in previous lessons. However, this time we will use dplyr's
group_by()
function to calculate these statistics within subgroups of interests. For example:
class_summary <- class %>%
filter(!is.na(ht_in)) %>%
group_by(gender) %>%
summarise(
n = n(),
mean = mean(ht_in),
`standard deviation` = sd(ht_in),
min = min(ht_in),
max = max(ht_in)
) %>%
print()
Here’s what we did above:
We used base R’s statistical functions inside dplyr's
summarise()
function to calculate the number of observations, mean, standard deviation, minimum value and maximum value of height within levels of gender.
We used filter(!is.na(ht_in))
to remove all rows from the data that have a missing value for “ht_in”. If we had not done so, R would have returned a value of “NA” for mean, standard deviation, min, and max. Alternatively, we could have added the na.rm = TRUE
option to each of the mean()
, sd()
, min()
, and max()
functions.
We used group_by(gender)
to calculate our statistics of interest separately within each category of the variable “gender.” In this case, “Female” and “Male.”
You may notice that I used backticks around the variable name “standard deviation” – NOT single quotes. If you want to include a space in a variable name in R, you must surround it with backticks. In general, it’s a really bad idea to create variable names with spaces in them. I recommend only doing so in situations where you are using a data frame to display summary information, as we did above.
Notice too that I saved our summary statistics table as data frame named “class_summary.” Doing so is sometimes useful, especially for plotting as we will see below.
As you look over this table, you should have an idea of whether male or female students in the class appear to be taller on average, and whether male or female students in the class appear to have more dispersion around the mean value.
Finally, let’s plot this data to get a feel for the relationship between gender and height graphically.
class %>%
filter(!is.na(ht_in)) %>%
ggplot(aes(x = gender, y = ht_in)) +
geom_jitter(aes(col = gender), width = 0.20) +
geom_segment(
aes(x = c(0.75, 1.75), y = mean, xend = c(1.25, 2.25), yend = mean, col = gender),
size = 1.5, data = class_summary
) +
scale_x_discrete("Gender") +
scale_y_continuous("Height (Inches)") +
scale_color_manual(values = c("#BC581A", "#00519B")) +
theme_classic() +
theme(legend.position = "none", axis.text.x = element_text(size = 12))
Here’s what we did above:
We used ggplot()
to plot each student’s height as well as the mean heights of female and male students respectively.
The geom_jitter()
function plots a point for each student’s height, and then makes slight random adjustments to the location of the points so that they are less likely to overlap. One of the great things about plotting our data like this is that we can quickly see if there are many more observations in one category than another. That information would be obscured if we were to use a box plot.
The geom_segment()
function creates the two horizontal lines at the mean values of height. Notice we used a different data frame – class_summary – using the data = class_summary
argument to plot the mean values.
We changed the x and y axis titles using the scale_x_discrete()
and scale_y_continuous()
functions.
We changed the default ggplot colors to orange and blue using the scale_color_manual()
function.
We simplified the plot using the theme_classic()
function.
theme(legend.position = "none", axis.text.x = element_text(size = 12))
removed the legend and increased the size of the x-axis labels a little bit.
After checking both numerical and graphical descriptions of the relationship between gender and height we may conclude that male students were taller, on average, than female students.
At times we may be interested in comparing continuous outcomes across levels of two or more categorical variables. As an example, perhaps we want to describe BMI by gender and age group. All we have to do is add age group to the group_by()
function.
class_summary <- class %>%
filter(!is.na(bmi)) %>%
group_by(gender, age_group) %>%
summarise(
n = n(),
mean = mean(bmi),
`standard deviation` = sd(bmi),
min = min(bmi),
max = max(bmi)
) %>%
print()
And we can see these statistics for BMI within levels of gender separately for younger and older students. Males that are 30 and older report, on average, the highest BMI (28.6). Females age 30 and older report, on average, the lowest BMI (21.8). This is good information, but often when comparing groups, a picture really is worth a thousand words. Let’s wrap up this lesson with one final plot.
class %>%
filter(!is.na(bmi)) %>%
ggplot(aes(x = age_group, y = bmi)) +
facet_wrap(vars(gender)) +
geom_jitter(aes(col = age_group), width = 0.20) +
geom_segment(
aes(x = rep(c(0.75, 1.75), 2), y = mean, xend = rep(c(1.25, 2.25), 2), yend = mean,
col = age_group),
size = 1.5, data = class_summary
) +
scale_x_discrete("Age Group") +
scale_y_continuous("BMI") +
scale_color_manual(values = c("#BC581A", "#00519B")) +
theme_classic() +
theme(legend.position = "none", axis.text.x = element_text(size = 10))
Here’s what we did above:
facet_wrap(vars(gender))
to plot males and females on separate plot panels.