Combining Data in R

Introduction

In this tutorial, we will be learning about a few different functions that can be used to combine two (or more) datasets within R. This is useful when you want to supplement your existing data with additional data in order to compute new variables that are relevant to your story.

Specifically, we will be covering the following functions from the tidyverse metapackage:

  • left_join()

  • right_join()

  • full_join()

  • inner_join()

We will then illustrate the utility of those functions by combining data from The Washington Post and the U.S. Census Bureau in order to calculate a per capita rate that could not be computed with a single dataset.

The Different Kinds of Join

The dplyr package (part of tidyverse) provides us with four different options for joining data. Simply put, a join operation takes the variables from two datasets and combines them into a new dataset, matching up the observations from the two datasets as it goes.

However, there are different ways to add those variables to the dataset. That is, depending on the type of join you use, you will get a resulting dataset that includes a different combination of observations. (If you are familiar with SQL joins, you will see that R generally handles them in similar ways.)

There are three key elements to any kind of join:

  • The left data frame (also called the primary or target data frame).

  • The right data frame (also called the secondary or supplemental data frame).

  • The key variable(s) (a variable, or set of variables, that appears in both data frames and allows you to correctly associate the observations/rows between the two data frames).

If this sounds confusing at first, rest assured that it will make more sense as we cover each join.
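For example, here is a minimal, self-contained sketch of those three elements in code (the toy data frames here are hypothetical placeholders, not the ones we will use below):

library(tidyverse)

# The left (primary) data frame
left_df <- tibble(key = c("a", "b"), left_value = c(1, 2))

# The right (secondary) data frame
right_df <- tibble(key = c("b", "c"), right_value = c(10, 20))

# "key" is the key variable that appears in both data frames
left_join(x = left_df, y = right_df, by = "key")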

Our sample data

For demonstrative purposes, we are going to draw from two data frames.

The first one, my_pokedex, includes the following information about the pokémon I have collected in my travels:

my_pokedex
number name type
6 Charizard Fire
25 Pikachu Electric
37 Vulpix Fire
58 Growlithe Fire
65 Alakazam Psychic
145 Zapdos Electric

The second data frame, pokemon_types, includes information about the strengths and weaknesses of three types of pokémon.

pokemon_types
type strengths weaknesses
Fire Grass, Ice, Bug, Steel Water, Ground, Rock
Electric Water, Flying Ground
Water Fire, Ground, Rock Electric, Grass

Throughout our examples, my_pokedex will be the primary/target data frame (it will appear as the x argument in my *_join() functions) and pokemon_types will be the secondary/supplemental data frame (the y argument).

The key/matching variable that is common to both data frames is type (the by argument).

Don’t forget to load tidyverse

Remember, all of our *_join() functions come from the dplyr package (which is loaded by the tidyverse metapackage). We have to make sure we load it or we will get errors telling us that the function cannot be found.

library(tidyverse)
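If you’d like to follow along, here is one way you might recreate the two sample data frames yourself, using the tribble() helper (loaded as part of tidyverse) to type in the values from the tables above:

# Recreate the sample data frames row by row
my_pokedex <- tribble(
  ~number, ~name,       ~type,
  6,       "Charizard", "Fire",
  25,      "Pikachu",   "Electric",
  37,      "Vulpix",    "Fire",
  58,      "Growlithe", "Fire",
  65,      "Alakazam",  "Psychic",
  145,     "Zapdos",    "Electric"
)

pokemon_types <- tribble(
  ~type,      ~strengths,               ~weaknesses,
  "Fire",     "Grass, Ice, Bug, Steel", "Water, Ground, Rock",
  "Electric", "Water, Flying",          "Ground",
  "Water",    "Fire, Ground, Rock",     "Electric, Grass"
)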
The left join

The most common way to combine two datasets is to use the left_join() function. With this type of join, you will keep all the observations in the primary data frame and drop all of the observations in the secondary data frame that are not matched.

Here is what happens when I perform a left join of my two data frames:

left_join(x=my_pokedex, y=pokemon_types, by="type")
number name type strengths weaknesses
6 Charizard Fire Grass, Ice, Bug, Steel Water, Ground, Rock
25 Pikachu Electric Water, Flying Ground
37 Vulpix Fire Grass, Ice, Bug, Steel Water, Ground, Rock
58 Growlithe Fire Grass, Ice, Bug, Steel Water, Ground, Rock
65 Alakazam Psychic NA NA
145 Zapdos Electric Water, Flying Ground

The *_join() functions expect the key(s) in the by argument to be supplied as a string (or a vector of strings). Put another way, we need to enclose the variable name(s) in quotation marks or R will give us an error.
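(As an aside, if you are running dplyr 1.1.0 or later, the by argument also accepts the join_by() helper, which takes unquoted variable names. A sketch:)

# Equivalent to by = "type" in dplyr 1.1.0 and later
left_join(x = my_pokedex, y = pokemon_types, by = join_by(type))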

Notice that it kept all six of my observations from my_pokedex and added two variables (strengths and weaknesses) from my pokemon_types data frame. Every time there was a “Fire” or “Electric” type pokémon, it appended the corresponding information from the pokemon_types data frame.

However, since there was no row in my pokemon_types data frame with the value “Psychic” under type, it appended missing values (NA) under the new variables for that observation. Similarly, since there was no row in my my_pokedex data frame that had the value “Water” under the variable type, that corresponding information from pokemon_types was dropped (i.e., it did not append any of our “Water” strengths and weaknesses).

The right join

The right_join() function works in similar fashion to left_join(), except it includes all observations from the secondary data frame (including those that have no matching value for the primary data frame) and drops the non-matched observations from the primary data frame.

Here is what happens when I perform a right join of my two data frames:

right_join(x=my_pokedex, y=pokemon_types, by="type")
number name type strengths weaknesses
6 Charizard Fire Grass, Ice, Bug, Steel Water, Ground, Rock
25 Pikachu Electric Water, Flying Ground
37 Vulpix Fire Grass, Ice, Bug, Steel Water, Ground, Rock
58 Growlithe Fire Grass, Ice, Bug, Steel Water, Ground, Rock
145 Zapdos Electric Water, Flying Ground
NA NA Water Fire, Ground, Rock Electric, Grass

Notice that I no longer have an observation for the “Psychic” pokémon originally found in the my_pokedex data frame. I do, however, have an observation for the “Water” pokémon from the pokemon_types data frame (which naturally has NA values for the first two variables, as there were no such pokémon in my my_pokedex data frame).

The full join

The full_join() function is useful when you want to create a resulting data frame that includes all observations from both the primary and secondary data frames. Whenever there are non-matching observations from either data frame, it will fill in those values as NA.

Here is what happens when I perform a full join of my two data frames:

full_join(x=my_pokedex, y=pokemon_types, by="type")
number name type strengths weaknesses
6 Charizard Fire Grass, Ice, Bug, Steel Water, Ground, Rock
25 Pikachu Electric Water, Flying Ground
37 Vulpix Fire Grass, Ice, Bug, Steel Water, Ground, Rock
58 Growlithe Fire Grass, Ice, Bug, Steel Water, Ground, Rock
65 Alakazam Psychic NA NA
145 Zapdos Electric Water, Flying Ground
NA NA Water Fire, Ground, Rock Electric, Grass

Notice that the resulting data frame includes all possible type values (i.e., from either data frame). For the “Psychic” type that is only present in the my_pokedex data frame, it appends NA values to the variables from the pokemon_types data frame. For the “Water” type that is only present in the pokemon_types data frame, it appends NA values to the variables from the my_pokedex data frame.

The inner join

Our final join function is the inner_join(), which is useful when you want to create a resulting data frame that includes only the observations found in both the primary and secondary data frames. Any observations that are not common to the two datasets will be dropped.

Here is what happens when I perform an inner join of my two data frames:

inner_join(x=my_pokedex, y=pokemon_types, by="type")
number name type strengths weaknesses
6 Charizard Fire Grass, Ice, Bug, Steel Water, Ground, Rock
25 Pikachu Electric Water, Flying Ground
37 Vulpix Fire Grass, Ice, Bug, Steel Water, Ground, Rock
58 Growlithe Fire Grass, Ice, Bug, Steel Water, Ground, Rock
145 Zapdos Electric Water, Flying Ground

Notice that we do not have any observations for either the “Psychic” pokémon that was only present in the my_pokedex data frame or the “Water” pokémon that was only present in the pokemon_types data frame.

Other useful filtering joins

There are two other potentially useful filtering joins for comparing data frames. These are less commonly used in practice but can nevertheless be useful for identifying which observations do and do not have matches.

The first is the semi_join(), which only returns the rows from the primary data frame that match those in the secondary data frame (without adding any new variables or values):

semi_join(x=my_pokedex, y=pokemon_types, by="type")
number name type
6 Charizard Fire
25 Pikachu Electric
37 Vulpix Fire
58 Growlithe Fire
145 Zapdos Electric

The second is the anti_join(), which only returns the rows from the primary data frame that do not match those in the secondary data frame (again, without adding any new variables or values).

anti_join(x=my_pokedex, y=pokemon_types, by="type")
number name type
65 Alakazam Psychic

Combining Police Shootings Data

To illustrate the value of combining datasets in the real world, we’ll start by using data from the award-winning Washington Post database on fatal police shootings. These data can be loaded here:

police_shootings <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/fatal_police_shootings_2020.csv")
## Rows: 5935 Columns: 17

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): name, manner_of_death, armed, gender, race, city, state, threat_le...
## dbl  (4): id, age, longitude, latitude
## lgl  (3): signs_of_mental_illness, body_camera, is_geocoding_exact
## date (1): date

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here’s what these data look like:

head(police_shootings)
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera longitude latitude is_geocoding_exact
3 Tim Elliot 2015-01-02 shot gun 53 M A Shelton WA TRUE attack Not fleeing FALSE -123.122 47.247 TRUE
4 Lewis Lee Lembke 2015-01-02 shot gun 47 M W Aloha OR FALSE attack Not fleeing FALSE -122.892 45.487 TRUE
5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23 M H Wichita KS FALSE other Not fleeing FALSE -97.281 37.695 TRUE
8 Matthew Hoffman 2015-01-04 shot toy weapon 32 M W San Francisco CA TRUE attack Not fleeing FALSE -122.422 37.763 TRUE
9 Michael Rodriguez 2015-01-04 shot nail gun 39 M H Evans CO FALSE attack Not fleeing FALSE -104.692 40.384 TRUE
11 Kenneth Joe Brown 2015-01-04 shot gun 18 M W Guthrie OK FALSE attack Not fleeing FALSE -97.423 35.877 TRUE

As you can see, we have a different observation for each time an individual was killed by police. Each observation includes useful information, including the date of the incident (date), the victim’s gender (gender) and race/ethnicity (race), and the location of the incident (city and state).

With these data, we could easily write a story about the number of incidents in which a police officer has killed an unarmed person.
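For instance, here is a quick sketch of how we might count those incidents (this assumes the armed variable uses the label "unarmed", as it does in the John Paul Quintero row above):

# Count incidents in which the person killed was unarmed
police_shootings %>%
  filter(armed == "unarmed") %>%
  count()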

As a journalist in Massachusetts, I may decide that I want to write a story examining the number of fatal shootings that occurred in the Commonwealth in 2020 and comparing that frequency with those of other states to get a sense of where Massachusetts ranks on this issue.

Here’s one way to get that information:

police_shootings_by_state <- police_shootings %>%
  filter(date>="2020-01-01" & date<="2020-12-31") %>%
  group_by(state) %>%
  count() %>%
  arrange(desc(n)) %>%
  ungroup()

I am storing the information in a new data frame object called police_shootings_by_state so I can quickly reference these data later.

Here’s the result of my operation:

police_shootings_by_state
state n
CA 144
FL 92
TX 81
AZ 44
CO 42
GA 40
NC 34
WA 32
TN 28
OK 26
MO 25
NV 25
IN 24
OH 23
KY 20
LA 20
NY 19
PA 19
UT 17
WI 17
AL 16
MI 16
NM 16
OR 15
VA 15
MD 14
SC 14
IL 13
AR 11
MS 10
MT 10
AK 9
MN 8
NJ 8
KS 7
IA 6
ID 6
MA 6
WV 6
CT 5
ME 4
NE 4
NH 4
ND 3
DE 2
HI 2
SD 2
DC 1
WY 1

As we can see, California, Florida, and Texas rank as having the most fatal police shootings. Conversely, the District of Columbia and Wyoming had the fewest, with one apiece. (Massachusetts is tied for the 11th-fewest in the country.)

Immediately, I see that the states with the most incidents are among the most populous ones in the country, and the ones with the fewest are the least populous states. To make a more apples-to-apples comparison, it would be helpful to calculate a per capita rate that takes the state’s population size into account.
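The arithmetic itself is simple: divide the number of incidents by the state’s population, then multiply by 100,000 to express it as a rate per 100,000 residents. For example, California’s 144 incidents against its roughly 39.5 million residents works out to about 0.36 per 100,000, while Massachusetts’s 6 incidents against its roughly 6.9 million residents works out to about 0.09.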

Getting population data

In order to do that, though, I need data about the population size in each state. Unfortunately, my police_shootings data do not contain such information. I thus need to combine it with an additional dataset that contains population information for each state.

One such data source would be the U.S. Census Bureau, which calculates population estimates every year based on surveys it conducts. (The most robust count is its decennial census, but the Bureau is still able to produce useful estimates in the interim years by factoring in smaller surveys conducted on an annual basis.)

On this page, the Bureau provides estimates for the year 2019, which is close enough to the time period covered by the police_shootings_by_state dataset we just created. We can load a simplified extract of those data here:

population_data <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/census_population_estimate_2019.csv")
## Rows: 57 Columns: 7

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): REGION, DIVISION, NAME
## dbl (4): SUMLEV, STATE, CENSUS2010POP, POPESTIMATE2019

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here’s what that Census dataset looks like:

head(population_data)
SUMLEV REGION DIVISION STATE NAME CENSUS2010POP POPESTIMATE2019
10 0 0 0 United States 308745538 328239523
20 1 0 0 Northeast Region 55317240 55982803
20 2 0 0 Midwest Region 66927001 68329004
20 3 0 0 South Region 114555744 125580448
20 4 0 0 West Region 71945553 78347268
40 3 6 1 Alabama 4779736 4903185

The Census dataset gives us the more-accurate (at that time) 2010 Census population count (CENSUS2010POP) as well as the more-recent 2019 population estimate (POPESTIMATE2019) for each state.
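Since we will only need the state name and the 2019 estimate, we could optionally trim the supplemental data frame before joining it (this is not required, but it keeps the joined result from accumulating variables we will never use). A sketch:

# Keep only the key variable and the 2019 estimate
population_data %>%
  select(NAME, POPESTIMATE2019)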

I could thus combine these data with those in my police_shootings_by_state data frame to have both the number of incidents and the 2019 population estimate for each state. This would allow me to calculate the per capita rates.

However, in order to combine those data, I need a variable appearing in both datasets that would allow me to align the data correctly (i.e., ensure the frequency of shootings for Alabama correctly matches up with its population estimate).

In this case, we do have a common identifier in both datasets: the name of the state. This information can be found in the state variable of the police_shootings_by_state data frame and in the NAME variable of the population_data data frame.

However, we have a small problem: the police_shootings_by_state data use state codes (e.g., AL) while our population_data data use full state names (e.g., Alabama).
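To see the problem concretely, here is a sketch of what a direct join on those two variables would produce (using the by = c("x_name" = "y_name") syntax we will cover in a moment): since a code like "CA" never equals a full name like "California", nothing matches.

# No state code ever matches a full state name, so every
# population variable in the result comes back as NA
left_join(x = police_shootings_by_state, y = population_data,
          by = c("state" = "NAME"))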

Using intermediary datasets

In order to connect those two data frames, we’ll need to introduce a third set of data that simply serves as an intermediary to help link the codes with their full names. We sometimes call these kinds of datasets translation tables.

In this case, I can simply use a translation table from this source. I would start by loading it in:

state_translation_table <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/state_code_name_table.csv")
## Rows: 51 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): State, Abbrev, Code

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here’s what that data frame looks like:

head(state_translation_table)
State Abbrev Code
Alabama Ala. AL
Alaska Alaska AK
Arizona Ariz. AZ
Arkansas Ark. AR
California Calif. CA
Colorado Colo. CO

As you can see, we have three variables in this data frame, each representing a different way of identifying a state (observation).

Joining the data

In order to join our data, I thus need to complete two steps.

First step

The first step is to combine my police_shootings_by_state data frame with my state_translation_table data frame in order to have an identifier I can match to my eventual population_data data frame.

Since I only care about the states that are present in my police_shootings_by_state data frame, I should perform a left_join().

It’s important to note that our common variable (the state code) is stored under a different variable name in each dataset (state in police_shootings_by_state and Code in state_translation_table). We’ll thus have to modify our by argument to use a new kind of syntax:

police_shootings_by_state_name <- left_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"))

An alternative, but less efficient, approach would be to rename the column in one of the data frames prior to joining them. Instead, this syntax simply tells R that the first element of our vector matches the second element: that is, the information in the state variable from the x argument data frame is equivalent to the information in the Code variable from the y argument data frame.
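For comparison, here is a sketch of that rename-first approach (it produces the same result as the command above):

# Equivalent, but requires modifying a data frame before joining
police_shootings_by_state_name <- left_join(
  x = police_shootings_by_state,
  y = rename(state_translation_table, state = Code),
  by = "state"
)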

Here’s what our resulting data frame looks like:

head(police_shootings_by_state_name)
state n State Abbrev
CA 144 California Calif.
FL 92 Florida Fla.
TX 81 Texas Tex.
AZ 44 Arizona Ariz.
CO 42 Colorado Colo.
GA 40 Georgia Ga.

Notice that we have our original state and n variables, as well as the new State and Abbrev variables. (The Code variable from our state_translation_table data frame was dropped in favor of the state variable from our primary dataset.)
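(If you ever do want to retain both key columns, the *_join() functions accept a keep argument. A sketch:)

# keep = TRUE retains both the state and Code variables in the result
left_join(x = police_shootings_by_state, y = state_translation_table,
          by = c("state" = "Code"), keep = TRUE)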

Second step

Then, my second step would be to combine my resulting data frame (police_shootings_by_state_name) with the data frame that contains my population information (population_data).

I will again use left_join() since I only care about the states for which I have data. However, I will change my key (by variables) since I am now connecting the full state names. As was the case in the earlier step, we have different variable names for the same information, so we have to adjust our by argument accordingly:

police_shootings_by_state_population <- left_join(x=police_shootings_by_state_name, y=population_data, by=c("State"="NAME"))

Here is what our resulting data frame looks like:

head(police_shootings_by_state_population)
state n State Abbrev SUMLEV REGION DIVISION STATE CENSUS2010POP POPESTIMATE2019
CA 144 California Calif. 40 4 9 6 37253956 39512223
FL 92 Florida Fla. 40 3 5 12 18801310 21477737
TX 81 Texas Tex. 40 3 7 48 25145561 28995881
AZ 44 Arizona Ariz. 40 4 8 4 6392017 7278717
CO 42 Colorado Colo. 40 4 8 8 5029196 5758736
GA 40 Georgia Ga. 40 3 5 13 9687653 10617423
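Before computing any rates, it can be worth verifying that every state actually found a population match; the anti_join() we covered earlier is handy for exactly this kind of check (a sketch; zero rows returned means every state matched):

# Rows from the shootings data with no matching population estimate
anti_join(x = police_shootings_by_state_name, y = population_data,
          by = c("State" = "NAME"))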
Calculating the per capita rate

Now that I have a single data frame with information about the number of incidents in each state as well as that state’s 2019 population estimate, I can easily compute a new variable with the per capita rate using the mutate() function.

Here’s how I might compute a new variable (called per_100000) that contains the number of police shootings per 100,000 residents (based on the 2019 estimates) and sort the resulting data frame in ascending (low to high) order by that new variable:

police_shootings_by_state_population %>%
  select(State, n, POPESTIMATE2019) %>%
  mutate(per_100000=n/POPESTIMATE2019*100000) %>%
  arrange(per_100000)
State n POPESTIMATE2019 per_100000
Massachusetts 6 6892503 0.0870511
New Jersey 8 8882190 0.0900679
New York 19 19453561 0.0976685
Illinois 13 12671821 0.1025898
Connecticut 5 3565287 0.1402412
Hawaii 2 1415872 0.1412557
District of Columbia 1 705749 0.1416934
Minnesota 8 5639632 0.1418532
Pennsylvania 19 12801989 0.1484144
Michigan 16 9986857 0.1602106
Wyoming 1 578759 0.1727835
Virginia 15 8535519 0.1757362
Iowa 6 3155070 0.1901701
Ohio 23 11689100 0.1967645
Delaware 2 973764 0.2053886
Nebraska 4 1934408 0.2067816
South Dakota 2 884659 0.2260758
Maryland 14 6045680 0.2315703
Kansas 7 2913314 0.2402762
South Carolina 14 5148714 0.2719126
Texas 81 28995881 0.2793500
Wisconsin 17 5822434 0.2919741
New Hampshire 4 1359711 0.2941802
Maine 4 1344212 0.2975721
North Carolina 34 10488084 0.3241774
Alabama 16 4903185 0.3263185
West Virginia 6 1792147 0.3347940
Idaho 6 1787065 0.3357460
Mississippi 10 2976149 0.3360047
Oregon 15 4217737 0.3556410
Indiana 24 6732219 0.3564946
California 144 39512223 0.3644442
Arkansas 11 3017804 0.3645035
Georgia 40 10617423 0.3767393
North Dakota 3 762062 0.3936688
Missouri 25 6137428 0.4073368
Tennessee 28 6829174 0.4100057
Washington 32 7614893 0.4202291
Florida 92 21477737 0.4283505
Louisiana 20 4648794 0.4302191
Kentucky 20 4467673 0.4476603
Utah 17 3205958 0.5302627
Arizona 44 7278717 0.6045021
Oklahoma 26 3956971 0.6570682
Colorado 42 5758736 0.7293267
New Mexico 16 2096829 0.7630570
Nevada 25 3080156 0.8116472
Montana 10 1068778 0.9356480
Alaska 9 731545 1.2302729

Now we can see that although Massachusetts was tied for the 11th-fewest fatal police shootings across U.S. states, it actually has the lowest per capita rate of such shootings. Well done, Massachusetts!

Doing it all in one step

While we performed all of our calculations by creating intermediate objects, it’s also worth noting that we could have done all of the above in a single step:

police_shootings %>%
  filter(date>="2020-01-01" & date<="2020-12-31") %>%
  group_by(state) %>%
  count() %>%
  ungroup() %>%
  left_join(y=state_translation_table, by=c("state"="Code")) %>%
  left_join(y=population_data, by=c("State"="NAME")) %>%
  select(State, n, POPESTIMATE2019) %>%
  mutate(per_100000=n/POPESTIMATE2019*100000) %>%
  arrange(per_100000)
State n POPESTIMATE2019 per_100000
Massachusetts 6 6892503 0.0870511
New Jersey 8 8882190 0.0900679
New York 19 19453561 0.0976685
Illinois 13 12671821 0.1025898
Connecticut 5 3565287 0.1402412
Hawaii 2 1415872 0.1412557
District of Columbia 1 705749 0.1416934
Minnesota 8 5639632 0.1418532
Pennsylvania 19 12801989 0.1484144
Michigan 16 9986857 0.1602106
Wyoming 1 578759 0.1727835
Virginia 15 8535519 0.1757362
Iowa 6 3155070 0.1901701
Ohio 23 11689100 0.1967645
Delaware 2 973764 0.2053886
Nebraska 4 1934408 0.2067816
South Dakota 2 884659 0.2260758
Maryland 14 6045680 0.2315703
Kansas 7 2913314 0.2402762
South Carolina 14 5148714 0.2719126
Texas 81 28995881 0.2793500
Wisconsin 17 5822434 0.2919741
New Hampshire 4 1359711 0.2941802
Maine 4 1344212 0.2975721
North Carolina 34 10488084 0.3241774
Alabama 16 4903185 0.3263185
West Virginia 6 1792147 0.3347940
Idaho 6 1787065 0.3357460
Mississippi 10 2976149 0.3360047
Oregon 15 4217737 0.3556410
Indiana 24 6732219 0.3564946
California 144 39512223 0.3644442
Arkansas 11 3017804 0.3645035
Georgia 40 10617423 0.3767393
North Dakota 3 762062 0.3936688
Missouri 25 6137428 0.4073368
Tennessee 28 6829174 0.4100057
Washington 32 7614893 0.4202291
Florida 92 21477737 0.4283505
Louisiana 20 4648794 0.4302191
Kentucky 20 4467673 0.4476603
Utah 17 3205958 0.5302627
Arizona 44 7278717 0.6045021
Oklahoma 26 3956971 0.6570682
Colorado 42 5758736 0.7293267
New Mexico 16 2096829 0.7630570
Nevada 25 3080156 0.8116472
Montana 10 1068778 0.9356480
Alaska 9 731545 1.2302729

This assumes, of course, that we have already read in the data from the three CSV files into their appropriate objects (e.g., state_translation_table).

This helps to keep our working environment clean and ensures that we are always working from the same, unmodified data. (It can be easy to forget which changes you made to the data as you create new intermediary objects, and those changes can affect your calculations going forward.)