In this tutorial, we will be learning about a few different functions that can be used to combine two (or more) datasets within R. This is useful when you want to supplement your existing data with additional data in order to compute new variables that are relevant to your story.
Specifically, we will be covering the following functions from the tidyverse
metapackage:
left_join()
right_join()
full_join()
inner_join()
We will then illustrate the utility of those functions by combining data from The Washington Post and the U.S. Census Bureau in order to calculate a per capita rate that could not be computed with a single dataset.
The dplyr
package (part of tidyverse
) provides us with four different options for joining data. Simply put, a join operation takes the variables from two datasets and combines them into a new dataset whose observations are drawn from both.
However, there are different ways to add those variables to the dataset. That is, depending on the type of join you use, you will get a resulting dataset that includes a different combination of observations. (If you are familiar with SQL joins, you will see that R generally handles them in similar ways.)
There are three key elements to any kind of join:
The left data frame (also called the primary or target data frame).
The right data frame (also called the secondary or supplemental data frame).
The key variable(s) (a variable, or set of variables, that appears in both data frames and allows you to correctly associate the observations/rows between the two data frames).
If this sounds confusing at first, rest assured that it will make more sense as we cover each join.
For demonstrative purposes, we are going to draw from two data frames.
The first one, my_pokedex
, includes the following information about the pokémon I have collected in my travels:
my_pokedex
number | name | type |
---|---|---|
6 | Charizard | Fire |
25 | Pikachu | Electric |
37 | Vulpix | Fire |
58 | Growlithe | Fire |
65 | Alakazam | Psychic |
145 | Zapdos | Electric |
The second data frame, pokemon_types
, includes information about the strengths and weaknesses of three types of pokémon.
pokemon_types
type | strengths | weaknesses |
---|---|---|
Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
Electric | Water, Flying | Ground |
Water | Fire, Ground, Rock | Electric, Grass |
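If you would like to follow along, here is one way you might construct these two data frames yourself once the tidyverse is loaded (a sketch; the values simply mirror the tables above):
my_pokedex <- tibble(
  number = c(6, 25, 37, 58, 65, 145),
  name = c("Charizard", "Pikachu", "Vulpix", "Growlithe", "Alakazam", "Zapdos"),
  type = c("Fire", "Electric", "Fire", "Fire", "Psychic", "Electric")
)

pokemon_types <- tibble(
  type = c("Fire", "Electric", "Water"),
  strengths = c("Grass, Ice, Bug, Steel", "Water, Flying", "Fire, Ground, Rock"),
  weaknesses = c("Water, Ground, Rock", "Ground", "Electric, Grass")
)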
For demonstrative purposes, my_pokedex
will be the primary/target data frame (it will appear as the x
argument in my *_join
functions) and pokemon_types
will be the secondary/supplemental data frame (the y
argument).
The key/matching variable that is common to both data frames is type
(the by
argument).
tidyverse
Remember, all of our *_join()
functions come from the dplyr
package (that is loaded by the tidyverse
metapackage). We have to make sure we load it or we will get errors telling us that the function cannot be found.
library(tidyverse)
The most common way to combine two datasets is to use the left_join()
function. With this type of join, you will keep all the observations in the primary data frame and drop all of the observations in the secondary data frame that are not matched.
Here is what happens when I perform a left join of my two data frames:
left_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
65 | Alakazam | Psychic | NA | NA |
145 | Zapdos | Electric | Water, Flying | Ground |
The *_join()
functions expect the key(s) (by
argument) to be supplied as a string (or a vector of strings). Put another way, we need to enclose the variable name(s) in quotation marks or R will give us an error.
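For instance (a sketch; the second key in the commented-out line, generation, is a made-up variable included only to illustrate the vector syntax):
# A single key, supplied as a quoted string:
left_join(x=my_pokedex, y=pokemon_types, by="type")
# If the data frames shared several key variables, we would supply a
# character vector instead (generation is a hypothetical second key):
# left_join(x=my_pokedex, y=pokemon_types, by=c("type", "generation"))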
Notice that it kept all of my six observations from my_pokedex
and added two variables (strengths
and weaknesses
) from my pokemon_types
data frame. Every time there was a “Fire” or “Electric” type
pokémon, it appended the corresponding information from the pokemon_types
data frame.
However, since there was no row in my pokemon_types
data frame with the value “Psychic” under type
, it appended missing values (NA
) under the new variables for that observation. Similarly, since there was no row in my my_pokedex
data frame that had the value “Water” under the variable type
, that corresponding information from pokemon_types
was dropped (i.e., it did not append any of our “Water” strengths and weaknesses).
The right_join()
function works in similar fashion to left_join()
, except it includes all observations from the secondary data frame (including those that have no matching value in the primary data frame) and drops the non-matched observations from the primary data frame.
Here is what happens when I perform a right join of my two data frames:
right_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
145 | Zapdos | Electric | Water, Flying | Ground |
NA | NA | Water | Fire, Ground, Rock | Electric, Grass |
Notice that I no longer have an observation for the “Psychic” pokémon originally found in the my_pokedex
data frame. I do, however, have an observation for the “Water” pokémon from the pokemon_types
data frame (which naturally has NA
values for the first two variables, as there were no such pokémon in my my_pokedex
data frame).
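As an aside, a right join returns the same matched observations as a left join with the two data frames swapped; only the order of the columns (and of the rows) in the result differs. A quick sketch:
# These two calls return the same matches; only column and row order differ
right_join(x=my_pokedex, y=pokemon_types, by="type")
left_join(x=pokemon_types, y=my_pokedex, by="type")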
The full_join()
function is useful when you want to create a resulting data frame that includes all observations from both the primary and secondary data frames. Whenever there are non-matching observations from either data frame, it will fill in those values as NA
.
Here is what happens when I perform a full join of my two data frames:
full_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
65 | Alakazam | Psychic | NA | NA |
145 | Zapdos | Electric | Water, Flying | Ground |
NA | NA | Water | Fire, Ground, Rock | Electric, Grass |
Notice that the resulting data frame includes all possible type
values (i.e., from either data frame). For the “Psychic” type that is only present in the my_pokedex
data frame, it appends NA
values to the variables from the pokemon_types
data frame. For the “Water” type that is only present in the pokemon_types
data frame, it appends NA
values to the variables from the my_pokedex
data frame.
Our final join function is the inner_join()
, which is useful when you want to create a resulting data frame that includes only the observations found in both the primary and secondary data frames. Any observations that are not common to the two datasets will be dropped.
Here is what happens when I perform an inner join of my two data frames:
inner_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
145 | Zapdos | Electric | Water, Flying | Ground |
Notice that we do not have any observations for either the “Psychic” pokémon that was only present in the my_pokedex
data frame or the “Water” pokémon that was only present in the pokemon_types
data frame.
There are two other potentially useful filtering joins for comparing data frames. These are less commonly used in practice but can nevertheless be helpful for identifying which observations do and do not have a match in the other data frame.
The first is the semi_join()
, which only returns the rows from the primary data frame that match those in the secondary data frame (without adding any new variables or values):
semi_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type |
---|---|---|
6 | Charizard | Fire |
25 | Pikachu | Electric |
37 | Vulpix | Fire |
58 | Growlithe | Fire |
145 | Zapdos | Electric |
The second is the anti_join()
, which only returns the rows from the primary data frame that do not match those in the secondary data frame (again, without adding any new variables or values).
anti_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type |
---|---|---|
65 | Alakazam | Psychic |
To illustrate the value of combining datasets in the real world, we’ll start by using data from the award-winning Washington Post database on fatal police shootings. These data can be loaded here:
police_shootings <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/fatal_police_shootings_2020.csv")
## Rows: 5935 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): name, manner_of_death, armed, gender, race, city, state, threat_le...
## dbl (4): id, age, longitude, latitude
## lgl (3): signs_of_mental_illness, body_camera, is_geocoding_exact
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here’s what these data look like:
head(police_shootings)
id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | longitude | latitude | is_geocoding_exact |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Tim Elliot | 2015-01-02 | shot | gun | 53 | M | A | Shelton | WA | TRUE | attack | Not fleeing | FALSE | -123.122 | 47.247 | TRUE |
4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47 | M | W | Aloha | OR | FALSE | attack | Not fleeing | FALSE | -122.892 | 45.487 | TRUE |
5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23 | M | H | Wichita | KS | FALSE | other | Not fleeing | FALSE | -97.281 | 37.695 | TRUE |
8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32 | M | W | San Francisco | CA | TRUE | attack | Not fleeing | FALSE | -122.422 | 37.763 | TRUE |
9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39 | M | H | Evans | CO | FALSE | attack | Not fleeing | FALSE | -104.692 | 40.384 | TRUE |
11 | Kenneth Joe Brown | 2015-01-04 | shot | gun | 18 | M | W | Guthrie | OK | FALSE | attack | Not fleeing | FALSE | -97.423 | 35.877 | TRUE |
As you can see, we have a separate observation for each incident in which an individual was fatally shot by police. Each observation includes useful information, including the date of the incident (date
), the victim’s gender (gender
) and race/ethnicity (race
), and the location of the incident (city
and state
).
With these data, we could easily write a story about the number of incidents in which a police officer has killed an unarmed person.
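For example, here is a quick sketch of how we might count those incidents, assuming (as the head() output above suggests) that the armed variable uses the value "unarmed":
# Count incidents in which the person was recorded as unarmed
police_shootings %>%
  filter(armed == "unarmed") %>%
  count()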
As a journalist in Massachusetts, I may decide that I want to write a story examining the number of fatal shootings that occurred in the Commonwealth in 2020, and compare that frequency with those of other states to get a sense of where Massachusetts ranks when it comes to this issue.
Here’s one way to get that information:
police_shootings_by_state <- police_shootings %>%
filter(date>="2020-01-01" & date<="2020-12-31") %>%
group_by(state) %>%
count() %>%
arrange(desc(n)) %>%
ungroup()
I am storing the information in a new data frame object called police_shootings_by_state
so I can quickly reference these data later.
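As an aside, count() can take the grouping variable directly and sort the result for us, so an equivalent (and more compact) way to build the same object would be:
# Equivalent sketch: count(state, sort=TRUE) replaces the
# group_by() %>% count() %>% arrange(desc(n)) %>% ungroup() steps
police_shootings_by_state <- police_shootings %>%
  filter(date>="2020-01-01" & date<="2020-12-31") %>%
  count(state, sort=TRUE)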
Here’s the result of my operation:
police_shootings_by_state
state | n |
---|---|
CA | 144 |
FL | 92 |
TX | 81 |
AZ | 44 |
CO | 42 |
GA | 40 |
NC | 34 |
WA | 32 |
TN | 28 |
OK | 26 |
MO | 25 |
NV | 25 |
IN | 24 |
OH | 23 |
KY | 20 |
LA | 20 |
NY | 19 |
PA | 19 |
UT | 17 |
WI | 17 |
AL | 16 |
MI | 16 |
NM | 16 |
OR | 15 |
VA | 15 |
MD | 14 |
SC | 14 |
IL | 13 |
AR | 11 |
MS | 10 |
MT | 10 |
AK | 9 |
MN | 8 |
NJ | 8 |
KS | 7 |
IA | 6 |
ID | 6 |
MA | 6 |
WV | 6 |
CT | 5 |
ME | 4 |
NE | 4 |
NH | 4 |
ND | 3 |
DE | 2 |
HI | 2 |
SD | 2 |
DC | 1 |
WY | 1 |
As we can see, California, Florida, and Texas rank as having the most fatal police shootings. Conversely, the District of Columbia and Wyoming had the fewest, followed by Delaware, Hawaii, and South Dakota. (Massachusetts is tied for the 11th-fewest in the country.)
Immediately, I see that the states with the most incidents are among the most populous ones in the country, and the ones with the fewest are the least populous states. To make a more apples-to-apples comparison, it would be helpful to calculate a per capita rate that takes the state’s population size into account.
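To make that calculation concrete, here is the arithmetic for a single state, using California's figures (its 2019 population estimate comes from the Census data we load below):
# Per capita rate = incidents / population * 100,000
# California: 144 fatal shootings in 2020 and roughly 39.5 million residents
144/39512223*100000  # roughly 0.36 fatal shootings per 100,000 residents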
In order to do that, though, I need data about the population size in each state. Unfortunately, my police_shootings
data do not contain such information. I thus need to combine it with an additional dataset that contains population information for each state.
One such data source would be the U.S. Census Bureau, which calculates population estimates every year based on surveys that it conducts. (The most robust count is its decennial census, but it is still able to produce useful estimates in the interim years by factoring in smaller surveys conducted on an annual basis.)
On this page, the Bureau provides estimates for the year 2019, which is close enough to the time period covered by the police_shootings_by_state
dataset we just created. We can load a simplified extract of those data here:
population_data <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/census_population_estimate_2019.csv")
## Rows: 57 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): REGION, DIVISION, NAME
## dbl (4): SUMLEV, STATE, CENSUS2010POP, POPESTIMATE2019
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here’s what that Census dataset looks like.
head(population_data)
SUMLEV | REGION | DIVISION | STATE | NAME | CENSUS2010POP | POPESTIMATE2019 |
---|---|---|---|---|---|---|
10 | 0 | 0 | 0 | United States | 308745538 | 328239523 |
20 | 1 | 0 | 0 | Northeast Region | 55317240 | 55982803 |
20 | 2 | 0 | 0 | Midwest Region | 66927001 | 68329004 |
20 | 3 | 0 | 0 | South Region | 114555744 | 125580448 |
20 | 4 | 0 | 0 | West Region | 71945553 | 78347268 |
40 | 3 | 6 | 1 | Alabama | 4779736 | 4903185 |
The Census dataset gives us both the more accurate (but older) 2010 census population count (CENSUS2010POP) and the more recent 2019 population estimate (POPESTIMATE2019) for each state.
I could thus combine these data with those in my police_shootings_by_state
data frame to have both the number of incidents and the 2019 population estimate for each state. This would allow me to calculate the per capita rates.
However, in order to combine those data, I need a variable appearing in both datasets that would allow me to align the data correctly (i.e., ensure the frequency of shootings for Alabama correctly matches up with its population estimate).
In this case, we do have a common identifier in both datasets: the name of the state. This information can be found in the state
variable of the police_shootings_by_state
data frame and in the NAME
variable of the population_data
data frame.
However, we have a small problem: the police_shootings_by_state
data use state codes (e.g., AL
) while our population_data
data use full state names (e.g., Alabama
).
In order to connect those two data frames, we’ll need to introduce a third set of data that simply serves as an intermediary to help link the codes with their full names. We sometimes call these kinds of datasets translation tables.
In this case, I can simply use a translation table from this source. I would start by loading it in:
state_translation_table <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/state_code_name_table.csv")
## Rows: 51 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): State, Abbrev, Code
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here’s what that data frame looks like:
head(state_translation_table)
State | Abbrev | Code |
---|---|---|
Alabama | Ala. | AL |
Alaska | Alaska | AK |
Arizona | Ariz. | AZ |
Arkansas | Ark. | AR |
California | Calif. | CA |
Colorado | Colo. | CO |
As you can see, we have three variables in this data frame, each representing a different way of identifying a state (observation).
In order to join our data, I thus need to complete two steps.
The first step is to combine my police_shootings_by_state
data frame with my state_translation_table
data frame in order to have an identifier I can match to my eventual population_data
data frame.
Since I only care about the states that are present in my police_shootings_by_state
data frame, I should perform a left_join()
.
It’s important to note that our common variable (the state code) is represented by a different variable in each dataset (state
in police_shootings_by_state
and Code
in state_translation_table
). We’ll thus have to modify our by
argument to use a new kind of syntax:
police_shootings_by_state_name <- left_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"))
An alternative, but less efficient, approach would be to rename the columns in the two data frames prior to joining them. Instead, this syntax simply tells R that the key named on the left side of the pair (the state variable in the x data frame) corresponds to the key named on the right side (the Code variable in the y data frame).
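As an aside, if you are using dplyr 1.1.0 or later, the join_by() helper offers another way to express the same relationship (a sketch; the named-vector form above works in every version):
# Equivalent join using join_by(): the left-hand name refers to the x data
# frame and the right-hand name to the y data frame
police_shootings_by_state_name <- left_join(x=police_shootings_by_state, y=state_translation_table, by=join_by(state == Code))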
Here’s what our resulting data frame looks like:
head(police_shootings_by_state_name)
state | n | State | Abbrev |
---|---|---|---|
CA | 144 | California | Calif. |
FL | 92 | Florida | Fla. |
TX | 81 | Texas | Tex. |
AZ | 44 | Arizona | Ariz. |
CO | 42 | Colorado | Colo. |
GA | 40 | Georgia | Ga. |
Notice that we have our original state
and n
variables, as well as the new State
and Abbrev
variables. (The Code
variable from our state_translation_table
data frame was dropped in favor of the state
variable from our primary dataset.)
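(If you ever want to retain the key columns from both data frames, the mutating join functions accept a keep argument; a quick sketch:)
# keep=TRUE keeps both key columns, so state and Code would each appear
left_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"), keep=TRUE)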
Then, my second step would be to combine my resulting data frame (police_shootings_by_state_name
) with the data frame that contains my population information (population_data
).
I will again use left_join()
since I only care about the states for which I have data. However, I will change my key (by
variables) since I am now connecting the full state names. As was the case in the earlier step, we have different variable names for the same information, so we have to adjust our by
argument accordingly:
police_shootings_by_state_population <- left_join(x=police_shootings_by_state_name, y=population_data, by=c("State"="NAME"))
Here is what our resulting data frame looks like:
head(police_shootings_by_state_population)
state | n | State | Abbrev | SUMLEV | REGION | DIVISION | STATE | CENSUS2010POP | POPESTIMATE2019 |
---|---|---|---|---|---|---|---|---|---|
CA | 144 | California | Calif. | 40 | 4 | 9 | 6 | 37253956 | 39512223 |
FL | 92 | Florida | Fla. | 40 | 3 | 5 | 12 | 18801310 | 21477737 |
TX | 81 | Texas | Tex. | 40 | 3 | 7 | 48 | 25145561 | 28995881 |
AZ | 44 | Arizona | Ariz. | 40 | 4 | 8 | 4 | 6392017 | 7278717 |
CO | 42 | Colorado | Colo. | 40 | 4 | 8 | 8 | 5029196 | 5758736 |
GA | 40 | Georgia | Ga. | 40 | 3 | 5 | 13 | 9687653 | 10617423 |
Now that I have a single data frame with information about the number of incidents in each state as well as that state’s 2019 population estimate, I can easily compute a new variable with the per capita rate using the mutate()
function.
Here’s how I might compute a new variable called (per_100000
) that contains the number of police shootings per 100,000 residents (based on the 2019 estimates), and sort the resulting data frame in ascending (low to high) order based on that new variable.
police_shootings_by_state_population %>%
select(State, n, POPESTIMATE2019) %>%
mutate(per_100000=n/POPESTIMATE2019*100000) %>%
arrange(per_100000)
State | n | POPESTIMATE2019 | per_100000 |
---|---|---|---|
Massachusetts | 6 | 6892503 | 0.0870511 |
New Jersey | 8 | 8882190 | 0.0900679 |
New York | 19 | 19453561 | 0.0976685 |
Illinois | 13 | 12671821 | 0.1025898 |
Connecticut | 5 | 3565287 | 0.1402412 |
Hawaii | 2 | 1415872 | 0.1412557 |
District of Columbia | 1 | 705749 | 0.1416934 |
Minnesota | 8 | 5639632 | 0.1418532 |
Pennsylvania | 19 | 12801989 | 0.1484144 |
Michigan | 16 | 9986857 | 0.1602106 |
Wyoming | 1 | 578759 | 0.1727835 |
Virginia | 15 | 8535519 | 0.1757362 |
Iowa | 6 | 3155070 | 0.1901701 |
Ohio | 23 | 11689100 | 0.1967645 |
Delaware | 2 | 973764 | 0.2053886 |
Nebraska | 4 | 1934408 | 0.2067816 |
South Dakota | 2 | 884659 | 0.2260758 |
Maryland | 14 | 6045680 | 0.2315703 |
Kansas | 7 | 2913314 | 0.2402762 |
South Carolina | 14 | 5148714 | 0.2719126 |
Texas | 81 | 28995881 | 0.2793500 |
Wisconsin | 17 | 5822434 | 0.2919741 |
New Hampshire | 4 | 1359711 | 0.2941802 |
Maine | 4 | 1344212 | 0.2975721 |
North Carolina | 34 | 10488084 | 0.3241774 |
Alabama | 16 | 4903185 | 0.3263185 |
West Virginia | 6 | 1792147 | 0.3347940 |
Idaho | 6 | 1787065 | 0.3357460 |
Mississippi | 10 | 2976149 | 0.3360047 |
Oregon | 15 | 4217737 | 0.3556410 |
Indiana | 24 | 6732219 | 0.3564946 |
California | 144 | 39512223 | 0.3644442 |
Arkansas | 11 | 3017804 | 0.3645035 |
Georgia | 40 | 10617423 | 0.3767393 |
North Dakota | 3 | 762062 | 0.3936688 |
Missouri | 25 | 6137428 | 0.4073368 |
Tennessee | 28 | 6829174 | 0.4100057 |
Washington | 32 | 7614893 | 0.4202291 |
Florida | 92 | 21477737 | 0.4283505 |
Louisiana | 20 | 4648794 | 0.4302191 |
Kentucky | 20 | 4467673 | 0.4476603 |
Utah | 17 | 3205958 | 0.5302627 |
Arizona | 44 | 7278717 | 0.6045021 |
Oklahoma | 26 | 3956971 | 0.6570682 |
Colorado | 42 | 5758736 | 0.7293267 |
New Mexico | 16 | 2096829 | 0.7630570 |
Nevada | 25 | 3080156 | 0.8116472 |
Montana | 10 | 1068778 | 0.9356480 |
Alaska | 9 | 731545 | 1.2302729 |
Now we can see that although Massachusetts was tied for only the 11th-fewest fatal police shootings among U.S. states, it actually has the lowest per capita rate of such shootings. Well done, Massachusetts!
While we performed all of our calculations by creating intermediate objects, it’s also worth noting that we could have done all of the above in a single step:
police_shootings %>%
filter(date>="2020-01-01" & date<="2020-12-31") %>%
group_by(state) %>%
count() %>%
ungroup() %>%
left_join(y=state_translation_table, by=c("state"="Code")) %>%
left_join(y=population_data, by=c("State"="NAME")) %>%
select(State, n, POPESTIMATE2019) %>%
mutate(per_100000=n/POPESTIMATE2019*100000) %>%
arrange(per_100000)
State | n | POPESTIMATE2019 | per_100000 |
---|---|---|---|
Massachusetts | 6 | 6892503 | 0.0870511 |
New Jersey | 8 | 8882190 | 0.0900679 |
New York | 19 | 19453561 | 0.0976685 |
Illinois | 13 | 12671821 | 0.1025898 |
Connecticut | 5 | 3565287 | 0.1402412 |
Hawaii | 2 | 1415872 | 0.1412557 |
District of Columbia | 1 | 705749 | 0.1416934 |
Minnesota | 8 | 5639632 | 0.1418532 |
Pennsylvania | 19 | 12801989 | 0.1484144 |
Michigan | 16 | 9986857 | 0.1602106 |
Wyoming | 1 | 578759 | 0.1727835 |
Virginia | 15 | 8535519 | 0.1757362 |
Iowa | 6 | 3155070 | 0.1901701 |
Ohio | 23 | 11689100 | 0.1967645 |
Delaware | 2 | 973764 | 0.2053886 |
Nebraska | 4 | 1934408 | 0.2067816 |
South Dakota | 2 | 884659 | 0.2260758 |
Maryland | 14 | 6045680 | 0.2315703 |
Kansas | 7 | 2913314 | 0.2402762 |
South Carolina | 14 | 5148714 | 0.2719126 |
Texas | 81 | 28995881 | 0.2793500 |
Wisconsin | 17 | 5822434 | 0.2919741 |
New Hampshire | 4 | 1359711 | 0.2941802 |
Maine | 4 | 1344212 | 0.2975721 |
North Carolina | 34 | 10488084 | 0.3241774 |
Alabama | 16 | 4903185 | 0.3263185 |
West Virginia | 6 | 1792147 | 0.3347940 |
Idaho | 6 | 1787065 | 0.3357460 |
Mississippi | 10 | 2976149 | 0.3360047 |
Oregon | 15 | 4217737 | 0.3556410 |
Indiana | 24 | 6732219 | 0.3564946 |
California | 144 | 39512223 | 0.3644442 |
Arkansas | 11 | 3017804 | 0.3645035 |
Georgia | 40 | 10617423 | 0.3767393 |
North Dakota | 3 | 762062 | 0.3936688 |
Missouri | 25 | 6137428 | 0.4073368 |
Tennessee | 28 | 6829174 | 0.4100057 |
Washington | 32 | 7614893 | 0.4202291 |
Florida | 92 | 21477737 | 0.4283505 |
Louisiana | 20 | 4648794 | 0.4302191 |
Kentucky | 20 | 4467673 | 0.4476603 |
Utah | 17 | 3205958 | 0.5302627 |
Arizona | 44 | 7278717 | 0.6045021 |
Oklahoma | 26 | 3956971 | 0.6570682 |
Colorado | 42 | 5758736 | 0.7293267 |
New Mexico | 16 | 2096829 | 0.7630570 |
Nevada | 25 | 3080156 | 0.8116472 |
Montana | 10 | 1068778 | 0.9356480 |
Alaska | 9 | 731545 | 1.2302729 |
This assumes, of course, that we have already read in the data from the three CSV files into their appropriate objects (e.g., state_translation_table
).
This helps to keep our working environment clean and ensures that we are always working from the same, unmodified data. (It can be easy to forget which changes you made to the data as you create new intermediary objects, and those changes can affect your calculations going forward.)
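Regardless of which approach you take, it is worth checking the result of each join before moving on, for example by looking for rows that failed to match (a sketch):
# Any rows with a missing population estimate indicate a failed match
police_shootings_by_state_population %>%
  filter(is.na(POPESTIMATE2019))

# Alternatively, anti_join() reveals which states would have no match in the
# translation table before we even perform the join
anti_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"))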