In this tutorial, we will be learning about a few different functions that can be used to combine two (or more) datasets within R. This is useful when you want to supplement your existing data with additional data in order to compute new variables that are relevant to your story.
Specifically, we will be covering the following functions from the tidyverse
metapackage:
left_join()
right_join()
full_join()
inner_join()
We will then illustrate the utility of those functions by combining data from The Washington Post and the U.S. Census Bureau in order to calculate a per capita rate that could not be computed with a single dataset.
The dplyr
package (part of tidyverse
) provides us with four different options for joining data. Simply put, a join operation takes the variables from two datasets and combines them into a new dataset whose observations are drawn from both.
However, there are different ways to add those variables to the dataset. That is, depending on the type of join you use, you will get a resulting dataset that includes a different combination of observations. (If you are familiar with SQL joins, you will see that R generally handles them in similar ways.)
There are three key elements to any kind of join:
The left data frame (also called the primary or target data frame).
The right data frame (also called the secondary or supplemental data frame).
The key variable(s) (a variable, or set of variables, that appears in both data frames and allows you to correctly associate the observations/rows between the two data frames).
If this sounds confusing at first, rest assured that it will make more sense as we cover each join.
For demonstrative purposes, we are going to draw from two data frames.
The first one, my_pokedex
, includes the following information about the pokémon I have collected in my travels:
my_pokedex
number | name | type |
---|---|---|
6 | Charizard | Fire |
25 | Pikachu | Electric |
37 | Vulpix | Fire |
58 | Growlithe | Fire |
65 | Alakazam | Psychic |
145 | Zapdos | Electric |
The second data frame, pokemon_types
, includes information about the strengths and weaknesses of three types of pokémon.
pokemon_types
type | strengths | weaknesses |
---|---|---|
Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
Electric | Water, Flying | Ground |
Water | Fire, Ground, Rock | Electric, Grass |
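If you would like to follow along, here is one way you might construct these two data frames yourself once the tidyverse is loaded (a sketch; the values simply mirror the tables above):
my_pokedex <- tibble(
  number = c(6, 25, 37, 58, 65, 145),
  name = c("Charizard", "Pikachu", "Vulpix", "Growlithe", "Alakazam", "Zapdos"),
  type = c("Fire", "Electric", "Fire", "Fire", "Psychic", "Electric")
)

pokemon_types <- tibble(
  type = c("Fire", "Electric", "Water"),
  strengths = c("Grass, Ice, Bug, Steel", "Water, Flying", "Fire, Ground, Rock"),
  weaknesses = c("Water, Ground, Rock", "Ground", "Electric, Grass")
)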
For demonstrative purposes, my_pokedex
will be the primary/target data frame (it will appear as the x
argument in my *_join
functions) and pokemon_types
will be the secondary/supplemental data frame (the y
argument).
The key/matching variable that is common to both data frames is type
(the by
argument).
tidyverse
Remember, all of our *_join()
functions come from the dplyr
package (that is loaded by the tidyverse
metapackage). We have to make sure we load it or we will get errors telling us that the function cannot be found.
library(tidyverse)
The most common way to combine two datasets is to use the left_join()
function. With this type of join, you will keep all the observations in the primary data frame and drop all of the observations in the secondary data frame that are not matched.
Here is what happens when I perform a left join of my two data frames:
left_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
65 | Alakazam | Psychic | NA | NA |
145 | Zapdos | Electric | Water, Flying | Ground |
The *_join()
functions expect the key(s) (by
argument) to be supplied as a string (or a vector of strings). Put another way, we need to enclose the variable name(s) in quotation marks or R will give us an error.
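For instance (a sketch; the second key in the commented-out line, generation, is a made-up variable included only to illustrate the vector syntax):
# A single key, supplied as a quoted string:
left_join(x=my_pokedex, y=pokemon_types, by="type")
# If the data frames shared several key variables, we would supply a
# character vector instead (generation is a hypothetical second key):
# left_join(x=my_pokedex, y=pokemon_types, by=c("type", "generation"))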
Notice that it kept all of my six observations from my_pokedex
and added two variables (strengths
and weaknesses
) from my pokemon_types
data frame. Every time there was a “Fire” or “Electric” type
pokémon, it appended the corresponding information from the pokemon_types
data frame.
However, since there was no row in my pokemon_types
data frame with the value “Psychic” under type
, it appended missing values (NA
) under the new variables for that observation. Similarly, since there was no row in my my_pokedex
data frame that had the value “Water” under the variable type
, that corresponding information from pokemon_types
was dropped (i.e., it did not append any of our “Water” strengths and weaknesses).
The right_join()
function works in similar fashion to left_join()
, except it includes all observations from the secondary data frame (including those that have no matching value in the primary data frame) and drops the non-matched observations from the primary data frame.
Here is what happens when I perform a right join of my two data frames:
right_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
145 | Zapdos | Electric | Water, Flying | Ground |
NA | NA | Water | Fire, Ground, Rock | Electric, Grass |
Notice that I no longer have an observation for the “Psychic” pokémon originally found in the my_pokedex
data frame. I do, however, have an observation for the “Water” pokémon from the pokemon_types
data frame (which naturally has NA
values for the first two variables, as there were no such pokémon in my my_pokedex
data frame).
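As an aside, a right join returns the same matched observations as a left join with the two data frames swapped; only the order of the columns (and of the rows) in the result differs. A quick sketch:
# These two calls return the same matches; only column and row order differ
right_join(x=my_pokedex, y=pokemon_types, by="type")
left_join(x=pokemon_types, y=my_pokedex, by="type")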
The full_join()
function is useful when you want to create a resulting data frame that includes all observations from both the primary and secondary data frames. Whenever there are non-matching observations from either data frame, it will fill in those values as NA
.
Here is what happens when I perform a full join of my two data frames:
full_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
65 | Alakazam | Psychic | NA | NA |
145 | Zapdos | Electric | Water, Flying | Ground |
NA | NA | Water | Fire, Ground, Rock | Electric, Grass |
Notice that the resulting data frame includes all possible type
values (i.e., from either data frame). For the “Psychic” type that is only present in the my_pokedex
data frame, it appends NA
values to the variables from the pokemon_types
data frame. For the “Water” type that is only present in the pokemon_types
data frame, it appends NA
values to the variables from the my_pokedex
data frame.
Our final join function is the inner_join()
, which is useful when you want to create a resulting data frame that includes only the observations found in both the primary and secondary data frames. Any observations that are not common to the two datasets will be dropped.
Here is what happens when I perform an inner join of my two data frames:
inner_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type | strengths | weaknesses |
---|---|---|---|---|
6 | Charizard | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
25 | Pikachu | Electric | Water, Flying | Ground |
37 | Vulpix | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
58 | Growlithe | Fire | Grass, Ice, Bug, Steel | Water, Ground, Rock |
145 | Zapdos | Electric | Water, Flying | Ground |
Notice that we do not have any observations for either the “Psychic” pokémon that was only present in the my_pokedex
data frame or the “Water” pokémon that was only present in the pokemon_types
data frame.
There are two other potentially useful filtering joins for comparing data frames. These are less commonly used in practice but can nevertheless be helpful for identifying which observations do and do not have a match in the other data frame.
The first is the semi_join()
, which only returns the rows from the primary data frame that match those in the secondary data frame (without adding any new variables or values):
semi_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type |
---|---|---|
6 | Charizard | Fire |
25 | Pikachu | Electric |
37 | Vulpix | Fire |
58 | Growlithe | Fire |
145 | Zapdos | Electric |
The second is the anti_join()
, which only returns the rows from the primary data frame that do not match those in the secondary data frame (again, without adding any new variables or values).
anti_join(x=my_pokedex, y=pokemon_types, by="type")
number | name | type |
---|---|---|
65 | Alakazam | Psychic |
To illustrate the value of combining datasets in the real world, we’ll start by using data from the award-winning Washington Post database on fatal police shootings. These data can be loaded here:
police_shootings <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/fatal_police_shootings_2020.csv")
## Rows: 5935 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): name, manner_of_death, armed, gender, race, city, state, threat_le...
## dbl (4): id, age, longitude, latitude
## lgl (3): signs_of_mental_illness, body_camera, is_geocoding_exact
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here’s what these data look like:
head(police_shootings)
id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | longitude | latitude | is_geocoding_exact |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Tim Elliot | 2015-01-02 | shot | gun | 53 | M | A | Shelton | WA | TRUE | attack | Not fleeing | FALSE | -123.122 | 47.247 | TRUE |
4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47 | M | W | Aloha | OR | FALSE | attack | Not fleeing | FALSE | -122.892 | 45.487 | TRUE |
5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23 | M | H | Wichita | KS | FALSE | other | Not fleeing | FALSE | -97.281 | 37.695 | TRUE |
8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32 | M | W | San Francisco | CA | TRUE | attack | Not fleeing | FALSE | -122.422 | 37.763 | TRUE |
9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39 | M | H | Evans | CO | FALSE | attack | Not fleeing | FALSE | -104.692 | 40.384 | TRUE |
11 | Kenneth Joe Brown | 2015-01-04 | shot | gun | 18 | M | W | Guthrie | OK | FALSE | attack | Not fleeing | FALSE | -97.423 | 35.877 | TRUE |
As you can see, we have a separate observation for each incident in which an individual was fatally shot by police. Each observation includes useful information, including the date of the incident (date
), the victim’s gender (gender
) and race/ethnicity (race
), and the location of the incident (city
and state
).
With these data, we could easily write a story about the number of incidents in which a police officer has killed an unarmed person.
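For example, here is a quick sketch of how we might count those incidents, assuming (as the head() output above suggests) that the armed variable uses the value "unarmed":
# Count incidents in which the person was recorded as unarmed
police_shootings %>%
  filter(armed == "unarmed") %>%
  count()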
As a journalist in Massachusetts, I may decide that I want to write a story examining the number of fatal shootings that occurred in the Commonwealth in 2020, and compare that frequency with those of other states to get a sense of where Massachusetts ranks when it comes to this issue.
Here’s one way to get that information:
police_shootings_by_state <- police_shootings %>%
filter(date>="2020-01-01" & date<="2020-12-31") %>%
group_by(state) %>%
count() %>%
arrange(desc(n)) %>%
ungroup()
I am storing the information in a new data frame object called police_shootings_by_state
so I can quickly reference these data later.
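As an aside, count() can take the grouping variable directly and sort the result for us, so an equivalent (and more compact) way to build the same object would be:
# Equivalent sketch: count(state, sort=TRUE) replaces the
# group_by() %>% count() %>% arrange(desc(n)) %>% ungroup() steps
police_shootings_by_state <- police_shootings %>%
  filter(date>="2020-01-01" & date<="2020-12-31") %>%
  count(state, sort=TRUE)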
Here’s the result of my operation:
police_shootings_by_state
state | n |
---|---|
CA | 144 |
FL | 92 |
TX | 81 |
AZ | 44 |
CO | 42 |
GA | 40 |
NC | 34 |
WA | 32 |
TN | 28 |
OK | 26 |
MO | 25 |
NV | 25 |
IN | 24 |
OH | 23 |
KY | 20 |
LA | 20 |
NY | 19 |
PA | 19 |
UT | 17 |
WI | 17 |
AL | 16 |
MI | 16 |
NM | 16 |
OR | 15 |
VA | 15 |
MD | 14 |
SC | 14 |
IL | 13 |
AR | 11 |
MS | 10 |
MT | 10 |
AK | 9 |
MN | 8 |
NJ | 8 |
KS | 7 |
IA | 6 |
ID | 6 |
MA | 6 |
WV | 6 |
CT | 5 |
ME | 4 |
NE | 4 |
NH | 4 |
ND | 3 |
DE | 2 |
HI | 2 |
SD | 2 |
DC | 1 |
WY | 1 |
As we can see, California, Florida, and Texas rank as having the most fatal police shootings. Conversely, the District of Columbia and Wyoming had the fewest, followed by Delaware, Hawaii, and South Dakota. (Massachusetts is tied for the 11th-fewest in the country.)
Immediately, I see that the states with the most incidents are among the most populous ones in the country, and the ones with the fewest are the least populous states. To make a more apples-to-apples comparison, it would be helpful to calculate a per capita rate that takes the state’s population size into account.
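To make that calculation concrete, here is the arithmetic for a single state, using California's figures (its 2019 population estimate comes from the Census data we load below):
# Per capita rate = incidents / population * 100,000
# California: 144 fatal shootings in 2020 and roughly 39.5 million residents
144/39512223*100000  # roughly 0.36 fatal shootings per 100,000 residents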
In order to do that, though, I need data about the population size in each state. Unfortunately, my police_shootings
data do not contain such information. I thus need to combine it with an additional dataset that contains population information for each state.
One such data source would be the U.S. Census Bureau, which calculates population estimates every year based on surveys that it conducts. (The most robust count is its decennial census, but it is still able to produce useful estimates in the interim years by factoring in smaller surveys conducted on an annual basis.)
On this page, the Bureau provides estimates for the year 2019, which is close enough to the time period covered by the police_shootings_by_state
dataset we just created. We can load a simplified extract of those data here:
population_data <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/census_population_estimate_2019.csv")
## Rows: 57 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): REGION, DIVISION, NAME
## dbl (4): SUMLEV, STATE, CENSUS2010POP, POPESTIMATE2019
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here’s what that Census dataset looks like.
head(population_data)
SUMLEV | REGION | DIVISION | STATE | NAME | CENSUS2010POP | POPESTIMATE2019 |
---|---|---|---|---|---|---|
10 | 0 | 0 | 0 | United States | 308745538 | 328239523 |
20 | 1 | 0 | 0 | Northeast Region | 55317240 | 55982803 |
20 | 2 | 0 | 0 | Midwest Region | 66927001 | 68329004 |
20 | 3 | 0 | 0 | South Region | 114555744 | 125580448 |
20 | 4 | 0 | 0 | West Region | 71945553 | 78347268 |
40 | 3 | 6 | 1 | Alabama | 4779736 | 4903185 |
The Census dataset gives us both the more accurate (but older) 2010 census population count (CENSUS2010POP) and the more recent 2019 population estimate (POPESTIMATE2019) for each state.
I could thus combine these data with those in my police_shootings_by_state
data frame to have both the number of incidents and the 2019 population estimate for each state. This would allow me to calculate the per capita rates.
However, in order to combine those data, I need a variable appearing in both datasets that would allow me to align the data correctly (i.e., ensure the frequency of shootings for Alabama correctly matches up with its population estimate).
In this case, we do have a common identifier in both datasets: the name of the state. This information can be found in the state
variable of the police_shootings_by_state
data frame and in the NAME
variable of the population_data
data frame.
However, we have a small problem: the police_shootings_by_state
data use state codes (e.g., AL
) while our population_data
data use full state names (e.g., Alabama
).
In order to connect those two data frames, we’ll need to introduce a third set of data that simply serves as an intermediary to help link the codes with their full names. We sometimes call these kinds of datasets translation tables.
In this case, I can simply use a translation table from this source. I would start by loading it in:
state_translation_table <- read_csv("https://books.rodrigozamith.com/data-driven-storytelling/files/state_code_name_table.csv")
## Rows: 51 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): State, Abbrev, Code
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here’s what that data frame looks like:
head(state_translation_table)
State | Abbrev | Code |
---|---|---|
Alabama | Ala. | AL |
Alaska | Alaska | AK |
Arizona | Ariz. | AZ |
Arkansas | Ark. | AR |
California | Calif. | CA |
Colorado | Colo. | CO |
As you can see, we have three variables in this data frame, each representing a different way of identifying a state (observation).
In order to join our data, I thus need to complete two steps.
The first step is to combine my police_shootings_by_state
data frame with my state_translation_table
data frame in order to have an identifier I can match to my eventual population_data
data frame.
Since I only care about the states that are present in my police_shootings_by_state
data frame, I should perform a left_join()
.
It’s important to note that our common variable (the state code) is represented by a different variable in each dataset (state
in police_shootings_by_state
and Code
in state_translation_table
). We’ll thus have to modify our by
argument to use a new kind of syntax:
police_shootings_by_state_name <- left_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"))
An alternative, but less efficient, approach would be to rename the columns in the two data frames prior to joining them. Instead, this syntax simply tells R that the key named on the left side of the pair (the state variable in the x data frame) corresponds to the key named on the right side (the Code variable in the y data frame).
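As an aside, if you are using dplyr 1.1.0 or later, the join_by() helper offers another way to express the same relationship (a sketch; the named-vector form above works in every version):
# Equivalent join using join_by(): the left-hand name refers to the x data
# frame and the right-hand name to the y data frame
police_shootings_by_state_name <- left_join(x=police_shootings_by_state, y=state_translation_table, by=join_by(state == Code))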
Here’s what our resulting data frame looks like:
head(police_shootings_by_state_name)
state | n | State | Abbrev |
---|---|---|---|
CA | 144 | California | Calif. |
FL | 92 | Florida | Fla. |
TX | 81 | Texas | Tex. |
AZ | 44 | Arizona | Ariz. |
CO | 42 | Colorado | Colo. |
GA | 40 | Georgia | Ga. |
Notice that we have our original state
and n
variables, as well as the new State
and Abbrev
variables. (The Code
variable from our state_translation_table
data frame was dropped in favor of the state
variable from our primary dataset.)
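(If you ever want to retain the key columns from both data frames, the mutating join functions accept a keep argument; a quick sketch:)
# keep=TRUE keeps both key columns, so state and Code would each appear
left_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"), keep=TRUE)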
Then, my second step would be to combine my resulting data frame (police_shootings_by_state_name
) with the data frame that contains my population information (population_data
).
I will again use left_join()
since I only care about the states for which I have data. However, I will change my key (by
variables) since I am now connecting the full state names. As was the case in the earlier step, we have different variable names for the same information, so we have to adjust our by
argument accordingly:
police_shootings_by_state_population <- left_join(x=police_shootings_by_state_name, y=population_data, by=c("State"="NAME"))
Here is what our resulting data frame looks like:
head(police_shootings_by_state_population)
state | n | State | Abbrev | SUMLEV | REGION | DIVISION | STATE | CENSUS2010POP | POPESTIMATE2019 |
---|---|---|---|---|---|---|---|---|---|
CA | 144 | California | Calif. | 40 | 4 | 9 | 6 | 37253956 | 39512223 |
FL | 92 | Florida | Fla. | 40 | 3 | 5 | 12 | 18801310 | 21477737 |
TX | 81 | Texas | Tex. | 40 | 3 | 7 | 48 | 25145561 | 28995881 |
AZ | 44 | Arizona | Ariz. | 40 | 4 | 8 | 4 | 6392017 | 7278717 |
CO | 42 | Colorado | Colo. | 40 | 4 | 8 | 8 | 5029196 | 5758736 |
GA | 40 | Georgia | Ga. | 40 | 3 | 5 | 13 | 9687653 | 10617423 |
Now that I have a single data frame with information about the number of incidents in each state as well as that state’s 2019 population estimate, I can easily compute a new variable with the per capita rate using the mutate()
function.
Here’s how I might compute a new variable called (per_100000
) that contains the number of police shootings per 100,000 residents (based on the 2019 estimates), and sort the resulting data frame in ascending (low to high) order based on that new variable.
police_shootings_by_state_population %>%
select(State, n, POPESTIMATE2019) %>%
mutate(per_100000=n/POPESTIMATE2019*100000) %>%
arrange(per_100000)
State | n | POPESTIMATE2019 | per_100000 |
---|---|---|---|
Massachusetts | 6 | 6892503 | 0.0870511 |
New Jersey | 8 | 8882190 | 0.0900679 |
New York | 19 | 19453561 | 0.0976685 |
Illinois | 13 | 12671821 | 0.1025898 |
Connecticut | 5 | 3565287 | 0.1402412 |
Hawaii | 2 | 1415872 | 0.1412557 |
District of Columbia | 1 | 705749 | 0.1416934 |
Minnesota | 8 | 5639632 | 0.1418532 |
Pennsylvania | 19 | 12801989 | 0.1484144 |
Michigan | 16 | 9986857 | 0.1602106 |
Wyoming | 1 | 578759 | 0.1727835 |
Virginia | 15 | 8535519 | 0.1757362 |
Iowa | 6 | 3155070 | 0.1901701 |
Ohio | 23 | 11689100 | 0.1967645 |
Delaware | 2 | 973764 | 0.2053886 |
Nebraska | 4 | 1934408 | 0.2067816 |
South Dakota | 2 | 884659 | 0.2260758 |
Maryland | 14 | 6045680 | 0.2315703 |
Kansas | 7 | 2913314 | 0.2402762 |
South Carolina | 14 | 5148714 | 0.2719126 |
Texas | 81 | 28995881 | 0.2793500 |
Wisconsin | 17 | 5822434 | 0.2919741 |
New Hampshire | 4 | 1359711 | 0.2941802 |
Maine | 4 | 1344212 | 0.2975721 |
North Carolina | 34 | 10488084 | 0.3241774 |
Alabama | 16 | 4903185 | 0.3263185 |
West Virginia | 6 | 1792147 | 0.3347940 |
Idaho | 6 | 1787065 | 0.3357460 |
Mississippi | 10 | 2976149 | 0.3360047 |
Oregon | 15 | 4217737 | 0.3556410 |
Indiana | 24 | 6732219 | 0.3564946 |
California | 144 | 39512223 | 0.3644442 |
Arkansas | 11 | 3017804 | 0.3645035 |
Georgia | 40 | 10617423 | 0.3767393 |
North Dakota | 3 | 762062 | 0.3936688 |
Missouri | 25 | 6137428 | 0.4073368 |
Tennessee | 28 | 6829174 | 0.4100057 |
Washington | 32 | 7614893 | 0.4202291 |
Florida | 92 | 21477737 | 0.4283505 |
Louisiana | 20 | 4648794 | 0.4302191 |
Kentucky | 20 | 4467673 | 0.4476603 |
Utah | 17 | 3205958 | 0.5302627 |
Arizona | 44 | 7278717 | 0.6045021 |
Oklahoma | 26 | 3956971 | 0.6570682 |
Colorado | 42 | 5758736 | 0.7293267 |
New Mexico | 16 | 2096829 | 0.7630570 |
Nevada | 25 | 3080156 | 0.8116472 |
Montana | 10 | 1068778 | 0.9356480 |
Alaska | 9 | 731545 | 1.2302729 |
Now we can see that although Massachusetts was tied for only the 11th-fewest fatal police shootings among U.S. states, it actually has the lowest per capita rate of such shootings. Well done, Massachusetts!
While we performed all of our calculations by creating intermediate objects, it’s also worth noting that we could have done all of the above in a single step:
police_shootings %>%
filter(date>="2020-01-01" & date<="2020-12-31") %>%
group_by(state) %>%
count() %>%
ungroup() %>%
left_join(y=state_translation_table, by=c("state"="Code")) %>%
left_join(y=population_data, by=c("State"="NAME")) %>%
select(State, n, POPESTIMATE2019) %>%
mutate(per_100000=n/POPESTIMATE2019*100000) %>%
arrange(per_100000)
State | n | POPESTIMATE2019 | per_100000 |
---|---|---|---|
Massachusetts | 6 | 6892503 | 0.0870511 |
New Jersey | 8 | 8882190 | 0.0900679 |
New York | 19 | 19453561 | 0.0976685 |
Illinois | 13 | 12671821 | 0.1025898 |
Connecticut | 5 | 3565287 | 0.1402412 |
Hawaii | 2 | 1415872 | 0.1412557 |
District of Columbia | 1 | 705749 | 0.1416934 |
Minnesota | 8 | 5639632 | 0.1418532 |
Pennsylvania | 19 | 12801989 | 0.1484144 |
Michigan | 16 | 9986857 | 0.1602106 |
Wyoming | 1 | 578759 | 0.1727835 |
Virginia | 15 | 8535519 | 0.1757362 |
Iowa | 6 | 3155070 | 0.1901701 |
Ohio | 23 | 11689100 | 0.1967645 |
Delaware | 2 | 973764 | 0.2053886 |
Nebraska | 4 | 1934408 | 0.2067816 |
South Dakota | 2 | 884659 | 0.2260758 |
Maryland | 14 | 6045680 | 0.2315703 |
Kansas | 7 | 2913314 | 0.2402762 |
South Carolina | 14 | 5148714 | 0.2719126 |
Texas | 81 | 28995881 | 0.2793500 |
Wisconsin | 17 | 5822434 | 0.2919741 |
New Hampshire | 4 | 1359711 | 0.2941802 |
Maine | 4 | 1344212 | 0.2975721 |
North Carolina | 34 | 10488084 | 0.3241774 |
Alabama | 16 | 4903185 | 0.3263185 |
West Virginia | 6 | 1792147 | 0.3347940 |
Idaho | 6 | 1787065 | 0.3357460 |
Mississippi | 10 | 2976149 | 0.3360047 |
Oregon | 15 | 4217737 | 0.3556410 |
Indiana | 24 | 6732219 | 0.3564946 |
California | 144 | 39512223 | 0.3644442 |
Arkansas | 11 | 3017804 | 0.3645035 |
Georgia | 40 | 10617423 | 0.3767393 |
North Dakota | 3 | 762062 | 0.3936688 |
Missouri | 25 | 6137428 | 0.4073368 |
Tennessee | 28 | 6829174 | 0.4100057 |
Washington | 32 | 7614893 | 0.4202291 |
Florida | 92 | 21477737 | 0.4283505 |
Louisiana | 20 | 4648794 | 0.4302191 |
Kentucky | 20 | 4467673 | 0.4476603 |
Utah | 17 | 3205958 | 0.5302627 |
Arizona | 44 | 7278717 | 0.6045021 |
Oklahoma | 26 | 3956971 | 0.6570682 |
Colorado | 42 | 5758736 | 0.7293267 |
New Mexico | 16 | 2096829 | 0.7630570 |
Nevada | 25 | 3080156 | 0.8116472 |
Montana | 10 | 1068778 | 0.9356480 |
Alaska | 9 | 731545 | 1.2302729 |
This assumes, of course, that we have already read in the data from the three CSV files into their appropriate objects (e.g., state_translation_table
).
This helps to keep our working environment clean and ensures that we are always working from the same, unmodified data. (It can be easy to forget which changes you made to the data as you create new intermediary objects, and those changes can affect your calculations going forward.)
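Regardless of which approach you take, it is worth checking the result of each join before moving on, for example by looking for rows that failed to match (a sketch):
# Any rows with a missing population estimate indicate a failed match
police_shootings_by_state_population %>%
  filter(is.na(POPESTIMATE2019))

# Alternatively, anti_join() reveals which states would have no match in the
# translation table before we even perform the join
anti_join(x=police_shootings_by_state, y=state_translation_table, by=c("state"="Code"))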