In this tutorial, we will be applying some of the tidyverse
functions to an initial exploratory data analysis of state-level data from the Eviction Lab. Specifically, we’ll be learning how to:
Use the %in% operator.
Perform partial string matching.
Compute your own variables, including per capita rates and percent changes.
Use the subscript notation to pull out specific information.
Perform lagged calculations like year-over-year changes.
Calculate the linear correlation between two variables.
Recode values.
The Eviction Lab has collected data from 80 million records around the country and you can read about the methodology here. As always, it is helpful to review the data dictionary to understand the meaning of the different variables (what each variable name corresponds to). If you want more detail about any of the variables, I encourage you to review the full Methodology Report.
The first step, of course, is to read in the data. Like before, we’ll use the readr::read_csv()
function:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag() masks stats::lag()
us_states <- read_csv("https://dds.rodrigozamith.com/files/evictions_us_states.csv")
## Rows: 867 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): GEOID, name, parent-location
## dbl (24): year, population, poverty-rate, renter-occupied-households, pct-re...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Normally, we would carefully verify that the data imported correctly. Because we’ve previously evaluated the data from this source, we can just take a quick look at the first few observations to confirm everything came in as expected.
head(us_states)
GEOID | year | name | parent-location | population | poverty-rate | renter-occupied-households | pct-renter-occupied | median-gross-rent | median-household-income | median-property-value | rent-burden | pct-white | pct-af-am | pct-hispanic | pct-am-ind | pct-asian | pct-nh-pi | pct-multiple | pct-other | eviction-filings | evictions | eviction-rate | eviction-filing-rate | low-flag | imputed | subbed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | 2000 | Alabama | USA | 4447100 | 16.10 | 255910 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 7585 | 4583 | 1.79 | 2.96 | 0 | 0 | 0 |
01 | 2001 | Alabama | USA | 4447100 | 16.10 | 486061 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 20828 | 10106 | 2.08 | 4.29 | 0 | 0 | 0 |
01 | 2002 | Alabama | USA | 4447100 | 16.10 | 495329 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 21070 | 9603 | 1.94 | 4.25 | 0 | 0 | 0 |
01 | 2003 | Alabama | USA | 4447100 | 16.10 | 456138 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 15750 | 7379 | 1.62 | 3.45 | 0 | 0 | 0 |
01 | 2004 | Alabama | USA | 4447100 | 16.10 | 446058 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 13508 | 6170 | 1.38 | 3.03 | 0 | 0 | 0 |
01 | 2005 | Alabama | USA | 4633360 | 12.86 | 449922 | 29.24 | 621 | 41216 | 111900 | 29.3 | 68.50 | 26.01 | 2.81 | 0.46 | 1.0 | 0.03 | 1.10 | 0.08 | 15426 | 8300 | 1.84 | 3.43 | 0 | 0 | 0 |
Just like before, it appears we imported the data correctly. Each observation represents a different state (name
) at a different point in time (year
).
The %in% operator
While most of the data crunching we do as data journalists involves numbers, we sometimes have to work with strings, too.
So, I’d like to begin this more advanced data analysis by asking my data: Which New England state had the highest eviction rate in 2016?
Our dataset does not have a variable covering the region represented by the state. Instead, we’ll have to create our own filter that only includes those states. We can turn to the U.S. Census Bureau-designated regions and divisions for help in determining which states are part of New England.
Rather than creating a filter with a bunch of different OR conditions, I can just use the %in%
operator and pair it with a vector of values that I would like to include. Put another way, I can use the filter()
function and supply the conditional as variable %in% c("vector", "with", "multiple", "values")
.
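To see what the operator does on its own, here’s a quick sketch using a couple of hand-typed vectors (the values are made up for illustration); %in% returns a TRUE or FALSE for each element on the left, and filter() keeps the rows that come back TRUE:
c("Maine", "Texas", "Vermont") %in% c("Maine", "Vermont")
## [1]  TRUE FALSE  TRUE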
With this information in mind, here’s what I’d need to do to answer that question:
Call the data from my data frame.
Apply a filter to cover the desired year and the desired states.
Sort the data in descending order based on the variable of interest.
Optional: Select only the variables of interest to me, so I can more quickly see the answer.
Now, let’s translate that logic into code:
us_states %>%
filter(name %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") & year == 2016) %>%
arrange(desc(`eviction-rate`)) %>%
select(year, name, `eviction-rate`)
year | name | eviction-rate |
---|---|---|
2016 | Rhode Island | 3.07 |
2016 | Connecticut | 3.04 |
2016 | Maine | 2.26 |
2016 | New Hampshire | 1.70 |
2016 | Massachusetts | 1.52 |
2016 | Vermont | 0.09 |
As we can see, Rhode Island was the New England state with the highest eviction rate in 2016.
str_starts() and str_ends()
Similarly, sometimes we want to create a filter for our data so we only include observations that begin with or end with certain text.
In this case, I’d like to ask my data: Which of the Virginias had the highest average eviction rate between 2013 and 2016?
One approach would be to create a filter with just those two state names. However, I can also use the stringr::str_ends()
function to perform a search that looks for partial matches by evaluating the end of the string.
stringr
is automatically loaded by the tidyverse
package.
The arguments I need to provide str_ends()
function are (1) the variable I’m evaluating (the string
argument, which comes first) and (2) the pattern or value that must be present at the end for it to yield a TRUE
response (the pattern
argument, which comes second). As a reminder, the TRUE
responses are the ones included in the results when we apply a filter with the filter()
function.
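To see those TRUE and FALSE responses directly, here’s a small sketch with a few hand-typed state names (not drawn from our data frame):
str_ends(c("Virginia", "West Virginia", "Vermont"), "Virginia")
## [1]  TRUE  TRUE FALSE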
So, let’s try to create a filter (filter()
) that only includes observations where the name
variable ends (str_ends()
) with the string "Virginia"
:
us_states %>%
filter(str_ends(name, "Virginia"))
GEOID | year | name | parent-location | population | poverty-rate | renter-occupied-households | pct-renter-occupied | median-gross-rent | median-household-income | median-property-value | rent-burden | pct-white | pct-af-am | pct-hispanic | pct-am-ind | pct-asian | pct-nh-pi | pct-multiple | pct-other | eviction-filings | evictions | eviction-rate | eviction-filing-rate | low-flag | imputed | subbed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
51 | 2000 | Virginia | USA | 7078515 | 9.59 | 664083 | 31.91 | 650 | 46677 | 125400 | 24.5 | 70.15 | 19.44 | 4.66 | 0.26 | 3.66 | 0.05 | 1.61 | 0.17 | 77522 | 40945 | 6.17 | 11.67 | 0 | 0 | 0 |
51 | 2001 | Virginia | USA | 7078515 | 9.59 | 734527 | 31.91 | 650 | 46677 | 125400 | 24.5 | 70.15 | 19.44 | 4.66 | 0.26 | 3.66 | 0.05 | 1.61 | 0.17 | 83498 | 44970 | 6.12 | 11.37 | 0 | 0 | 0 |
51 | 2002 | Virginia | USA | 7078515 | 9.59 | 761196 | 31.91 | 650 | 46677 | 125400 | 24.5 | 70.15 | 19.44 | 4.66 | 0.26 | 3.66 | 0.05 | 1.61 | 0.17 | 123429 | 49799 | 6.54 | 16.22 | 0 | 0 | 0 |
51 | 2003 | Virginia | USA | 7078515 | 9.59 | 885087 | 31.91 | 650 | 46677 | 125400 | 24.5 | 70.15 | 19.44 | 4.66 | 0.26 | 3.66 | 0.05 | 1.61 | 0.17 | 127023 | 52790 | 5.96 | 14.35 | 0 | 0 | 0 |
51 | 2004 | Virginia | USA | 7078515 | 9.59 | 900780 | 31.91 | 650 | 46677 | 125400 | 24.5 | 70.15 | 19.44 | 4.66 | 0.26 | 3.66 | 0.05 | 1.61 | 0.17 | 153631 | 55869 | 6.20 | 17.06 | 0 | 0 | 0 |
51 | 2005 | Virginia | USA | 7721730 | 7.16 | 914428 | 30.84 | 931 | 60316 | 247100 | 28.7 | 66.98 | 19.31 | 6.67 | 0.23 | 4.78 | 0.07 | 1.70 | 0.25 | 156448 | 51745 | 5.66 | 17.11 | 0 | 0 | 0 |
51 | 2006 | Virginia | USA | 7721730 | 7.16 | 928075 | 30.84 | 931 | 60316 | 247100 | 28.7 | 66.98 | 19.31 | 6.67 | 0.23 | 4.78 | 0.07 | 1.70 | 0.25 | 158878 | 50920 | 5.49 | 17.12 | 0 | 0 | 0 |
51 | 2007 | Virginia | USA | 7721730 | 7.16 | 613794 | 30.84 | 931 | 60316 | 247100 | 28.7 | 66.98 | 19.31 | 6.67 | 0.23 | 4.78 | 0.07 | 1.70 | 0.25 | 66538 | 37676 | 6.14 | 10.84 | 0 | 0 | 0 |
51 | 2008 | Virginia | USA | 7721730 | 7.16 | 623822 | 30.84 | 931 | 60316 | 247100 | 28.7 | 66.98 | 19.31 | 6.67 | 0.23 | 4.78 | 0.07 | 1.70 | 0.25 | 65233 | 42797 | 6.86 | 10.46 | 0 | 0 | 0 |
51 | 2009 | Virginia | USA | 7721730 | 7.16 | 629765 | 30.84 | 931 | 60316 | 247100 | 28.7 | 66.98 | 19.31 | 6.67 | 0.23 | 4.78 | 0.07 | 1.70 | 0.25 | 62974 | 38357 | 6.09 | 10.00 | 0 | 0 | 0 |
51 | 2010 | Virginia | USA | 8001024 | 7.79 | 685038 | 32.75 | 1060 | 63636 | 249700 | 29.9 | 64.82 | 19.04 | 7.90 | 0.26 | 5.45 | 0.06 | 2.27 | 0.19 | 94057 | 45125 | 6.59 | 13.73 | 0 | 0 | 0 |
51 | 2011 | Virginia | USA | 8256630 | 8.22 | 866175 | 33.82 | 1116 | 65015 | 245000 | 29.9 | 63.44 | 18.90 | 8.59 | 0.21 | 5.93 | 0.06 | 2.66 | 0.22 | 144511 | 50013 | 5.77 | 16.68 | 0 | 0 | 0 |
51 | 2012 | Virginia | USA | 8256630 | 8.22 | 881194 | 33.82 | 1116 | 65015 | 245000 | 29.9 | 63.44 | 18.90 | 8.59 | 0.21 | 5.93 | 0.06 | 2.66 | 0.22 | 138282 | 48198 | 5.47 | 15.69 | 0 | 0 | 0 |
51 | 2013 | Virginia | USA | 8256630 | 8.22 | 906176 | 33.82 | 1116 | 65015 | 245000 | 29.9 | 63.44 | 18.90 | 8.59 | 0.21 | 5.93 | 0.06 | 2.66 | 0.22 | 151841 | 51134 | 5.64 | 16.76 | 0 | 0 | 0 |
51 | 2014 | Virginia | USA | 8256630 | 8.22 | 917828 | 33.82 | 1116 | 65015 | 245000 | 29.9 | 63.44 | 18.90 | 8.59 | 0.21 | 5.93 | 0.06 | 2.66 | 0.22 | 147094 | 51101 | 5.57 | 16.03 | 0 | 0 | 0 |
51 | 2015 | Virginia | USA | 8256630 | 8.22 | 996441 | 33.82 | 1116 | 65015 | 245000 | 29.9 | 63.44 | 18.90 | 8.59 | 0.21 | 5.93 | 0.06 | 2.66 | 0.22 | 144220 | 50213 | 5.04 | 14.47 | 0 | 0 | 0 |
51 | 2016 | Virginia | USA | 8256630 | 8.22 | 1012175 | 33.82 | 1116 | 65015 | 245000 | 29.9 | 63.44 | 18.90 | 8.59 | 0.21 | 5.93 | 0.06 | 2.66 | 0.22 | 146534 | 51821 | 5.12 | 14.48 | 0 | 0 | 0 |
We are using str_ends()
(which comes with the stringr
package) and not ends_with()
(which comes with the sister dplyr
package). dplyr::ends_with()
is a separate function used to select multiple variables from a data frame based on patterns in the variable names, and can be useful when you have a large dataset with hundreds of variables and want a sub-group of them. stringr::str_ends()
evaluates a given string or vector to see if it matches the condition, yielding a TRUE
or FALSE
response that can serve as the basis of a filter.
Unlike prior instances of dplyr::filter()
where we specified a variable, an evaluation operator, and the condition (e.g., name == "Virginia"
), the str_ends()
function expects a different syntax. This is common in R, especially when you are using functions from different packages. Thus, be sure to read each function’s documentation to understand what it’s looking for.
We can similarly look at the start of a string by using the str_starts()
function, which also comes with stringr. For example, let’s include only the states (name
) that begin with the string "North"
:
us_states %>%
filter(str_starts(name, "North"))
GEOID | year | name | parent-location | population | poverty-rate | renter-occupied-households | pct-renter-occupied | median-gross-rent | median-household-income | median-property-value | rent-burden | pct-white | pct-af-am | pct-hispanic | pct-am-ind | pct-asian | pct-nh-pi | pct-multiple | pct-other | eviction-filings | evictions | eviction-rate | eviction-filing-rate | low-flag | imputed | subbed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37 | 2000 | North Carolina | USA | 8049313 | 12.28 | 393465 | 30.64 | 548 | 39184 | 108300 | 24.3 | 70.16 | 21.41 | 4.71 | 1.18 | 1.40 | 0.04 | 0.99 | 0.11 | 26050 | 22292 | 5.67 | 6.62 | 0 | 0 | 0 |
37 | 2001 | North Carolina | USA | 8049313 | 12.28 | 495885 | 30.64 | 548 | 39184 | 108300 | 24.3 | 70.16 | 21.41 | 4.71 | 1.18 | 1.40 | 0.04 | 0.99 | 0.11 | 49761 | 18780 | 3.79 | 10.03 | 0 | 0 | 0 |
37 | 2002 | North Carolina | USA | 8049313 | 12.28 | 566782 | 30.64 | 548 | 39184 | 108300 | 24.3 | 70.16 | 21.41 | 4.71 | 1.18 | 1.40 | 0.04 | 0.99 | 0.11 | 88107 | 18880 | 3.33 | 15.55 | 0 | 0 | 0 |
37 | 2003 | North Carolina | USA | 8049313 | 12.28 | 618050 | 30.64 | 548 | 39184 | 108300 | 24.3 | 70.16 | 21.41 | 4.71 | 1.18 | 1.40 | 0.04 | 0.99 | 0.11 | 92789 | 19057 | 3.08 | 15.01 | 0 | 0 | 0 |
37 | 2004 | North Carolina | USA | 8049313 | 12.28 | 734415 | 30.64 | 548 | 39184 | 108300 | 24.3 | 70.16 | 21.41 | 4.71 | 1.18 | 1.40 | 0.04 | 0.99 | 0.11 | 114477 | 36944 | 5.03 | 15.59 | 0 | 0 | 0 |
37 | 2005 | North Carolina | USA | 9045705 | 11.06 | 1036534 | 31.95 | 702 | 45069 | 143700 | 29.1 | 67.45 | 20.94 | 7.00 | 1.08 | 1.90 | 0.04 | 1.39 | 0.20 | 190778 | 50800 | 4.90 | 18.41 | 0 | 0 | 0 |
37 | 2006 | North Carolina | USA | 9045705 | 11.06 | 1063201 | 31.95 | 702 | 45069 | 143700 | 29.1 | 67.45 | 20.94 | 7.00 | 1.08 | 1.90 | 0.04 | 1.39 | 0.20 | 177708 | 51731 | 4.87 | 16.71 | 0 | 0 | 0 |
37 | 2007 | North Carolina | USA | 9045705 | 11.06 | 1082018 | 31.95 | 702 | 45069 | 143700 | 29.1 | 67.45 | 20.94 | 7.00 | 1.08 | 1.90 | 0.04 | 1.39 | 0.20 | 156923 | 63019 | 5.82 | 14.50 | 0 | 0 | 0 |
37 | 2008 | North Carolina | USA | 9045705 | 11.06 | 1114318 | 31.95 | 702 | 45069 | 143700 | 29.1 | 67.45 | 20.94 | 7.00 | 1.08 | 1.90 | 0.04 | 1.39 | 0.20 | 166874 | 71269 | 6.40 | 14.98 | 0 | 0 | 0 |
37 | 2009 | North Carolina | USA | 9045705 | 11.06 | 1146099 | 31.95 | 702 | 45069 | 143700 | 29.1 | 67.45 | 20.94 | 7.00 | 1.08 | 1.90 | 0.04 | 1.39 | 0.20 | 160703 | 67767 | 5.91 | 14.02 | 0 | 0 | 0 |
37 | 2010 | North Carolina | USA | 9535483 | 12.41 | 1173438 | 33.30 | 759 | 46450 | 153600 | 30.3 | 65.27 | 21.18 | 8.39 | 1.14 | 2.17 | 0.06 | 1.63 | 0.16 | 167293 | 66669 | 5.68 | 14.26 | 0 | 0 | 0 |
37 | 2011 | North Carolina | USA | 9845333 | 12.79 | 1203775 | 34.87 | 797 | 46868 | 154900 | 30.3 | 64.24 | 21.17 | 8.84 | 1.10 | 2.46 | 0.05 | 1.94 | 0.20 | 165870 | 63278 | 5.26 | 13.78 | 0 | 0 | 0 |
37 | 2012 | North Carolina | USA | 9845333 | 12.79 | 1234112 | 34.87 | 797 | 46868 | 154900 | 30.3 | 64.24 | 21.17 | 8.84 | 1.10 | 2.46 | 0.05 | 1.94 | 0.20 | 150500 | 57668 | 4.67 | 12.20 | 0 | 0 | 0 |
37 | 2013 | North Carolina | USA | 9845333 | 12.79 | 1264451 | 34.87 | 797 | 46868 | 154900 | 30.3 | 64.24 | 21.17 | 8.84 | 1.10 | 2.46 | 0.05 | 1.94 | 0.20 | 156613 | 56020 | 4.43 | 12.39 | 0 | 0 | 0 |
37 | 2014 | North Carolina | USA | 9845333 | 12.79 | 1294790 | 34.87 | 797 | 46868 | 154900 | 30.3 | 64.24 | 21.17 | 8.84 | 1.10 | 2.46 | 0.05 | 1.94 | 0.20 | 156090 | 60365 | 4.66 | 12.06 | 0 | 0 | 0 |
37 | 2015 | North Carolina | USA | 9845333 | 12.79 | 1325128 | 34.87 | 797 | 46868 | 154900 | 30.3 | 64.24 | 21.17 | 8.84 | 1.10 | 2.46 | 0.05 | 1.94 | 0.20 | 148593 | 59940 | 4.52 | 11.21 | 0 | 0 | 0 |
37 | 2016 | North Carolina | USA | 9845333 | 12.79 | 1355467 | 34.87 | 797 | 46868 | 154900 | 30.3 | 64.24 | 21.17 | 8.84 | 1.10 | 2.46 | 0.05 | 1.94 | 0.20 | 147038 | 62539 | 4.61 | 10.85 | 0 | 0 | 0 |
38 | 2000 | North Dakota | USA | 642200 | 11.86 | 85853 | 33.39 | 412 | 34604 | 74400 | 22.3 | 91.74 | 0.59 | 1.21 | 4.79 | 0.56 | 0.03 | 1.04 | 0.04 | 408 | NA | NA | 0.48 | 0 | 0 | 0 |
38 | 2001 | North Dakota | USA | 642200 | 11.86 | 86993 | 33.39 | 412 | 34604 | 74400 | 22.3 | 91.74 | 0.59 | 1.21 | 4.79 | 0.56 | 0.03 | 1.04 | 0.04 | 456 | NA | NA | 0.52 | 0 | 0 | 0 |
38 | 2002 | North Dakota | USA | 642200 | 11.86 | 88132 | 33.39 | 412 | 34604 | 74400 | 22.3 | 91.74 | 0.59 | 1.21 | 4.79 | 0.56 | 0.03 | 1.04 | 0.04 | 515 | NA | NA | 0.58 | 0 | 0 | 0 |
38 | 2003 | North Dakota | USA | 642200 | 11.86 | 89272 | 33.39 | 412 | 34604 | 74400 | 22.3 | 91.74 | 0.59 | 1.21 | 4.79 | 0.56 | 0.03 | 1.04 | 0.04 | 474 | NA | NA | 0.53 | 0 | 0 | 0 |
38 | 2004 | North Dakota | USA | 642200 | 11.86 | 90411 | 33.39 | 412 | 34604 | 74400 | 22.3 | 91.74 | 0.59 | 1.21 | 4.79 | 0.56 | 0.03 | 1.04 | 0.04 | 522 | NA | NA | 0.58 | 0 | 0 | 0 |
38 | 2005 | North Dakota | USA | 639725 | 7.37 | 91551 | 34.32 | 535 | 45140 | 104300 | 25.3 | 89.62 | 0.91 | 1.97 | 5.25 | 0.82 | 0.05 | 1.34 | 0.05 | 503 | NA | NA | 0.55 | 0 | 0 | 0 |
38 | 2006 | North Dakota | USA | 639725 | 7.37 | 92691 | 34.32 | 535 | 45140 | 104300 | 25.3 | 89.62 | 0.91 | 1.97 | 5.25 | 0.82 | 0.05 | 1.34 | 0.05 | 560 | NA | NA | 0.60 | 0 | 0 | 0 |
38 | 2007 | North Dakota | USA | 639725 | 7.37 | 93830 | 34.32 | 535 | 45140 | 104300 | 25.3 | 89.62 | 0.91 | 1.97 | 5.25 | 0.82 | 0.05 | 1.34 | 0.05 | 501 | NA | NA | 0.53 | 0 | 0 | 0 |
38 | 2008 | North Dakota | USA | 639725 | 7.37 | 94970 | 34.32 | 535 | 45140 | 104300 | 25.3 | 89.62 | 0.91 | 1.97 | 5.25 | 0.82 | 0.05 | 1.34 | 0.05 | 473 | NA | NA | 0.50 | 0 | 0 | 0 |
38 | 2009 | North Dakota | USA | 639725 | 7.37 | 96109 | 34.32 | 535 | 45140 | 104300 | 25.3 | 89.62 | 0.91 | 1.97 | 5.25 | 0.82 | 0.05 | 1.34 | 0.05 | 495 | NA | NA | 0.52 | 0 | 0 | 0 |
38 | 2010 | North Dakota | USA | 672591 | 7.45 | 97249 | 34.58 | 610 | 51641 | 123900 | 24.9 | 88.91 | 1.15 | 2.00 | 5.29 | 1.02 | 0.04 | 1.54 | 0.05 | 533 | NA | NA | 0.55 | 0 | 0 | 0 |
38 | 2011 | North Dakota | USA | 721640 | 7.23 | 101400 | 35.86 | 709 | 57181 | 153800 | 24.9 | 87.05 | 1.60 | 2.85 | 5.20 | 1.22 | 0.04 | 2.00 | 0.05 | 618 | NA | NA | 0.61 | 0 | 0 | 0 |
38 | 2012 | North Dakota | USA | 721640 | 7.23 | 105552 | 35.86 | 709 | 57181 | 153800 | 24.9 | 87.05 | 1.60 | 2.85 | 5.20 | 1.22 | 0.04 | 2.00 | 0.05 | 680 | NA | NA | 0.64 | 0 | 0 | 0 |
38 | 2013 | North Dakota | USA | 721640 | 7.23 | 109703 | 35.86 | 709 | 57181 | 153800 | 24.9 | 87.05 | 1.60 | 2.85 | 5.20 | 1.22 | 0.04 | 2.00 | 0.05 | 790 | NA | NA | 0.72 | 0 | 0 | 0 |
38 | 2014 | North Dakota | USA | 721640 | 7.23 | 113854 | 35.86 | 709 | 57181 | 153800 | 24.9 | 87.05 | 1.60 | 2.85 | 5.20 | 1.22 | 0.04 | 2.00 | 0.05 | 1054 | NA | NA | 0.93 | 0 | 0 | 0 |
38 | 2015 | North Dakota | USA | 721640 | 7.23 | 118006 | 35.86 | 709 | 57181 | 153800 | 24.9 | 87.05 | 1.60 | 2.85 | 5.20 | 1.22 | 0.04 | 2.00 | 0.05 | 1187 | NA | NA | 1.01 | 0 | 0 | 0 |
38 | 2016 | North Dakota | USA | 721640 | 7.23 | 122157 | 35.86 | 709 | 57181 | 153800 | 24.9 | 87.05 | 1.60 | 2.85 | 5.20 | 1.22 | 0.04 | 2.00 | 0.05 | 1339 | NA | NA | 1.10 | 0 | 0 | 0 |
The pattern
argument accepts regular expressions, which allow for more complex pattern matching. (For example, maybe you are looking for a string that only has letters in it, and no numbers.) In practice, this means that commonly used characters like the dollar sign ($
) have a special meaning in those patterns (in this case, $
indicates the match should occur at the end of the string, which is not especially useful in our context). That can be a source of frustration if you’re just looking for a dollar value in a string. To get around that, enclose the pattern you enter within the coll()
function like so: str_starts(amount, coll("$300"))
. This would bypass any regular expression matching, and simply interpret your text literally (as a human would). Put another way, that example would capture an amount like “$300,000” (assuming it is being stored as a string in your data frame) because it started with “$300”.
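Here’s a small sketch of that advice using a made-up vector of amounts stored as strings; wrapping the pattern in coll() makes the dollar sign a literal character rather than a regular expression instruction:
str_starts(c("$300,000", "$45", "300 dollars"), coll("$300"))
## [1]  TRUE FALSE FALSE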
So, let’s return to our original question: Which of the Virginias had the highest average eviction rate between 2013 and 2016?
Here’s what I’d need to do to answer that question:
Call the data from my data frame.
Apply a filter to cover the observations ending with my desired string and matching the years within my desired range.
Group the data based on the grouping variable of interest.
Calculate the summary statistic for the mean of the desired variable.
We can translate that into code like so:
us_states %>%
filter(str_ends(name, "Virginia") & year >= 2013 & year <= 2016) %>%
group_by(name) %>%
summarize(mean_eviction_rate=mean(`eviction-rate`))
name | mean_eviction_rate |
---|---|
Virginia | 5.3425 |
West Virginia | 3.6525 |
As we can see here, the state of Virginia had the higher eviction rate during that period of time.
What if we want to just filter observations based on some text that is present in the middle of a string? Here, we can use stringr::str_detect()
.
str_detect()
allows you to match text based on patterns, or regular expressions. These patterns can get very complicated (but are quite powerful), and you can read more about them here.
For the sake of simplicity, we won’t cover patterns here. Instead, we’ll just use str_detect()
to look for entries that include a specified string anywhere within a given variable. For example, we can filter in any state with “y” in its name and count the number of unique states like so:
us_states %>%
filter(str_detect(name, "y")) %>%
count(name)
name | n |
---|---|
Kentucky | 17 |
Maryland | 17 |
New Jersey | 17 |
Pennsylvania | 17 |
Wyoming | 17 |
Be careful when using str_detect(). First, the function is case-sensitive by default, so searching for "Y" will give you different results than "y" (i.e., New York is covered by "Y" but not "y"). Second, the function treats some punctuation and symbols as pattern-matching instructions. For example, "^North" tells it to match only when "North" appears at the start of the string, just like str_starts(). Any special characters need to be escaped, or you can use the coll()
function as described above. You can read about that in the documentation for the function.
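Here’s a quick, hand-typed illustration of that case sensitivity:
str_detect(c("New York", "Kentucky"), "y")
## [1] FALSE  TRUE
str_detect(c("New York", "Kentucky"), "Y")
## [1]  TRUE FALSE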
The third question I’d like to ask my data is: What were the eviction rates in Massachusetts between 2010 and 2016, expressed on a per 100,000 people basis?
We already have data about eviction rates in the eviction-rate
variable. However, those data are on a per 100 renter homes basis. In order to address that question, I would need to calculate my own rate. Thankfully, I have the variables necessary to do that (evictions
and population
).
To compute a new variable, we can use the dplyr::mutate()
function. mutate()
operates just like summarize()
in the sense that you give it a variable name and provide the operation necessary to calculate it. However, the key difference is that the variables you create are simply appended to the existing data frame—they do not reduce the data frame into a set of summary statistics.
A rate like the one we are interested in can be easily computed through the following calculation: X / Y
(where X refers to the number of evictions and Y to the population). To get the per 100,000 people basis, we would multiply that result by 100,000.
So, here’s what I’d need to do to answer that third question:
Call the data from my data frame.
Apply a filter to cover the desired state and the timeframe.
Create a new variable (we’ll call it eviction_rate_pop
) based on our rate calculation.
Optional: Select only the variables of interest to me, so I can more quickly see the answer.
And here is how I can translate that into code:
us_states %>%
filter(name == "Massachusetts" & between(year, 2010, 2016)) %>%
mutate(eviction_rate_pop = (evictions/population)*100000) %>%
select(year, name, population, evictions, eviction_rate_pop)
year | name | population | evictions | eviction_rate_pop |
---|---|---|---|---|
2010 | Massachusetts | 6547629 | 18803 | 287.1727 |
2011 | Massachusetts | 6705586 | 20452 | 304.9994 |
2012 | Massachusetts | 6705586 | 19648 | 293.0094 |
2013 | Massachusetts | 6705586 | 18131 | 270.3865 |
2014 | Massachusetts | 6705586 | 18289 | 272.7428 |
2015 | Massachusetts | 6705586 | 17986 | 268.2241 |
2016 | Massachusetts | 6705586 | 15708 | 234.2525 |
Here, we used the dplyr::between() function to replicate the dual conditions of year >= 2010 and year <= 2016; the two approaches are functionally equivalent.
Looking at this output, I can see that there were 270 evictions for every 100,000 people who lived in Massachusetts in 2013. (And, for other years between 2010 and 2016, too!)
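As a quick sanity check, we can reproduce the 2013 figure by hand with the evictions and population values shown in the output above:
(18131 / 6705586) * 100000
## [1] 270.3865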
One thing to be mindful of is that the population statistics in our dataset change pretty infrequently. For example, we have the same population
value for all years from 2011 through 2016. That’s probably because the Eviction Lab team is relying on the most comprehensive counts from the decennial U.S. Census.
However, the U.S. Census Bureau also offers less precise, but more frequently updated, data from its annual American Community Survey. If we wanted more recent population counts, we would thus need to merge the Eviction Lab data with data from the American Community Survey. This is possible, but there may be a good reason for keeping the less-recent counts, and we should therefore see if the methodological details explain that decision.
A second thing to be mindful of is that deciding the basis for a rate (e.g., 1,000 vs. 100,000) is more of an art than a science. It requires an understanding of the context and the base rates that your audiences are most familiar with. For example, we tend to see rates for states and cities expressed on a per 100,000 people basis, and it is thus an already familiar anchor for most folks. As such, that’s often a helpful basis to use. But that may not be the case in all contexts.
Alternatively, the Eviction Lab likely provides its rates on a per-100 renters basis because it is easy to also present that as a percentage (e.g., 2.1 per 100 can be translated to 2.1%). In fact, they present that rate as a percentage on the visuals on their website.
The key is to use a basis (‘per’ number) that you believe would be most informative to your audience—and not the one that makes a statistic seem most or least alarming.
A fourth question I would like to ask my data is: How much higher (as a percentage) was Massachusetts’ number of evictions in 2016, relative to Florida’s?
As a reminder, we can calculate a percent change as follows: (Y2-Y1)/abs(Y1)*100
. (The abs()
function allows us to get the absolute value of a number.)
Note that there is a difference between percent change and percent difference. People are typically more familiar with percent change (% increase or % decrease), so you’ll typically see journalists use that calculation. Additionally, keep in mind that it matters which entity (e.g., Massachusetts or Florida) you assign to Y2
. Typically, the value of greatest interest (or the newest value) should be Y2
. When in doubt, double-check your calculations against an online calculator, like the ones linked to here.
I could apply that formula as part of a two-step operation where I manually look up Massachusetts’ number (Y2) of evictions as well as Florida’s (Y1), and then (1) subtract Florida’s number from Massachusetts’, (2) divide the result by the absolute value of Florida’s number, and (3) multiply that by 100.
However, we can also do this in a single operation by calculating a summary statistic using that equation. In order to get the data necessary for that operation, we can use the subscripting approach that is native to R.
R allows us to pull out specific information from its outputs, which can be handy when you want to perform calculations using filtered or mutated data. We have already performed some calculations with different dplyr
functions, but this is a good opportunity to demonstrate an alternative that is sometimes necessary for functions that don’t play nicely with tidyverse
or to do some things we want within tidyverse
.
First, let’s try getting information from a vector. As a reminder, we can call information from a variable in a data frame with the object$variable
notation, such as when we run:
head(us_states$evictions)
## [1] 4583 10106 9603 7379 6170 8300
Let’s say we wanted to get just the first value from that vector (4583
). Here’s where a subscript would be useful. Subscripts allow you to call specific elements from an object by specifying their position in brackets. As such, us_states$evictions[1]
would produce only the first element from that vector.
us_states$evictions[1]
## [1] 4583
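We can also pull out several elements at once by putting a vector of positions inside the brackets. For example, the first three values:
us_states$evictions[1:3]
## [1]  4583 10106  9603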
If your object is a data frame, which has observations (rows) and variables (columns), we subscript using matrix notation [(observation), (variable)]
(or [(row), (column)]); note the comma in both. For example, if I wanted to get the value in the first row and third column of our us_states
data frame, I’d write:
us_states[1,3]
name |
---|
Alabama |
I could also provide just one of those values (either observation or variable), but keep the comma, in order to get the information from my desired observation or variable. For example, here’s the entire first row in our dataset.
us_states[1,]
GEOID | year | name | parent-location | population | poverty-rate | renter-occupied-households | pct-renter-occupied | median-gross-rent | median-household-income | median-property-value | rent-burden | pct-white | pct-af-am | pct-hispanic | pct-am-ind | pct-asian | pct-nh-pi | pct-multiple | pct-other | eviction-filings | evictions | eviction-rate | eviction-filing-rate | low-flag | imputed | subbed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | 2000 | Alabama | USA | 4447100 | 16.1 | 255910 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 7585 | 4583 | 1.79 | 2.96 | 0 | 0 | 0 |
Within the tidyverse
, we could use slice(us_states, 1)
to replicate us_states[1,]
and select(us_states, 3)
to replicate us_states[,3]
.
Notably, we can also use subscripting to filter information based on a condition (like we do with the filter()
function). For example, if I just wanted rows that had the name
variable be equivalent to “Wyoming”, I could write:
us_states[us_states$name == "Wyoming",]
GEOID | year | name | parent-location | population | poverty-rate | renter-occupied-households | pct-renter-occupied | median-gross-rent | median-household-income | median-property-value | rent-burden | pct-white | pct-af-am | pct-hispanic | pct-am-ind | pct-asian | pct-nh-pi | pct-multiple | pct-other | eviction-filings | evictions | eviction-rate | eviction-filing-rate | low-flag | imputed | subbed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
56 | 2000 | Wyoming | USA | 493782 | 11.42 | 28647 | 30.01 | 437 | 37892 | 96600 | 22.5 | 88.86 | 0.71 | 6.41 | 2.07 | 0.54 | 0.05 | 1.25 | 0.10 | 32 | 31 | 0.11 | 0.11 | 1 | 0 | 0 |
56 | 2001 | Wyoming | USA | 493782 | 11.42 | 29310 | 30.01 | 437 | 37892 | 96600 | 22.5 | 88.86 | 0.71 | 6.41 | 2.07 | 0.54 | 0.05 | 1.25 | 0.10 | 31 | 26 | 0.09 | 0.11 | 1 | 0 | 0 |
56 | 2002 | Wyoming | USA | 493782 | 11.42 | 38257 | 30.01 | 437 | 37892 | 96600 | 22.5 | 88.86 | 0.71 | 6.41 | 2.07 | 0.54 | 0.05 | 1.25 | 0.10 | 152 | 72 | 0.19 | 0.40 | 1 | 0 | 0 |
56 | 2003 | Wyoming | USA | 493782 | 11.42 | 49547 | 30.01 | 437 | 37892 | 96600 | 22.5 | 88.86 | 0.71 | 6.41 | 2.07 | 0.54 | 0.05 | 1.25 | 0.10 | 701 | 368 | 0.74 | 1.41 | 1 | 0 | 0 |
56 | 2004 | Wyoming | USA | 493782 | 11.42 | 50530 | 30.01 | 437 | 37892 | 96600 | 22.5 | 88.86 | 0.71 | 6.41 | 2.07 | 0.54 | 0.05 | 1.25 | 0.10 | 929 | 496 | 0.98 | 1.84 | 1 | 0 | 0 |
56 | 2005 | Wyoming | USA | 523949 | 6.08 | 51513 | 30.10 | 637 | 51990 | 163400 | 24.2 | 86.87 | 0.88 | 7.59 | 2.03 | 0.72 | 0.03 | 1.77 | 0.10 | 991 | 539 | 1.05 | 1.92 | 1 | 0 | 0 |
56 | 2006 | Wyoming | USA | 523949 | 6.08 | 52496 | 30.10 | 637 | 51990 | 163400 | 24.2 | 86.87 | 0.88 | 7.59 | 2.03 | 0.72 | 0.03 | 1.77 | 0.10 | 872 | 487 | 0.93 | 1.66 | 1 | 0 | 0 |
56 | 2007 | Wyoming | USA | 523949 | 6.08 | 53478 | 30.10 | 637 | 51990 | 163400 | 24.2 | 86.87 | 0.88 | 7.59 | 2.03 | 0.72 | 0.03 | 1.77 | 0.10 | 866 | 426 | 0.80 | 1.62 | 1 | 0 | 0 |
56 | 2008 | Wyoming | USA | 523949 | 6.08 | 54460 | 30.10 | 637 | 51990 | 163400 | 24.2 | 86.87 | 0.88 | 7.59 | 2.03 | 0.72 | 0.03 | 1.77 | 0.10 | 695 | 367 | 0.67 | 1.28 | 1 | 0 | 0 |
56 | 2009 | Wyoming | USA | 523949 | 6.08 | 55443 | 30.10 | 637 | 51990 | 163400 | 24.2 | 86.87 | 0.88 | 7.59 | 2.03 | 0.72 | 0.03 | 1.77 | 0.10 | 430 | 415 | 0.75 | 0.78 | 1 | 0 | 0 |
56 | 2010 | Wyoming | USA | 563626 | 7.21 | 44382 | 30.77 | 733 | 56573 | 184400 | 25.7 | 85.85 | 0.77 | 8.91 | 2.09 | 0.76 | 0.06 | 1.47 | 0.08 | 107 | 106 | 0.24 | 0.24 | 1 | 0 | 0 |
56 | 2011 | Wyoming | USA | 579679 | 7.69 | 37604 | 30.94 | 789 | 58840 | 194800 | 25.6 | 84.47 | 1.03 | 9.61 | 1.90 | 0.89 | 0.07 | 2.01 | 0.02 | 94 | 92 | 0.24 | 0.25 | 1 | 0 | 0 |
56 | 2012 | Wyoming | USA | 579679 | 7.69 | 38682 | 30.94 | 789 | 58840 | 194800 | 25.6 | 84.47 | 1.03 | 9.61 | 1.90 | 0.89 | 0.07 | 2.01 | 0.02 | 67 | 65 | 0.17 | 0.17 | 1 | 0 | 0 |
56 | 2013 | Wyoming | USA | 579679 | 7.69 | 39761 | 30.94 | 789 | 58840 | 194800 | 25.6 | 84.47 | 1.03 | 9.61 | 1.90 | 0.89 | 0.07 | 2.01 | 0.02 | 87 | 87 | 0.22 | 0.22 | 1 | 0 | 0 |
56 | 2014 | Wyoming | USA | 579679 | 7.69 | 40840 | 30.94 | 789 | 58840 | 194800 | 25.6 | 84.47 | 1.03 | 9.61 | 1.90 | 0.89 | 0.07 | 2.01 | 0.02 | 94 | 94 | 0.23 | 0.23 | 1 | 0 | 0 |
56 | 2015 | Wyoming | USA | 579679 | 7.69 | 41918 | 30.94 | 789 | 58840 | 194800 | 25.6 | 84.47 | 1.03 | 9.61 | 1.90 | 0.89 | 0.07 | 2.01 | 0.02 | 243 | 241 | 0.57 | 0.58 | 1 | 0 | 0 |
56 | 2016 | Wyoming | USA | 579679 | 7.69 | 42997 | 30.94 | 789 | 58840 | 194800 | 25.6 | 84.47 | 1.03 | 9.61 | 1.90 | 0.89 | 0.07 | 2.01 | 0.02 | 382 | 377 | 0.88 | 0.89 | 1 | 0 | 0 |
This notation requires us to specify things as follows: object[object$variable comparison value, ] (as in us_states[us_states$name == "Wyoming", ] above).
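We can also combine multiple conditions inside the brackets, or pair a conditional subscript with the $ notation. For example, here is Wyoming’s 2016 evictions count (which matches the table above) pulled straight from that vector:
us_states$evictions[us_states$name == "Wyoming" & us_states$year == 2016]
## [1] 377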
Let’s return to our fourth question: How much higher (as a percentage) was Massachusetts’ number of evictions in 2016, relative to Florida’s?
Here’s how I can approach the question:
Call the data from my data frame.
Apply a filter to cover the desired timeframe.
Optional: Select the relevant columns to make it easier to read the output.
Summarize the data by creating a new variable called pct_chg
that uses subscripting to (1) subtract Florida’s number of evictions from Massachusetts’, (2) divide the result by the absolute value of Florida’s number, and (3) multiply that by 100.
Here’s what that logic looks like as code:
us_states %>%
filter(year==2016) %>%
select(year, name, evictions) %>%
summarize(pct_chg=((evictions[name=="Massachusetts"]-evictions[name=="Florida"])/abs(evictions[name=="Florida"]))*100)
pct_chg |
---|
-78.06605 |
Because we are piping the information, we do not need to prefix each variable with the name of the data frame object (as we would with the us_states$evictions notation); the pipe supplies the data frame from the preceding step, so we can refer to the columns directly inside summarize().
As we can see, the number of evictions in Massachusetts was 78.1% lower than the number in Florida in 2016.
Of course, we need to keep in mind that Florida’s population is much larger than Massachusetts’. Thus, it would usually make the most sense to calculate the difference in their rates for a between-states comparison. If we have rates that are already expressed as percentages, it also makes the most sense to calculate the difference in percentage points (i.e., noting that 30% is a 20 percentage point increase from 10%) rather than trying to calculate a percent difference between percents (i.e., noting that 30% is a 200% increase from 10%).
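Here’s that distinction expressed as quick arithmetic, using the 10% and 30% values from the example above:
30 - 10 # difference in percentage points
## [1] 20
((30 - 10) / abs(10)) * 100 # percent change between the two percentages
## [1] 200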
The fifth question I would like to ask my data is: What was the three-year percent change in eviction rate for each year between 2013 and 2016 in Massachusetts and Florida?
To answer this question, the first thing I would need to do is filter out all the states but Florida and Massachusetts. Then, I’d want to group the data by state so I can calculate the percent changes relative to the previous years for that state (and not other states). Then, I would want to order the data sequentially (to make sure the previous year is always the preceding row). Finally, I’d need to tell R to perform our percent change calculation using information from the current row as Y2
and the information from three rows back as Y1
.
To perform that last step, we’ll use the dplyr::lag()
function.
lag()
allows us to look up a value from n
elements (or rows in the case of a data frame) back in object x
with the syntax, lag(x, n)
. For example, let’s consider the following code:
lag(c(1, 5, 7, 10, 13), 3)
## [1] NA NA NA 1 5
What we are getting is NA
for the first three elements (because there were no values that far back, relative to those elements). The fourth value in our lagged vector was the first in our original vector, and the fifth lagged value was the second original value.
Let’s go back to our original question: What was the three-year percent change in eviction rate for each year between 2013 and 2016 in Massachusetts and Florida?
Here’s what I’d need to do to answer that question:
Call the data from my data frame.
Apply a filter to cover just the desired states.
Optional: Select only the variables of interest to me, so I can more quickly see the answer and increase computational efficiency.
Group the data based on my grouping variable of interest.
Sort the data by year
(and add the .by_group=TRUE
argument to the sorting function so that we’re only sorting in-group and thus keeping our state observations in sequential order).
Compute a new variable (we’ll call it evictions_pct_chg
) using the percent change formula we’re already familiar with (((Y2-Y1)/Y1)*100
), using the present value as Y2
and the value three years earlier as Y1
.
Optional: Apply a second filter that only includes values within the desired timeframe. (We do this at the end since we need data from the earlier years to compute our new evictions_pct_chg
variable.)
Here’s what that logic looks like as code:
us_states %>%
filter(name %in% c("Massachusetts", "Florida")) %>%
select(year, name, `eviction-rate`) %>%
group_by(name) %>%
arrange(year, .by_group=TRUE) %>%
mutate(evictions_pct_chg=((`eviction-rate`-lag(`eviction-rate`, 3))/lag(`eviction-rate`, 3))*100) %>%
filter(between(year, 2013, 2016))
year | name | eviction-rate | evictions_pct_chg |
---|---|---|---|
2013 | Florida | 2.55 | -17.2077922 |
2014 | Florida | 2.48 | -21.7665615 |
2015 | Florida | 2.53 | -15.6666667 |
2016 | Florida | 2.53 | -0.7843137 |
2013 | Massachusetts | 1.86 | -7.0000000 |
2014 | Massachusetts | 1.81 | -15.8139535 |
2015 | Massachusetts | 1.76 | -13.7254902 |
2016 | Massachusetts | 1.52 | -18.2795699 |
As we can see, Florida saw the larger three-year decline in its eviction rate in 2013, 2014, and 2015, while Massachusetts saw the larger decline in 2016.
Data journalists are often interested in looking at relationships between variables.
Scatterplots are generally best for spotting potential relationships, but it is sometimes helpful to also quickly calculate a linear correlation using Pearson’s correlation coefficient to gauge the strength of the linear relationship. (Important note: Relationships can be curvilinear in nature. How to evaluate that is better covered in a formal statistics course.)
To calculate a simple, linear correlation, we just use the base::cor()
function, which is loaded automatically in R. With cor()
, we just specify the two variables of interest as separate arguments (e.g., cor(dataset$variable1, dataset$variable2)
).
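For example, here is the correlation between population size and rent burden computed directly with the $ notation (the same value appears in the summarize() output below):
cor(us_states$population, us_states$`rent-burden`)
## [1] 0.2968485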
We could also calculate multiple correlations at the same time by incorporating cor()
into our summarize()
function to calculate the linear correlations between (1) population size and the rent-burden and (2) population size and the eviction rate:
us_states %>%
summarize(cor_pop_burden=cor(population, `rent-burden`), cor_pop_evicrate=cor(population, `eviction-rate`, use="pairwise.complete.obs"))
cor_pop_burden | cor_pop_evicrate |
---|---|
0.2968485 | 0.0279709 |
Because some of our observations had an NA
value for eviction-rate
, we need to tell cor()
to only evaluate observations that have non-missing data for those two variables via the use="pairwise.complete.obs"
argument. If we omit that argument, we would just get NA
as the value for the correlation.
What the above output tells us is that, when looking at every observation in our data frame, there is a positive (the r coefficient has a positive value) but weak (the r coefficient is below +/-0.3) relationship between population size and the rent burden. The correlation between population size and the eviction rate is also positive but far weaker, with an r coefficient close to zero. In other words, states with larger populations tend to have somewhat higher rent burdens, while population size tells us very little about the eviction rate.
The nice thing about using our multi-step workflow (instead of just running cor()
in isolation) is that it allows us to apply filters that can help narrow our inquiry (correlation in recent years vs. many years ago) or to calculate the correlation for computed (mutated) variables.
As a reminder, there is a big difference between correlation and causation.
Even if we had found a strong correlation between population size and rent burden, the causal variable might be very different. (Put differently, it’s not the size of the population that causes a place to have a high rent burden but some other variable that is common in cities that have larger populations.)
And, again, it is crucial to stress here that we are only measuring a linear correlation (using Pearson’s r) right now. A strong relationship might exist but simply operate in a non-linear way (e.g., increase Y by a low magnitude at the low end of X but then increase Y at increasingly high magnitudes as you move up the values of X). Thus, be careful with how you interpret such statistics.
It is not uncommon for a dataset to include typos (e.g., Massachusetts
being expressed as Masachussets
) or to have variations in capitalization (e.g., Massachusetts
and massachusetts
). To ensure that those values are grouped correctly, we would want to give them a consistent label.
One easy way to do that is to use the dplyr::recode()
function to recode some of our values. With the recode()
function, we first specify the variable we want to inspect and the values we’d like to recode in the following format: OriginalValue = ReplacedValue
. Within a piped operation, we will typically pair it with the mutate()
function (thereby making recode()
the operation used to compute the new variable).
This dataset is in good shape and we do not need to recode any of the information. Thus, for demonstrative purposes, I would like to recode Massachusetts
to The Best Commonwealth
and Kentucky
as The Worst Commonwealth
). (Can you guess the other Commonwealths in the U.S.?)
Here’s how I could do that:
us_states %>%
filter(name %in% c("Kentucky", "Massachusetts"), year == 2016) %>%
mutate(new_name = recode(name, "Massachusetts" = "The Best Commonwealth", "Kentucky" = "The Worst Commonwealth")) %>%
select(year, name, new_name)
year | name | new_name |
---|---|---|
2016 | Kentucky | The Worst Commonwealth |
2016 | Massachusetts | The Best Commonwealth |
There are two important things to note here:
First, I could replace the values in the original variable (name) by using mutate() to compute a variable called name
(rather than new_name
). I often like to create new variables when I make changes, but you do not have to.
Second, by default, recode()
will keep the original value for non-matching values (e.g., Alabama
stays as Alabama
). We can also tell recode()
to replace non-matching values with a specific string (or operation) by appending the .default="REPLACEMENT STRING"
argument. We can do the same for missing values (NA
) by appending the .missing="REPLACEMENT STRING"
argument.
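Here’s a small, hand-typed sketch of those two arguments in action (the replacement labels are made up):
recode(c("Massachusetts", "Alabama", NA), "Massachusetts" = "MA", .default = "Other", .missing = "Unknown")
## [1] "MA"      "Other"   "Unknown"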
In instances where we want to standardize the case (e.g., make everything lower- or upper-case), we can also very simply use the tolower()
and toupper()
functions within our mutate()
operation.
The tolower()
and toupper()
functions only require a single argument: the vector to apply the case change to.
For example, we can easily make all of our state names lower-case like this:
us_states %>%
head() %>%
mutate(name=tolower(name))
GEOID | year | name | parent-location | population | poverty-rate | renter-occupied-households | pct-renter-occupied | median-gross-rent | median-household-income | median-property-value | rent-burden | pct-white | pct-af-am | pct-hispanic | pct-am-ind | pct-asian | pct-nh-pi | pct-multiple | pct-other | eviction-filings | evictions | eviction-rate | eviction-filing-rate | low-flag | imputed | subbed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | 2000 | alabama | USA | 4447100 | 16.10 | 255910 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 7585 | 4583 | 1.79 | 2.96 | 0 | 0 | 0 |
01 | 2001 | alabama | USA | 4447100 | 16.10 | 486061 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 20828 | 10106 | 2.08 | 4.29 | 0 | 0 | 0 |
01 | 2002 | alabama | USA | 4447100 | 16.10 | 495329 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 21070 | 9603 | 1.94 | 4.25 | 0 | 0 | 0 |
01 | 2003 | alabama | USA | 4447100 | 16.10 | 456138 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 15750 | 7379 | 1.62 | 3.45 | 0 | 0 | 0 |
01 | 2004 | alabama | USA | 4447100 | 16.10 | 446058 | 27.54 | 447 | 34135 | 85100 | 24.8 | 70.29 | 25.86 | 1.71 | 0.49 | 0.7 | 0.02 | 0.88 | 0.06 | 13508 | 6170 | 1.38 | 3.03 | 0 | 0 | 0 |
01 | 2005 | alabama | USA | 4633360 | 12.86 | 449922 | 29.24 | 621 | 41216 | 111900 | 29.3 | 68.50 | 26.01 | 2.81 | 0.46 | 1.0 | 0.03 | 1.10 | 0.08 | 15426 | 8300 | 1.84 | 3.43 | 0 | 0 | 0 |