Intermediate Exploratory Data Analysis in R

Introduction

In this tutorial, we will be applying some of the tidyverse functions to an initial exploratory data analysis of state-level data from the Eviction Lab. Specifically, we’ll be learning how to:

  • Use the %in% operator.

  • Perform partial string matching.

  • Compute your own variables, including per capita rates and percent changes.

  • Use the subscript notation to pull out specific information.

  • Perform lagged calculations like year-over-year changes.

  • Calculate the linear correlation between two variables.

  • Recode values.

The Dataset

The Eviction Lab has collected data from 80 million records around the country, and you can read about the methodology here. As always, it is helpful to review the data dictionary to understand what each variable name corresponds to. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.

Pre-Analysis Steps

Load the data

The first step, of course, is to read in the data. Like before, we’ll use the readr::read_csv() function:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
us_states <- read_csv("https://dds.rodrigozamith.com/files/evictions_us_states.csv")
## Rows: 867 Columns: 27

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): GEOID, name, parent-location
## dbl (24): year, population, poverty-rate, renter-occupied-households, pct-re...

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Confirm it imported correctly

Normally, we would make sure the data imported correctly. Because we’ve previously evaluated data from this source, however, we can just take a quick look at the first few observations to confirm the import went as expected.

head(us_states)
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
01 2000 Alabama USA 4447100 16.10 255910 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 7585 4583 1.79 2.96 0 0 0
01 2001 Alabama USA 4447100 16.10 486061 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 20828 10106 2.08 4.29 0 0 0
01 2002 Alabama USA 4447100 16.10 495329 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 21070 9603 1.94 4.25 0 0 0
01 2003 Alabama USA 4447100 16.10 456138 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 15750 7379 1.62 3.45 0 0 0
01 2004 Alabama USA 4447100 16.10 446058 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 13508 6170 1.38 3.03 0 0 0
01 2005 Alabama USA 4633360 12.86 449922 29.24 621 41216 111900 29.3 68.50 26.01 2.81 0.46 1.0 0.03 1.10 0.08 15426 8300 1.84 3.43 0 0 0

Just like before, it appears we imported the data correctly. Each observation represents a different state (name) at a different point in time (year).

Filtering Strings

Using the %in% operator

While most of the data crunching we do as data journalists involves numbers, we sometimes have to work with strings, too.

So, I’d like to begin this more advanced data analysis by asking my data: Which New England state had the highest eviction rate in 2016?

Our dataset does not have a variable covering the region represented by the state. Instead, we’ll have to create our own filter that only includes those states. We can turn to the U.S. Census Bureau-designated regions and divisions for help in determining which states are part of New England.

Rather than creating a filter with a bunch of different OR conditions, I can just use the %in% operator and pair it with a vector of values that I would like to include. Put another way, I can use the filter() function and supply the conditional as variable %in% c("vector", "with", "multiple", "values").
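To see what %in% does on its own, consider this minimal sketch (the vectors are made up for illustration). The operator checks each element of the left-hand vector against the right-hand vector and returns a logical vector, which is exactly what filter() needs:

c("Maine", "Texas", "Vermont") %in% c("Connecticut", "Maine", "Vermont")
## [1]  TRUE FALSE  TRUE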

With this information in mind, here’s what I’d need to do to answer that question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the desired year and the desired states.

  3. Sort the data in descending order based on the variable of interest.

  4. Optional: Select only the variables of interest to me, so I can more quickly see the answer.

Now, let’s translate that logic into code:

us_states %>%
  filter(name %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") & year == 2016) %>%
  arrange(desc(`eviction-rate`)) %>%
  select(year, name, `eviction-rate`)
year name eviction-rate
2016 Rhode Island 3.07
2016 Connecticut 3.04
2016 Maine 2.26
2016 New Hampshire 1.70
2016 Massachusetts 1.52
2016 Vermont 0.09

As we can see, Rhode Island was the New England state with the highest eviction rate in 2016.

Using str_starts() and str_ends()

Similarly, sometimes we want to create a filter for our data so we only include observations that begin with or end with certain text.

In this case, I’d like to ask my data: Which of the Virginias had the highest average eviction rate between 2013 and 2016?

One approach would be to create a filter with just those two state names. However, I can also use the stringr::str_ends() function to perform a search that looks for partial matches by evaluating the end of the string.

stringr is automatically loaded by the tidyverse package.

The arguments I need to provide to the str_ends() function are (1) the variable I’m evaluating (the string argument, which comes first) and (2) the pattern or value that must be present at the end for it to yield a TRUE response (the pattern argument, which comes second). As a reminder, the TRUE responses are the ones included in the results when we apply a filter with the filter() function.
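To make that concrete, here is a quick sketch of str_ends() applied to a small, made-up vector of names:

str_ends(c("Virginia", "West Virginia", "Vermont"), "Virginia")
## [1]  TRUE  TRUE FALSE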

So, let’s try to create a filter (filter()) that only includes observations where the name variable ends (str_ends()) with the string "Virginia":

us_states %>%
  filter(str_ends(name, "Virginia"))
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
51 2000 Virginia USA 7078515 9.59 664083 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 77522 40945 6.17 11.67 0 0 0
51 2001 Virginia USA 7078515 9.59 734527 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 83498 44970 6.12 11.37 0 0 0
51 2002 Virginia USA 7078515 9.59 761196 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 123429 49799 6.54 16.22 0 0 0
51 2003 Virginia USA 7078515 9.59 885087 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 127023 52790 5.96 14.35 0 0 0
51 2004 Virginia USA 7078515 9.59 900780 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 153631 55869 6.20 17.06 0 0 0
51 2005 Virginia USA 7721730 7.16 914428 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 156448 51745 5.66 17.11 0 0 0
51 2006 Virginia USA 7721730 7.16 928075 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 158878 50920 5.49 17.12 0 0 0
51 2007 Virginia USA 7721730 7.16 613794 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 66538 37676 6.14 10.84 0 0 0
51 2008 Virginia USA 7721730 7.16 623822 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 65233 42797 6.86 10.46 0 0 0
51 2009 Virginia USA 7721730 7.16 629765 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 62974 38357 6.09 10.00 0 0 0
51 2010 Virginia USA 8001024 7.79 685038 32.75 1060 63636 249700 29.9 64.82 19.04 7.90 0.26 5.45 0.06 2.27 0.19 94057 45125 6.59 13.73 0 0 0
51 2011 Virginia USA 8256630 8.22 866175 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 144511 50013 5.77 16.68 0 0 0
51 2012 Virginia USA 8256630 8.22 881194 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 138282 48198 5.47 15.69 0 0 0
51 2013 Virginia USA 8256630 8.22 906176 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 151841 51134 5.64 16.76 0 0 0
51 2014 Virginia USA 8256630 8.22 917828 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 147094 51101 5.57 16.03 0 0 0
51 2015 Virginia USA 8256630 8.22 996441 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 144220 50213 5.04 14.47 0 0 0
51 2016 Virginia USA 8256630 8.22 1012175 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 146534 51821 5.12 14.48 0 0 0

We are using str_ends() (which comes with the stringr package) and not ends_with() (which comes with the sister dplyr package). dplyr::ends_with() is a separate function used to select multiple variables from a data frame based on patterns in the variable names, and can be useful when you have a large dataset with hundreds of variables and want a sub-group of them. stringr::str_ends() evaluates a given string or vector to see if it matches the condition, yielding a TRUE or FALSE response that can serve as the basis of a filter.

Unlike prior instances of dplyr::filter() where we specified a variable, an evaluation operator, and the condition (e.g., name == "Virginia"), the str_ends() function expects a different syntax. This is common in R, especially when you are using functions from different packages. Thus, be sure to read each function’s documentation to understand what it’s looking for.

We can similarly look at the start of a string by using the str_starts() function, which also comes with stringr. For example, let’s include only the states (name) that begin with the string "North":

us_states %>%
  filter(str_starts(name, "North"))
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
37 2000 North Carolina USA 8049313 12.28 393465 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 26050 22292 5.67 6.62 0 0 0
37 2001 North Carolina USA 8049313 12.28 495885 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 49761 18780 3.79 10.03 0 0 0
37 2002 North Carolina USA 8049313 12.28 566782 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 88107 18880 3.33 15.55 0 0 0
37 2003 North Carolina USA 8049313 12.28 618050 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 92789 19057 3.08 15.01 0 0 0
37 2004 North Carolina USA 8049313 12.28 734415 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 114477 36944 5.03 15.59 0 0 0
37 2005 North Carolina USA 9045705 11.06 1036534 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 190778 50800 4.90 18.41 0 0 0
37 2006 North Carolina USA 9045705 11.06 1063201 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 177708 51731 4.87 16.71 0 0 0
37 2007 North Carolina USA 9045705 11.06 1082018 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 156923 63019 5.82 14.50 0 0 0
37 2008 North Carolina USA 9045705 11.06 1114318 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 166874 71269 6.40 14.98 0 0 0
37 2009 North Carolina USA 9045705 11.06 1146099 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 160703 67767 5.91 14.02 0 0 0
37 2010 North Carolina USA 9535483 12.41 1173438 33.30 759 46450 153600 30.3 65.27 21.18 8.39 1.14 2.17 0.06 1.63 0.16 167293 66669 5.68 14.26 0 0 0
37 2011 North Carolina USA 9845333 12.79 1203775 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 165870 63278 5.26 13.78 0 0 0
37 2012 North Carolina USA 9845333 12.79 1234112 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 150500 57668 4.67 12.20 0 0 0
37 2013 North Carolina USA 9845333 12.79 1264451 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 156613 56020 4.43 12.39 0 0 0
37 2014 North Carolina USA 9845333 12.79 1294790 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 156090 60365 4.66 12.06 0 0 0
37 2015 North Carolina USA 9845333 12.79 1325128 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 148593 59940 4.52 11.21 0 0 0
37 2016 North Carolina USA 9845333 12.79 1355467 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 147038 62539 4.61 10.85 0 0 0
38 2000 North Dakota USA 642200 11.86 85853 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 408 NA NA 0.48 0 0 0
38 2001 North Dakota USA 642200 11.86 86993 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 456 NA NA 0.52 0 0 0
38 2002 North Dakota USA 642200 11.86 88132 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 515 NA NA 0.58 0 0 0
38 2003 North Dakota USA 642200 11.86 89272 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 474 NA NA 0.53 0 0 0
38 2004 North Dakota USA 642200 11.86 90411 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 522 NA NA 0.58 0 0 0
38 2005 North Dakota USA 639725 7.37 91551 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 503 NA NA 0.55 0 0 0
38 2006 North Dakota USA 639725 7.37 92691 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 560 NA NA 0.60 0 0 0
38 2007 North Dakota USA 639725 7.37 93830 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 501 NA NA 0.53 0 0 0
38 2008 North Dakota USA 639725 7.37 94970 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 473 NA NA 0.50 0 0 0
38 2009 North Dakota USA 639725 7.37 96109 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 495 NA NA 0.52 0 0 0
38 2010 North Dakota USA 672591 7.45 97249 34.58 610 51641 123900 24.9 88.91 1.15 2.00 5.29 1.02 0.04 1.54 0.05 533 NA NA 0.55 0 0 0
38 2011 North Dakota USA 721640 7.23 101400 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 618 NA NA 0.61 0 0 0
38 2012 North Dakota USA 721640 7.23 105552 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 680 NA NA 0.64 0 0 0
38 2013 North Dakota USA 721640 7.23 109703 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 790 NA NA 0.72 0 0 0
38 2014 North Dakota USA 721640 7.23 113854 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 1054 NA NA 0.93 0 0 0
38 2015 North Dakota USA 721640 7.23 118006 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 1187 NA NA 1.01 0 0 0
38 2016 North Dakota USA 721640 7.23 122157 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 1339 NA NA 1.10 0 0 0

The pattern argument accepts regular expressions, which allow for more complex pattern matching. (For example, maybe you are looking for a string that only has letters in it, and no numbers.) In practice, this means that commonly used characters like the dollar sign ($) have a special meaning in those patterns; in a regular expression, $ indicates the match should occur at the end of the string, which is not helpful when you are simply looking for a dollar value in a string. To get around that, enclose the pattern you enter within the coll() function, like so: str_starts(amount, coll("$300")). This bypasses any regular expression matching and simply interprets your text literally (as a human would). Put another way, that example would capture an amount like “$300,000” (assuming it is being stored as a string in your data frame) because it starts with “$300”.
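Here is a minimal sketch of that behavior, using a made-up vector of amounts:

str_starts(c("$300,000", "300 dollars"), coll("$300"))
## [1]  TRUE FALSE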

So, let’s return to our original question: Which of the Virginias had the highest average eviction rate between 2013 and 2016?

Here’s what I’d need to do to answer that question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the observations ending with my desired string and matching the years within my desired range.

  3. Group the data based on the grouping variable of interest.

  4. Calculate the summary statistic for the mean of the desired variable.

We can translate that into code like so:

us_states %>%
  filter(str_ends(name, "Virginia") & year >= 2013 & year <= 2016) %>%
  group_by(name) %>%
  summarize(mean_eviction_rate=mean(`eviction-rate`))
name mean_eviction_rate
Virginia 5.3425
West Virginia 3.6525

As we can see here, the state of Virginia had the higher average eviction rate during that period of time.

Advanced string matching

What if we want to just filter observations based on some text that is present in the middle of a string? Here, we can use stringr::str_detect().

str_detect() allows you to match text based on patterns, or regular expressions. These patterns can get very complicated (but are quite powerful), and you can read more about them here.

For the sake of simplicity, we won’t cover patterns here. Instead, we’ll just use str_detect() to look for entries that include a specified string anywhere within a given variable. For example, we can filter in any state with “y” in its name and count the number of unique states like so:

us_states %>%
  filter(str_detect(name, "y")) %>%
  count(name)
name n
Kentucky 17
Maryland 17
New Jersey 17
Pennsylvania 17
Wyoming 17

Be careful when using str_detect(). First, the function is case-sensitive by default, so searching for “Y” will give you different results than “y” (i.e., New York is matched by “Y” but not “y”). Second, the function may think you are giving it pattern-matching instructions. For example, “^North” tells it to match only when “North” appears at the start of the string, just like str_starts(). Any special characters need to be escaped, or you can use the coll() function as described above. You can read about that in the documentation for the function.
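As a quick sketch of that anchoring behavior, the following should select the same states as the str_starts() example above, with count() tallying the rows for each:

us_states %>%
  filter(str_detect(name, "^North")) %>%
  count(name)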

Calculating Your Own Rates

The third question I’d like to ask my data is: What were the eviction rates in Massachusetts between 2010 and 2016, expressed on a per 100,000 people basis?

We already have data about eviction rates in the eviction-rate variable. However, those data are on a per 100 renter homes basis. In order to address that question, I would need to calculate my own rate. Thankfully, I have the variables necessary to do that (evictions and population).

To compute a new variable, we can use the dplyr::mutate() function. mutate() operates just like summarize() in the sense that you give it a variable name and provide the operation necessary to calculate it. However, the key difference is that the variables you create are simply appended to the existing data frame—they do not reduce the data frame into a set of summary statistics.
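Here is a quick sketch of that difference, using the poverty-rate variable (the new column names are just illustrative):

us_states %>%
  mutate(poverty_prop = `poverty-rate` / 100)     # keeps all 867 rows and appends one new column

us_states %>%
  summarize(poverty_mean = mean(`poverty-rate`))  # collapses everything into a single summary row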

A rate like the one we are interested in can be easily computed through the following calculation: X / Y (where X refers to the number of evictions and Y to the population). To get the per 100,000 people basis, we would multiply that result by 100,000.

So, here’s what I’d need to do to answer that third question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the desired state and the timeframe.

  3. Create a new variable (we’ll call it eviction_rate_pop) based on our rate calculation.

  4. Optional: Select only the variables of interest to me, so I can more quickly see the answer.

And here is how I can translate that into code:

us_states %>%
  filter(name == "Massachusetts" & between(year, 2010, 2016)) %>%
  mutate(eviction_rate_pop = (evictions/population)*100000) %>%
  select(year, name, population, evictions, eviction_rate_pop)
year name population evictions eviction_rate_pop
2010 Massachusetts 6547629 18803 287.1727
2011 Massachusetts 6705586 20452 304.9994
2012 Massachusetts 6705586 19648 293.0094
2013 Massachusetts 6705586 18131 270.3865
2014 Massachusetts 6705586 18289 272.7428
2015 Massachusetts 6705586 17986 268.2241
2016 Massachusetts 6705586 15708 234.2525

We can also use the dplyr::between() function to replicate the dual conditions of year >= 2010 and year <= 2016. They are functionally equivalent.
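As a quick check of that equivalence, note that between() treats both endpoints as inclusive:

between(c(2009, 2010, 2016), 2010, 2016)
## [1] FALSE  TRUE  TRUE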

Looking at the table above, I can see that there were roughly 270 evictions for every 100,000 people who lived in Massachusetts in 2013. (And, for other years between 2010 and 2016, too!)

Things to be mindful of

One thing to be mindful of is that the population statistics in our dataset change pretty infrequently. For example, we have the same population value for every year between 2011 and 2016. That’s probably because the Eviction Lab team is relying on the most comprehensive counts, which come from the decennial U.S. Census.

However, the U.S. Census Bureau also offers less precise, but more frequently updated, population data through its annual American Community Survey. If we wanted more recent population counts, we would thus need to merge the Eviction Lab data with data from the American Community Survey. This is possible, but there may be a good reason for keeping the less-recent counts, and we should therefore see if the methodological details explain the reason for that decision.

A second thing to be mindful of is that deciding the basis for a rate (e.g., 1,000 vs. 100,000) is more of an art than a science. It requires an understanding of the context and the base rates that your audiences are most familiar with. For example, we tend to see rates for states and cities expressed on a per 100,000 people basis, and it is thus an already familiar anchor for most folks. As such, that’s often a helpful basis to use. But that may not be the case in all contexts.

Alternatively, the Eviction Lab likely provides its rates on a per-100 renters basis because it is easy to also present that as a percentage (e.g., 2.1 per 100 can be translated to 2.1%). In fact, they present that rate as a percentage on the visuals on their website.

The key is to use a basis (‘per’ number) that you believe would be most informative to your audience—and not the one that makes a statistic seem most or least alarming.

Calculating a Percent Change

A fourth question I would like to ask my data is: How much higher (as a percentage) was Massachusetts’ number of evictions in 2016, relative to Florida’s?

As a reminder, we can calculate a percent change as follows: (Y2-Y1)/abs(Y1)*100. (The abs() function allows us to get the absolute value of a number.)
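For example, with made-up values of Y1 = 100 and Y2 = 78, we would compute:

(78 - 100) / abs(100) * 100
## [1] -22

That is, Y2 is 22% lower than Y1.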

Note that there is a difference between percent change and percent difference. People are typically more familiar with percent change (% increase or % decrease), so you’ll typically see journalists use that calculation. Additionally, keep in mind that it matters which entity (e.g., Massachusetts or Florida) you assign to Y2. Typically, the value of greatest interest (or the newest value) should be Y2. When in doubt, double-check your calculations against an online calculator, like the ones linked to here.

I could apply that formula as part of a two-step operation where I manually look up Massachusetts’ number of evictions (Y2) as well as Florida’s (Y1), and then (1) subtract Florida’s number from Massachusetts’, (2) divide the result by the absolute value of Florida’s number, and (3) multiply that by 100.

However, we can also do this in a single operation by calculating a summary statistic using that equation. In order to get the data necessary for that operation, we can use the subscripting approach that is native to R.

Subscripting

R allows us to pull out specific information from its outputs, which can be handy when you want to perform calculations using filtered or mutated data. We have already performed some calculations with different dplyr functions, but this is a good opportunity to demonstrate an alternative that is sometimes necessary for functions that don’t play nicely with the tidyverse, or for doing certain things the tidyverse does not directly support.

First, let’s try getting information from a vector. As a reminder, we can call information from a variable in a data frame with the object$variable notation, such as when we run:

head(us_states$evictions)
## [1]  4583 10106  9603  7379  6170  8300

Let’s say we wanted to get just the first value from that vector (4583). Here’s where a subscript would be useful. Subscripts allow you to call specific elements from an object by specifying it in brackets. As such, us_states$evictions[1] would produce only the first element from that vector.

us_states$evictions[1]
## [1] 4583

If your object is a data frame, which has observations (rows) and variables (columns), we subscript using matrix notation: [(observation), (variable)] (or [(row), (column)]); note the comma in both. For example, if I wanted to get the value in the first row and third column of our us_states data frame, I’d write:

us_states[1,3]
name
Alabama

I could also provide just one of those values (either observation or variable), but keep the comma, in order to get the information from my desired observation or variable. For example, here’s the entire first row in our dataset.

us_states[1,]
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
01 2000 Alabama USA 4447100 16.1 255910 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 7585 4583 1.79 2.96 0 0 0

Within the tidyverse, we could use slice(us_states, 1) to replicate us_states[1,] and select(us_states, 3) to replicate us_states[,3].

Notably, we can also use subscripting to filter information based on a condition (like we do with the filter() function). For example, if I just wanted rows that had the name variable be equivalent to “Wyoming”, I could write:

us_states[us_states$name == "Wyoming",]
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
56 2000 Wyoming USA 493782 11.42 28647 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 32 31 0.11 0.11 1 0 0
56 2001 Wyoming USA 493782 11.42 29310 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 31 26 0.09 0.11 1 0 0
56 2002 Wyoming USA 493782 11.42 38257 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 152 72 0.19 0.40 1 0 0
56 2003 Wyoming USA 493782 11.42 49547 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 701 368 0.74 1.41 1 0 0
56 2004 Wyoming USA 493782 11.42 50530 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 929 496 0.98 1.84 1 0 0
56 2005 Wyoming USA 523949 6.08 51513 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 991 539 1.05 1.92 1 0 0
56 2006 Wyoming USA 523949 6.08 52496 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 872 487 0.93 1.66 1 0 0
56 2007 Wyoming USA 523949 6.08 53478 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 866 426 0.80 1.62 1 0 0
56 2008 Wyoming USA 523949 6.08 54460 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 695 367 0.67 1.28 1 0 0
56 2009 Wyoming USA 523949 6.08 55443 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 430 415 0.75 0.78 1 0 0
56 2010 Wyoming USA 563626 7.21 44382 30.77 733 56573 184400 25.7 85.85 0.77 8.91 2.09 0.76 0.06 1.47 0.08 107 106 0.24 0.24 1 0 0
56 2011 Wyoming USA 579679 7.69 37604 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 94 92 0.24 0.25 1 0 0
56 2012 Wyoming USA 579679 7.69 38682 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 67 65 0.17 0.17 1 0 0
56 2013 Wyoming USA 579679 7.69 39761 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 87 87 0.22 0.22 1 0 0
56 2014 Wyoming USA 579679 7.69 40840 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 94 94 0.23 0.23 1 0 0
56 2015 Wyoming USA 579679 7.69 41918 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 243 241 0.57 0.58 1 0 0
56 2016 Wyoming USA 579679 7.69 42997 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 382 377 0.88 0.89 1 0 0

This notation follows the template object[object$variable comparison value, ]: the expression before the comma selects the matching rows, and leaving the spot after the comma blank keeps all of the columns.
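We can also combine this with the $ notation to pull a single value out of a vector by condition, which is the trick we will rely on in a moment. For example, this should return Wyoming’s number of evictions in 2016 (377, per the table above):

us_states$evictions[us_states$name == "Wyoming" & us_states$year == 2016]
## [1] 377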

Returning to our question

Let’s return to our fourth question: How much higher (as a percentage) was Massachusetts’ number of evictions in 2016, relative to Florida’s?

Here’s how I can approach the question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the desired timeframe.

  3. Optional: Select the relevant columns to make it easier to read the output.

  4. Summarize the data by creating a new variable called pct_chg that uses subscripting to (1) subtract Florida’s number of evictions from Massachusetts’, (2) divide the result by the absolute value of Florida’s number, and (3) multiply that by 100.

Here’s what that logic looks like as code:

us_states %>%
  filter(year==2016) %>%
  select(year, name, evictions) %>%
  summarize(pct_chg=((evictions[name=="Massachusetts"]-evictions[name=="Florida"])/abs(evictions[name=="Florida"]))*100)
pct_chg
-78.06605

Because we are piping the information, we do not need to prefix each variable with the name of the data frame object (e.g., us_states$evictions); the piping functionality automatically fills that in for us, referencing the output created in the step immediately preceding it.

As we can see, the number of evictions in Massachusetts was 78.1% lower than the number in Florida in 2016.

Of course, we need to keep in mind that Florida’s population is much larger than Massachusetts’. Thus, it would usually make the most sense to calculate the difference in their rates for a between-states comparison. If we have rates that are already expressed as percentages, it also makes the most sense to calculate the difference in percentage points (i.e., noting that 30% is a 20 percentage point increase from 10%) rather than trying to calculate a percent difference between percents (i.e., noting that 30% is a 200% increase from 10%).

Lagged Calculations

The fifth question I would like to ask my data is: What was the three-year percent change in eviction rate for each year between 2013 and 2016 in Massachusetts and Florida?

To answer this question, the first thing I would need to do is filter out all the states but Florida and Massachusetts. Then, I’d want to group the data by state so I can calculate the percent changes relative to the previous years for that state (and not other states). Then, I would want to order the data sequentially (to make sure the previous year is always the preceding row). Finally, I’d need to tell R to perform our percent change calculation using information from the current row as Y2 and the information from three rows back as Y1.

To perform that last step, we’ll use the dplyr::lag() function.

lag() allows us to look up a value from n elements (or rows, in the case of a data frame) back in object x, using the syntax lag(x, n). For example, let’s consider the following code:

lag(c(1, 5, 7, 10, 13), 3)
## [1] NA NA NA  1  5

What we are getting is NA for the first three elements (because there were no values that far back, relative to those elements). The fourth value in our lagged vector was the first in our original vector, and the fifth lagged value was the second original value.

Answering the question

Let’s go back to our original question: What was the three-year percent change in eviction rate for each year between 2013 and 2016 in Massachusetts and Florida?

Here’s what I’d need to do to answer that question:

  1. Call the data from my data frame.

  2. Apply a filter to cover just the desired states.

  3. Optional: Select only the variables of interest to me, so I can more quickly see the answer and increase computational efficiency.

  4. Group the data based on my grouping variable of interest.

  5. Sort the data by year (and add the .by_group=TRUE argument to the sorting function so that we’re only sorting in-group and thus keeping our state observations in sequential order).

  6. Compute a new variable (we’ll call it evictions_pct_chg) using the percent change formula we’re already familiar with (((Y2-Y1)/Y1)*100), using the present value as Y2 and the value three years earlier as Y1.

  7. Optional: Apply a second filter that only includes values within the desired timeframe. (We do this at the end since we need data from the earlier years to compute our new evictions_pct_chg variable.)

Here’s what that logic looks like as code:

us_states %>%
  filter(name %in% c("Massachusetts", "Florida")) %>%
  select(year, name, `eviction-rate`) %>%
  group_by(name) %>%
  arrange(year, .by_group=TRUE) %>%
  mutate(evictions_pct_chg=((`eviction-rate`-lag(`eviction-rate`, 3))/lag(`eviction-rate`, 3))*100) %>%
  filter(between(year, 2013, 2016))
year name eviction-rate evictions_pct_chg
2013 Florida 2.55 -17.2077922
2014 Florida 2.48 -21.7665615
2015 Florida 2.53 -15.6666667
2016 Florida 2.53 -0.7843137
2013 Massachusetts 1.86 -7.0000000
2014 Massachusetts 1.81 -15.8139535
2015 Massachusetts 1.76 -13.7254902
2016 Massachusetts 1.52 -18.2795699

As we can see, eviction rates declined in both states across these windows; Florida had the larger three-year decline in 2015 (a drop of 15.7%, versus 13.7% for Massachusetts).

Calculating Correlations

Data journalists are often interested in looking at relationships between variables.

Scatterplots are generally best for spotting potential relationships, but it is sometimes helpful to also quickly calculate a linear correlation using Pearson’s correlation coefficient to gauge the strength of the linear relationship. (Important note: Relationships can be curvilinear in nature. How to evaluate that is better covered in a formal statistics course.)

To calculate a simple, linear correlation, we just use the stats::cor() function, which is loaded automatically in R. With cor(), we just specify the two variables of interest as separate arguments (e.g., cor(dataset$variable1, dataset$variable2)).
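For example, a direct call outside of any pipeline looks like this (this pair of variables has no missing values, so no extra arguments are needed):

cor(us_states$population, us_states$`rent-burden`)
## [1] 0.2968485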

Piping into cor()

We could also calculate multiple correlations at the same time by incorporating cor() into our summarize() function to calculate the linear correlations between (1) population size and the rent-burden and (2) population size and the eviction rate:

us_states %>%
  summarize(cor_pop_burden=cor(population, `rent-burden`), cor_pop_evicrate=cor(population, `eviction-rate`, use="pairwise.complete.obs"))
cor_pop_burden cor_pop_evicrate
0.2968485 0.0279709

Because some of our observations had an NA value for eviction-rate, we need to tell cor() to only evaluate observations that have non-missing data for those two variables via the use="pairwise.complete.obs" argument. If we omit that argument, we would just get NA as the value for the correlation.

What the above output tells us is that, when looking at every observation in our data frame, there is a positive (the r coefficient has a positive value) but weak (the r coefficient is less than +/-0.3) relationship between population size and the rent burden, and between population size and the eviction rate. Thus, larger populations are weakly associated with higher rent burdens (or vice versa) and higher eviction rates (or vice versa).

The nice thing about using our multi-step workflow (instead of just running cor() in isolation) is that it allows us to apply filters that can help narrow our inquiry (correlation in recent years vs. many years ago) or to calculate the correlation for computed (mutated) variables.

Causality and the nature of the relationship

As a reminder, there is a big difference between correlation and causation.

Even if we had found a strong correlation between population size and rent burden, the causal variable might be very different. (Put differently, it’s not the size of the population that causes a place to have a high rent burden but some other variable that is common in cities that have larger populations.)

And, again, it is crucial to stress here that we are only measuring a linear correlation (using Pearson’s r) right now. A strong relationship might exist but simply operate in a non-linear way (e.g., increase Y by a low magnitude at the low end of X but then increase Y at increasingly high magnitudes as you move up the values of X). Thus, be careful with how you interpret such statistics.

Recoding Data

It is not uncommon for a dataset to include typos (e.g., Massachusetts being expressed as Masachussets) or to have variations in capitalization (e.g., Massachusetts and massachusetts). To ensure that those values are grouped correctly, we would want to give them a consistent label.

One easy way to do that is to use the dplyr::recode() function to recode some of our values. With the recode() function, we first specify the variable we want to inspect and the values we’d like to recode in the following format: OriginalValue = ReplacedValue. Within a piped operation, we will typically pair it with the mutate() function (thereby making recode() the operation used to compute the new variable).

This dataset is in good shape and we do not need to recode any of the information. Thus, for demonstrative purposes, I would like to recode Massachusetts as The Best Commonwealth and Kentucky as The Worst Commonwealth. (Can you guess the other Commonwealths in the U.S.?)

Here’s how I could do that:

us_states %>%
  filter(name %in% c("Kentucky", "Massachusetts"), year == 2016) %>%
  mutate(new_name = recode(name, "Massachusetts" = "The Best Commonwealth", "Kentucky" = "The Worst Commonwealth")) %>%
  select(year, name, new_name)
year name new_name
2016 Kentucky The Worst Commonwealth
2016 Massachusetts The Best Commonwealth

There are two important things to note here:

  • First, I could easily just replace the values in the original variable (name) by simply using mutate to compute a variable called name (rather than new_name). I often like to create new variables when I make changes, but you do not have to.

  • Second, by default, recode() will keep the original value for non-matching values (e.g., Alabama stays as Alabama). We can also tell recode() to replace non-matching values with a specific string (or operation) by appending the .default="REPLACEMENT STRING" argument. We can do the same for missing values (NA) by appending the .missing="REPLACEMENT STRING" argument, as sketched below.
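Here is a minimal sketch of those arguments, using a made-up vector of names:

recode(c("Massachusetts", "Maine", NA), "Massachusetts" = "MA", .default = "Other", .missing = "Unknown")
## [1] "MA"      "Other"   "Unknown"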

Changing the case

In instances where we want to standardize the case (e.g., make everything lower- or upper-case), we can also very simply use the tolower() and toupper() functions within our mutate() operation.

The tolower() and toupper() functions only require a single argument: the vector to apply the case change to.

For example, we can easily make all of our state names lower-case like this:

us_states %>%
  head() %>%
  mutate(name=tolower(name))
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
01 2000 alabama USA 4447100 16.10 255910 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 7585 4583 1.79 2.96 0 0 0
01 2001 alabama USA 4447100 16.10 486061 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 20828 10106 2.08 4.29 0 0 0
01 2002 alabama USA 4447100 16.10 495329 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 21070 9603 1.94 4.25 0 0 0
01 2003 alabama USA 4447100 16.10 456138 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 15750 7379 1.62 3.45 0 0 0
01 2004 alabama USA 4447100 16.10 446058 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 13508 6170 1.38 3.03 0 0 0
01 2005 alabama USA 4633360 12.86 449922 29.24 621 41216 111900 29.3 68.50 26.01 2.81 0.46 1.0 0.03 1.10 0.08 15426 8300 1.84 3.43 0 0 0