Intermediate Exploratory Data Analysis in R

Introduction

In this tutorial, we will be applying some of the tidyverse functions to an initial exploratory data analysis of state-level data from the Eviction Lab. Specifically, we’ll be learning how to:

  • Use the %in% operator.

  • Perform partial string matching.

  • Compute your own variables, including per capita rates and percent changes.

  • Use the subscript notation to pull out specific information.

  • Perform lagged calculations like year-over-year changes.

  • Calculate the linear correlation between two variables.

  • Recode values.

The Dataset

The Eviction Lab has collected data from 80 million records around the country, and you can read about the methodology here. As always, it is helpful to review the data dictionary to understand what each variable name corresponds to. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.

Pre-Analysis Steps

Load the data

The first step, of course, is to read in the data. Like before, we’ll use the readr::read_csv() function:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
us_states <- read_csv("https://dds.rodrigozamith.com/files/evictions_us_states.csv")
## Rows: 867 Columns: 27

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): GEOID, name, parent-location
## dbl (24): year, population, poverty-rate, renter-occupied-households, pct-re...

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Confirm it imported correctly

Normally, we would make sure the data imported correctly. Because we’ve previously evaluated data from this source, however, we can just take a quick look at the first few observations to confirm the import went as expected.

head(us_states)
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
01 2000 Alabama USA 4447100 16.10 255910 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 7585 4583 1.79 2.96 0 0 0
01 2001 Alabama USA 4447100 16.10 486061 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 20828 10106 2.08 4.29 0 0 0
01 2002 Alabama USA 4447100 16.10 495329 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 21070 9603 1.94 4.25 0 0 0
01 2003 Alabama USA 4447100 16.10 456138 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 15750 7379 1.62 3.45 0 0 0
01 2004 Alabama USA 4447100 16.10 446058 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 13508 6170 1.38 3.03 0 0 0
01 2005 Alabama USA 4633360 12.86 449922 29.24 621 41216 111900 29.3 68.50 26.01 2.81 0.46 1.0 0.03 1.10 0.08 15426 8300 1.84 3.43 0 0 0

Just like before, it appears we imported the data correctly. Each observation represents a different state (name) at a different point in time (year).

Filtering Strings

Using the %in% operator

While most of the data crunching we do as data journalists involves numbers, we sometimes have to work with strings, too.

So, I’d like to begin this more advanced data analysis by asking my data: Which New England state had the highest eviction rate in 2016?

Our dataset does not have a variable covering the region represented by the state. Instead, we’ll have to create our own filter that only includes those states. We can turn to the U.S. Census Bureau-designated regions and divisions for help in determining which states are part of New England.

Rather than creating a filter with a bunch of different OR conditions, I can just use the %in% operator and pair it with a vector of values that I would like to include. Put another way, I can use the filter() function and supply the conditional as variable %in% c("vector", "with", "multiple", "values").
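To see what %in% does on its own, consider this minimal sketch (the vectors are made up for illustration). The operator checks each element of the left-hand vector against the right-hand vector and returns a logical vector, which is exactly what filter() needs:

c("Maine", "Texas", "Vermont") %in% c("Connecticut", "Maine", "Vermont")
## [1]  TRUE FALSE  TRUE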

With this information in mind, here’s what I’d need to do to answer that question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the desired year and the desired states.

  3. Sort the data in descending order based on the variable of interest.

  4. Optional: Select only the variables of interest to me, so I can more quickly see the answer.

Now, let’s translate that logic into code:

us_states %>%
  filter(name %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") & year == 2016) %>%
  arrange(desc(`eviction-rate`)) %>%
  select(year, name, `eviction-rate`)
year name eviction-rate
2016 Rhode Island 3.07
2016 Connecticut 3.04
2016 Maine 2.26
2016 New Hampshire 1.70
2016 Massachusetts 1.52
2016 Vermont 0.09

As we can see, Rhode Island was the New England state with the highest eviction rate in 2016.

Using str_starts() and str_ends()

Similarly, sometimes we want to create a filter for our data so we only include observations that begin with or end with certain text.

In this case, I’d like to ask my data: Which of the Virginias had the highest average eviction rate between 2013 and 2016?

One approach would be to create a filter with just those two state names. However, I can also use the stringr::str_ends() function to perform a search that looks for partial matches by evaluating the end of the string.

stringr is automatically loaded by the tidyverse package.

The arguments I need to provide to the str_ends() function are (1) the variable I’m evaluating (the string argument, which comes first) and (2) the pattern or value that must be present at the end for it to yield a TRUE response (the pattern argument, which comes second). As a reminder, the TRUE responses are the ones included in the results when we apply a filter with the filter() function.
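To make that concrete, here is a quick sketch of str_ends() applied to a small, made-up vector of names:

str_ends(c("Virginia", "West Virginia", "Vermont"), "Virginia")
## [1]  TRUE  TRUE FALSE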

So, let’s try to create a filter (filter()) that only includes observations where the name variable ends (str_ends()) with the string "Virginia":

us_states %>%
  filter(str_ends(name, "Virginia"))
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
51 2000 Virginia USA 7078515 9.59 664083 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 77522 40945 6.17 11.67 0 0 0
51 2001 Virginia USA 7078515 9.59 734527 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 83498 44970 6.12 11.37 0 0 0
51 2002 Virginia USA 7078515 9.59 761196 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 123429 49799 6.54 16.22 0 0 0
51 2003 Virginia USA 7078515 9.59 885087 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 127023 52790 5.96 14.35 0 0 0
51 2004 Virginia USA 7078515 9.59 900780 31.91 650 46677 125400 24.5 70.15 19.44 4.66 0.26 3.66 0.05 1.61 0.17 153631 55869 6.20 17.06 0 0 0
51 2005 Virginia USA 7721730 7.16 914428 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 156448 51745 5.66 17.11 0 0 0
51 2006 Virginia USA 7721730 7.16 928075 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 158878 50920 5.49 17.12 0 0 0
51 2007 Virginia USA 7721730 7.16 613794 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 66538 37676 6.14 10.84 0 0 0
51 2008 Virginia USA 7721730 7.16 623822 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 65233 42797 6.86 10.46 0 0 0
51 2009 Virginia USA 7721730 7.16 629765 30.84 931 60316 247100 28.7 66.98 19.31 6.67 0.23 4.78 0.07 1.70 0.25 62974 38357 6.09 10.00 0 0 0
51 2010 Virginia USA 8001024 7.79 685038 32.75 1060 63636 249700 29.9 64.82 19.04 7.90 0.26 5.45 0.06 2.27 0.19 94057 45125 6.59 13.73 0 0 0
51 2011 Virginia USA 8256630 8.22 866175 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 144511 50013 5.77 16.68 0 0 0
51 2012 Virginia USA 8256630 8.22 881194 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 138282 48198 5.47 15.69 0 0 0
51 2013 Virginia USA 8256630 8.22 906176 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 151841 51134 5.64 16.76 0 0 0
51 2014 Virginia USA 8256630 8.22 917828 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 147094 51101 5.57 16.03 0 0 0
51 2015 Virginia USA 8256630 8.22 996441 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 144220 50213 5.04 14.47 0 0 0
51 2016 Virginia USA 8256630 8.22 1012175 33.82 1116 65015 245000 29.9 63.44 18.90 8.59 0.21 5.93 0.06 2.66 0.22 146534 51821 5.12 14.48 0 0 0

We are using str_ends() (which comes with the stringr package) and not ends_with() (which comes with the sister dplyr package). dplyr::ends_with() is a separate function used to select multiple variables from a data frame based on patterns in the variable names, and can be useful when you have a large dataset with hundreds of variables and want a sub-group of them. stringr::str_ends() evaluates a given string or vector to see if it matches the condition, yielding a TRUE or FALSE response that can serve as the basis of a filter.

Unlike prior instances of dplyr::filter() where we specified a variable, an evaluation operator, and the condition (e.g., name == "Virginia"), the str_ends() function expects a different syntax. This is common in R, especially when you are using functions from different packages. Thus, be sure to read each function’s documentation to understand what it’s looking for.

We can similarly look at the start of a string by using the str_starts() function, which also comes with stringr. For example, let’s include only the states (name) that begin with the string "North":

us_states %>%
  filter(str_starts(name, "North"))
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
37 2000 North Carolina USA 8049313 12.28 393465 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 26050 22292 5.67 6.62 0 0 0
37 2001 North Carolina USA 8049313 12.28 495885 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 49761 18780 3.79 10.03 0 0 0
37 2002 North Carolina USA 8049313 12.28 566782 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 88107 18880 3.33 15.55 0 0 0
37 2003 North Carolina USA 8049313 12.28 618050 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 92789 19057 3.08 15.01 0 0 0
37 2004 North Carolina USA 8049313 12.28 734415 30.64 548 39184 108300 24.3 70.16 21.41 4.71 1.18 1.40 0.04 0.99 0.11 114477 36944 5.03 15.59 0 0 0
37 2005 North Carolina USA 9045705 11.06 1036534 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 190778 50800 4.90 18.41 0 0 0
37 2006 North Carolina USA 9045705 11.06 1063201 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 177708 51731 4.87 16.71 0 0 0
37 2007 North Carolina USA 9045705 11.06 1082018 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 156923 63019 5.82 14.50 0 0 0
37 2008 North Carolina USA 9045705 11.06 1114318 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 166874 71269 6.40 14.98 0 0 0
37 2009 North Carolina USA 9045705 11.06 1146099 31.95 702 45069 143700 29.1 67.45 20.94 7.00 1.08 1.90 0.04 1.39 0.20 160703 67767 5.91 14.02 0 0 0
37 2010 North Carolina USA 9535483 12.41 1173438 33.30 759 46450 153600 30.3 65.27 21.18 8.39 1.14 2.17 0.06 1.63 0.16 167293 66669 5.68 14.26 0 0 0
37 2011 North Carolina USA 9845333 12.79 1203775 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 165870 63278 5.26 13.78 0 0 0
37 2012 North Carolina USA 9845333 12.79 1234112 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 150500 57668 4.67 12.20 0 0 0
37 2013 North Carolina USA 9845333 12.79 1264451 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 156613 56020 4.43 12.39 0 0 0
37 2014 North Carolina USA 9845333 12.79 1294790 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 156090 60365 4.66 12.06 0 0 0
37 2015 North Carolina USA 9845333 12.79 1325128 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 148593 59940 4.52 11.21 0 0 0
37 2016 North Carolina USA 9845333 12.79 1355467 34.87 797 46868 154900 30.3 64.24 21.17 8.84 1.10 2.46 0.05 1.94 0.20 147038 62539 4.61 10.85 0 0 0
38 2000 North Dakota USA 642200 11.86 85853 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 408 NA NA 0.48 0 0 0
38 2001 North Dakota USA 642200 11.86 86993 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 456 NA NA 0.52 0 0 0
38 2002 North Dakota USA 642200 11.86 88132 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 515 NA NA 0.58 0 0 0
38 2003 North Dakota USA 642200 11.86 89272 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 474 NA NA 0.53 0 0 0
38 2004 North Dakota USA 642200 11.86 90411 33.39 412 34604 74400 22.3 91.74 0.59 1.21 4.79 0.56 0.03 1.04 0.04 522 NA NA 0.58 0 0 0
38 2005 North Dakota USA 639725 7.37 91551 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 503 NA NA 0.55 0 0 0
38 2006 North Dakota USA 639725 7.37 92691 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 560 NA NA 0.60 0 0 0
38 2007 North Dakota USA 639725 7.37 93830 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 501 NA NA 0.53 0 0 0
38 2008 North Dakota USA 639725 7.37 94970 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 473 NA NA 0.50 0 0 0
38 2009 North Dakota USA 639725 7.37 96109 34.32 535 45140 104300 25.3 89.62 0.91 1.97 5.25 0.82 0.05 1.34 0.05 495 NA NA 0.52 0 0 0
38 2010 North Dakota USA 672591 7.45 97249 34.58 610 51641 123900 24.9 88.91 1.15 2.00 5.29 1.02 0.04 1.54 0.05 533 NA NA 0.55 0 0 0
38 2011 North Dakota USA 721640 7.23 101400 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 618 NA NA 0.61 0 0 0
38 2012 North Dakota USA 721640 7.23 105552 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 680 NA NA 0.64 0 0 0
38 2013 North Dakota USA 721640 7.23 109703 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 790 NA NA 0.72 0 0 0
38 2014 North Dakota USA 721640 7.23 113854 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 1054 NA NA 0.93 0 0 0
38 2015 North Dakota USA 721640 7.23 118006 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 1187 NA NA 1.01 0 0 0
38 2016 North Dakota USA 721640 7.23 122157 35.86 709 57181 153800 24.9 87.05 1.60 2.85 5.20 1.22 0.04 2.00 0.05 1339 NA NA 1.10 0 0 0

The pattern argument accepts regular expressions, which allow for more complex pattern matching. (For example, maybe you are looking for a string that only has letters in it, and no numbers.) In practice, this means that commonly used characters like the dollar sign ($) have a special meaning in those patterns; in a regular expression, $ indicates the match should occur at the end of the string, which is not helpful when you are simply looking for a dollar value in a string. To get around that, enclose the pattern you enter within the coll() function, like so: str_starts(amount, coll("$300")). This bypasses any regular expression matching and simply interprets your text literally (as a human would). Put another way, that example would capture an amount like “$300,000” (assuming it is being stored as a string in your data frame) because it starts with “$300”.
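Here is a minimal sketch of that behavior, using a made-up vector of amounts:

str_starts(c("$300,000", "300 dollars"), coll("$300"))
## [1]  TRUE FALSE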

So, let’s return to our original question: Which of the Virginias had the highest average eviction rate between 2013 and 2016?

Here’s what I’d need to do to answer that question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the observations ending with my desired string and matching the years within my desired range.

  3. Group the data based on the grouping variable of interest.

  4. Calculate the summary statistic for the mean of the desired variable.

We can translate that into code like so:

us_states %>%
  filter(str_ends(name, "Virginia") & year >= 2013 & year <= 2016) %>%
  group_by(name) %>%
  summarize(mean_eviction_rate=mean(`eviction-rate`))
name mean_eviction_rate
Virginia 5.3425
West Virginia 3.6525

As we can see here, the state of Virginia had the higher average eviction rate during that period of time.

Advanced string matching

What if we want to just filter observations based on some text that is present in the middle of a string? Here, we can use stringr::str_detect().

str_detect() allows you to match text based on patterns, or regular expressions. These patterns can get very complicated (but are quite powerful), and you can read more about them here.

For the sake of simplicity, we won’t cover patterns here. Instead, we’ll just use str_detect() to look for entries that include a specified string anywhere within a given variable. For example, we can filter in any state with “y” in its name and count the number of unique states like so:

us_states %>%
  filter(str_detect(name, "y")) %>%
  count(name)
name n
Kentucky 17
Maryland 17
New Jersey 17
Pennsylvania 17
Wyoming 17

Be careful when using str_detect(). First, the function is case-sensitive by default, so searching for “Y” will give you different results than “y” (i.e., New York is matched by “Y” but not “y”). Second, the function may think you are giving it pattern-matching instructions. For example, “^North” tells it to match only when “North” appears at the start of the string, just like str_starts(). Any special characters need to be escaped, or you can use the coll() function as described above. You can read about that in the documentation for the function.
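As a quick sketch of that anchoring behavior, the following should select the same states as the str_starts() example above, with count() tallying the rows for each:

us_states %>%
  filter(str_detect(name, "^North")) %>%
  count(name)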

Calculating Your Own Rates

The third question I’d like to ask my data is: What were the eviction rates in Massachusetts between 2010 and 2016, expressed on a per 100,000 people basis?

We already have data about eviction rates in the eviction-rate variable. However, those data are on a per 100 renter homes basis. In order to address that question, I would need to calculate my own rate. Thankfully, I have the variables necessary to do that (evictions and population).

To compute a new variable, we can use the dplyr::mutate() function. mutate() operates just like summarize() in the sense that you give it a variable name and provide the operation necessary to calculate it. However, the key difference is that the variables you create are simply appended to the existing data frame—they do not reduce the data frame into a set of summary statistics.
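Here is a quick sketch of that difference, using the poverty-rate variable (the new column names are just illustrative):

us_states %>%
  mutate(poverty_prop = `poverty-rate` / 100)     # keeps all 867 rows and appends one new column

us_states %>%
  summarize(poverty_mean = mean(`poverty-rate`))  # collapses everything into a single summary row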

A rate like the one we are interested in can be easily computed through the following calculation: X / Y (where X refers to the number of evictions and Y to the population). To get the per 100,000 people basis, we would multiply that result by 100,000.

So, here’s what I’d need to do to answer that third question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the desired state and the timeframe.

  3. Create a new variable (we’ll call it eviction_rate_pop) based on our rate calculation.

  4. Optional: Select only the variables of interest to me, so I can more quickly see the answer.

And here is how I can translate that into code:

us_states %>%
  filter(name == "Massachusetts" & between(year, 2010, 2016)) %>%
  mutate(eviction_rate_pop = (evictions/population)*100000) %>%
  select(year, name, population, evictions, eviction_rate_pop)
year name population evictions eviction_rate_pop
2010 Massachusetts 6547629 18803 287.1727
2011 Massachusetts 6705586 20452 304.9994
2012 Massachusetts 6705586 19648 293.0094
2013 Massachusetts 6705586 18131 270.3865
2014 Massachusetts 6705586 18289 272.7428
2015 Massachusetts 6705586 17986 268.2241
2016 Massachusetts 6705586 15708 234.2525

We can also use the dplyr::between() function to replicate the dual conditions of year >= 2010 and year <= 2016. They are functionally equivalent.
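As a quick check of that equivalence, note that between() treats both endpoints as inclusive:

between(c(2009, 2010, 2016), 2010, 2016)
## [1] FALSE  TRUE  TRUE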

Looking at the table above, I can see that there were roughly 270 evictions for every 100,000 people who lived in Massachusetts in 2013. (And, for other years between 2010 and 2016, too!)

Things to be mindful of

One thing to be mindful of is that the population statistics in our dataset change pretty infrequently. For example, we have the same population value for every year between 2011 and 2016. That’s probably because the Eviction Lab team is relying on the most comprehensive counts, which come from the decennial U.S. Census.

However, the U.S. Census Bureau also offers less precise, but more frequently updated, population data through its annual American Community Survey. If we wanted more recent population counts, we would thus need to merge the Eviction Lab data with data from the American Community Survey. This is possible, but there may be a good reason for keeping the less-recent counts, and we should therefore see if the methodological details explain the reason for that decision.

A second thing to be mindful of is that deciding the basis for a rate (e.g., 1,000 vs. 100,000) is more of an art than a science. It requires an understanding of the context and the base rates that your audiences are most familiar with. For example, we tend to see rates for states and cities expressed on a per 100,000 people basis, and it is thus an already familiar anchor for most folks. As such, that’s often a helpful basis to use. But that may not be the case in all contexts.

Alternatively, the Eviction Lab likely provides its rates on a per-100 renters basis because it is easy to also present that as a percentage (e.g., 2.1 per 100 can be translated to 2.1%). In fact, they present that rate as a percentage on the visuals on their website.

The key is to use a basis (‘per’ number) that you believe would be most informative to your audience—and not the one that makes a statistic seem most or least alarming.

Calculating a Percent Change

A fourth question I would like to ask my data is: How much higher (as a percentage) was Massachusetts’ number of evictions in 2016, relative to Florida’s?

As a reminder, we can calculate a percent change as follows: (Y2-Y1)/abs(Y1)*100. (The abs() function allows us to get the absolute value of a number.)
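For example, with made-up values of Y1 = 100 and Y2 = 78, we would compute:

(78 - 100) / abs(100) * 100
## [1] -22

That is, Y2 is 22% lower than Y1.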

Note that there is a difference between percent change and percent difference. People are typically more familiar with percent change (% increase or % decrease), so you’ll typically see journalists use that calculation. Additionally, keep in mind that it matters which entity (e.g., Massachusetts or Florida) you assign to Y2. Typically, the value of greatest interest (or the newest value) should be Y2. When in doubt, double-check your calculations against an online calculator, like the ones linked to here.

I could apply that formula as part of a two-step operation where I manually look up Massachusetts’ number of evictions (Y2) as well as Florida’s (Y1), and then (1) subtract Florida’s number from Massachusetts’, (2) divide the result by the absolute value of Florida’s number, and (3) multiply that by 100.

However, we can also do this in a single operation by calculating a summary statistic using that equation. In order to get the data necessary for that operation, we can use the subscripting approach that is native to R.

Subscripting

R allows us to pull out specific information from its outputs, which can be handy when you want to perform calculations using filtered or mutated data. We have already performed some calculations with different dplyr functions, but this is a good opportunity to demonstrate an alternative that is sometimes necessary for functions that don’t play nicely with the tidyverse, or for doing certain things the tidyverse does not directly support.

First, let’s try getting information from a vector. As a reminder, we can call information from a variable in a data frame with the object$variable notation, such as when we run:

head(us_states$evictions)
## [1]  4583 10106  9603  7379  6170  8300

Let’s say we wanted to get just the first value from that vector (4583). Here’s where a subscript would be useful. Subscripts allow you to call specific elements from an object by specifying it in brackets. As such, us_states$evictions[1] would produce only the first element from that vector.

us_states$evictions[1]
## [1] 4583

If your object is a data frame, which has observations (rows) and variables (columns), we subscript using matrix notation: [(observation), (variable)] (or [(row), (column)]); note the comma in both. For example, if I wanted to get the value in the first row and third column of our us_states data frame, I’d write:

us_states[1,3]
name
Alabama

I could also provide just one of those values (either observation or variable), but keep the comma, in order to get the information from my desired observation or variable. For example, here’s the entire first row in our dataset.

us_states[1,]
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
01 2000 Alabama USA 4447100 16.1 255910 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 7585 4583 1.79 2.96 0 0 0

Within the tidyverse, we could use slice(us_states, 1) to replicate us_states[1,] and select(us_states, 3) to replicate us_states[,3].

Notably, we can also use subscripting to filter information based on a condition (like we do with the filter() function). For example, if I just wanted rows that had the name variable be equivalent to “Wyoming”, I could write:

us_states[us_states$name == "Wyoming",]
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
56 2000 Wyoming USA 493782 11.42 28647 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 32 31 0.11 0.11 1 0 0
56 2001 Wyoming USA 493782 11.42 29310 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 31 26 0.09 0.11 1 0 0
56 2002 Wyoming USA 493782 11.42 38257 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 152 72 0.19 0.40 1 0 0
56 2003 Wyoming USA 493782 11.42 49547 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 701 368 0.74 1.41 1 0 0
56 2004 Wyoming USA 493782 11.42 50530 30.01 437 37892 96600 22.5 88.86 0.71 6.41 2.07 0.54 0.05 1.25 0.10 929 496 0.98 1.84 1 0 0
56 2005 Wyoming USA 523949 6.08 51513 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 991 539 1.05 1.92 1 0 0
56 2006 Wyoming USA 523949 6.08 52496 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 872 487 0.93 1.66 1 0 0
56 2007 Wyoming USA 523949 6.08 53478 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 866 426 0.80 1.62 1 0 0
56 2008 Wyoming USA 523949 6.08 54460 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 695 367 0.67 1.28 1 0 0
56 2009 Wyoming USA 523949 6.08 55443 30.10 637 51990 163400 24.2 86.87 0.88 7.59 2.03 0.72 0.03 1.77 0.10 430 415 0.75 0.78 1 0 0
56 2010 Wyoming USA 563626 7.21 44382 30.77 733 56573 184400 25.7 85.85 0.77 8.91 2.09 0.76 0.06 1.47 0.08 107 106 0.24 0.24 1 0 0
56 2011 Wyoming USA 579679 7.69 37604 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 94 92 0.24 0.25 1 0 0
56 2012 Wyoming USA 579679 7.69 38682 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 67 65 0.17 0.17 1 0 0
56 2013 Wyoming USA 579679 7.69 39761 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 87 87 0.22 0.22 1 0 0
56 2014 Wyoming USA 579679 7.69 40840 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 94 94 0.23 0.23 1 0 0
56 2015 Wyoming USA 579679 7.69 41918 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 243 241 0.57 0.58 1 0 0
56 2016 Wyoming USA 579679 7.69 42997 30.94 789 58840 194800 25.6 84.47 1.03 9.61 1.90 0.89 0.07 2.01 0.02 382 377 0.88 0.89 1 0 0

This notation follows the template object[object$variable comparison value, ]: the expression before the comma selects the matching rows, and leaving the spot after the comma blank keeps all of the columns.
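We can also combine this with the $ notation to pull a single value out of a vector by condition, which is the trick we will rely on in a moment. For example, this should return Wyoming’s number of evictions in 2016 (377, per the table above):

us_states$evictions[us_states$name == "Wyoming" & us_states$year == 2016]
## [1] 377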

Returning to our question

Let’s return to our fourth question: How much higher (as a percentage) was Massachusetts’ number of evictions in 2016, relative to Florida’s?

Here’s how I can approach the question:

  1. Call the data from my data frame.

  2. Apply a filter to cover the desired timeframe.

  3. Optional: Select the relevant columns to make it easier to read the output.

  4. Summarize the data by creating a new variable called pct_chg that uses subscripting to (1) subtract Florida’s number of evictions from Massachusetts’, (2) divide the result by the absolute value of Florida’s number, and (3) multiply that by 100.

Here’s what that logic looks like as code:

us_states %>%
  filter(year==2016) %>%
  select(year, name, evictions) %>%
  summarize(pct_chg=((evictions[name=="Massachusetts"]-evictions[name=="Florida"])/abs(evictions[name=="Florida"]))*100)
pct_chg
-78.06605

Because we are piping the information, we do not need to prefix each variable with the name of the data frame object (e.g., us_states$evictions); the piping functionality automatically fills that in for us, referencing the output created in the step immediately preceding it.

As we can see, the number of evictions in Massachusetts was 78.1% lower than the number in Florida in 2016.

Of course, we need to keep in mind that Florida’s population is much larger than Massachusetts’. Thus, it would usually make the most sense to calculate the difference in their rates for a between-states comparison. If we have rates that are already expressed as percentages, it also makes the most sense to calculate the difference in percentage points (i.e., noting that 30% is a 20 percentage point increase from 10%) rather than trying to calculate a percent difference between percents (i.e., noting that 30% is a 200% increase from 10%).

Lagged Calculations

The fifth question I would like to ask my data is: What was the three-year percent change in eviction rate for each year between 2013 and 2016 in Massachusetts and Florida?

To answer this question, the first thing I would need to do is filter out all the states but Florida and Massachusetts. Then, I’d want to group the data by state so I can calculate the percent changes relative to the previous years for that state (and not other states). Then, I would want to order the data sequentially (to make sure the previous year is always the preceding row). Finally, I’d need to tell R to perform our percent change calculation using information from the current row as Y2 and the information from three rows back as Y1.

To perform that last step, we’ll use the dplyr::lag() function.

lag() allows us to look up a value from n elements (or rows, in the case of a data frame) back in object x, using the syntax lag(x, n). For example, let’s consider the following code:

lag(c(1, 5, 7, 10, 13), 3)
## [1] NA NA NA  1  5

What we are getting is NA for the first three elements (because there were no values that far back, relative to those elements). The fourth value in our lagged vector was the first in our original vector, and the fifth lagged value was the second original value.

Answering the question

Let’s go back to our original question: What was the three-year percent change in eviction rate for each year between 2013 and 2016 in Massachusetts and Florida?

Here’s what I’d need to do to answer that question:

  1. Call the data from my data frame.

  2. Apply a filter to cover just the desired states.

  3. Optional: Select only the variables of interest to me, so I can more quickly see the answer and increase computational efficiency.

  4. Group the data based on my grouping variable of interest.

  5. Sort the data by year (and add the .by_group=TRUE argument to the sorting function so that we’re only sorting in-group and thus keeping our state observations in sequential order).

  6. Compute a new variable (we’ll call it evictions_pct_chg) using the percent change formula we’re already familiar with (((Y2-Y1)/Y1)*100), using the present value as Y2 and the value three years earlier as Y1.

  7. Optional: Apply a second filter that only includes values within the desired timeframe. (We do this at the end since we need data from the earlier years to compute our new evictions_pct_chg variable.)

Here’s what that logic looks like as code:

us_states %>%
  filter(name %in% c("Massachusetts", "Florida")) %>%
  select(year, name, `eviction-rate`) %>%
  group_by(name) %>%
  arrange(year, .by_group=TRUE) %>%
  mutate(evictions_pct_chg=((`eviction-rate`-lag(`eviction-rate`, 3))/lag(`eviction-rate`, 3))*100) %>%
  filter(between(year, 2013, 2016))
year name eviction-rate evictions_pct_chg
2013 Florida 2.55 -17.2077922
2014 Florida 2.48 -21.7665615
2015 Florida 2.53 -15.6666667
2016 Florida 2.53 -0.7843137
2013 Massachusetts 1.86 -7.0000000
2014 Massachusetts 1.81 -15.8139535
2015 Massachusetts 1.76 -13.7254902
2016 Massachusetts 1.52 -18.2795699

As we can see, eviction rates declined in both states across these windows; Florida had the larger three-year decline in 2015 (a drop of 15.7%, versus 13.7% for Massachusetts).

Calculating Correlations

Data journalists are often interested in looking at relationships between variables.

Scatterplots are generally best for spotting potential relationships, but it is sometimes helpful to also quickly calculate a linear correlation using Pearson’s correlation coefficient to gauge the strength of the linear relationship. (Important note: Relationships can be curvilinear in nature. How to evaluate that is better covered in a formal statistics course.)

To calculate a simple, linear correlation, we just use the stats::cor() function, which is loaded automatically in R. With cor(), we just specify the two variables of interest as separate arguments (e.g., cor(dataset$variable1, dataset$variable2)).
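For example, a direct call outside of any pipeline looks like this (this pair of variables has no missing values, so no extra arguments are needed):

cor(us_states$population, us_states$`rent-burden`)
## [1] 0.2968485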

Piping into cor()

We could also calculate multiple correlations at the same time by incorporating cor() into our summarize() function to calculate the linear correlations between (1) population size and the rent-burden and (2) population size and the eviction rate:

us_states %>%
  summarize(cor_pop_burden=cor(population, `rent-burden`), cor_pop_evicrate=cor(population, `eviction-rate`, use="pairwise.complete.obs"))
cor_pop_burden cor_pop_evicrate
0.2968485 0.0279709

Because some of our observations had an NA value for eviction-rate, we need to tell cor() to only evaluate observations that have non-missing data for those two variables via the use="pairwise.complete.obs" argument. If we omit that argument, we would just get NA as the value for the correlation.

What the above output tells us is that, when looking at every observation in our data frame, there is a positive (the r coefficient has a positive value) but weak (the r coefficient is less than +/-0.3) relationship between population size and the rent burden, and between population size and the eviction rate. Thus, larger populations are weakly associated with higher rent burdens (or vice versa) and higher eviction rates (or vice versa).

The nice thing about using our multi-step workflow (instead of just running cor() in isolation) is that it allows us to apply filters that can help narrow our inquiry (correlation in recent years vs. many years ago) or to calculate the correlation for computed (mutated) variables.

Causality and the nature of the relationship

As a reminder, there is a big difference between correlation and causation.

Even if we had found a strong correlation between population size and rent burden, the causal variable might be very different. (Put differently, it’s not the size of the population that causes a place to have a high rent burden but some other variable that is common in cities that have larger populations.)

And, again, it is crucial to stress here that we are only measuring a linear correlation (using Pearson’s r) right now. A strong relationship might exist but simply operate in a non-linear way (e.g., increase Y by a low magnitude at the low end of X but then increase Y at increasingly high magnitudes as you move up the values of X). Thus, be careful with how you interpret such statistics.

Recoding Data

It is not uncommon for a dataset to include typos (e.g., Massachusetts being expressed as Masachussets) or to have variations in capitalization (e.g., Massachusetts and massachusetts). To ensure that those values are grouped correctly, we would want to give them a consistent label.

One easy way to do that is to use the dplyr::recode() function to recode some of our values. With the recode() function, we first specify the variable we want to inspect and the values we’d like to recode in the following format: OriginalValue = ReplacedValue. Within a piped operation, we will typically pair it with the mutate() function (thereby making recode() the operation used to compute the new variable).

This dataset is in good shape and we do not need to recode any of the information. Thus, for demonstrative purposes, I would like to recode Massachusetts as The Best Commonwealth and Kentucky as The Worst Commonwealth. (Can you guess the other Commonwealths in the U.S.?)

Here’s how I could do that:

us_states %>%
  filter(name %in% c("Kentucky", "Massachusetts"), year == 2016) %>%
  mutate(new_name = recode(name, "Massachusetts" = "The Best Commonwealth", "Kentucky" = "The Worst Commonwealth")) %>%
  select(year, name, new_name)
year name new_name
2016 Kentucky The Worst Commonwealth
2016 Massachusetts The Best Commonwealth

There are two important things to note here:

  • First, I could easily just replace the values in the original variable (name) by simply using mutate to compute a variable called name (rather than new_name). I often like to create new variables when I make changes, but you do not have to.

  • Second, by default, recode() will keep the original value for non-matching values (e.g., Alabama stays as Alabama). We can also tell recode() to replace non-matching values with a specific string (or operation) by appending the .default="REPLACEMENT STRING" argument. We can do the same for missing values (NA) by appending the .missing="REPLACEMENT STRING" argument, as sketched below.
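Here is a minimal sketch of those arguments, using a made-up vector of names:

recode(c("Massachusetts", "Maine", NA), "Massachusetts" = "MA", .default = "Other", .missing = "Unknown")
## [1] "MA"      "Other"   "Unknown"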

Changing the case

In instances where we want to standardize the case (e.g., make everything lower- or upper-case), we can also very simply use the tolower() and toupper() functions within our mutate() operation.

The tolower() and toupper() functions only require a single argument: the vector to apply the case change to.

For example, we can easily make all of our state names lower-case like this:

us_states %>%
  head() %>%
  mutate(name=tolower(name))
GEOID year name parent-location population poverty-rate renter-occupied-households pct-renter-occupied median-gross-rent median-household-income median-property-value rent-burden pct-white pct-af-am pct-hispanic pct-am-ind pct-asian pct-nh-pi pct-multiple pct-other eviction-filings evictions eviction-rate eviction-filing-rate low-flag imputed subbed
01 2000 alabama USA 4447100 16.10 255910 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 7585 4583 1.79 2.96 0 0 0
01 2001 alabama USA 4447100 16.10 486061 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 20828 10106 2.08 4.29 0 0 0
01 2002 alabama USA 4447100 16.10 495329 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 21070 9603 1.94 4.25 0 0 0
01 2003 alabama USA 4447100 16.10 456138 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 15750 7379 1.62 3.45 0 0 0
01 2004 alabama USA 4447100 16.10 446058 27.54 447 34135 85100 24.8 70.29 25.86 1.71 0.49 0.7 0.02 0.88 0.06 13508 6170 1.38 3.03 0 0 0
01 2005 alabama USA 4633360 12.86 449922 29.24 621 41216 111900 29.3 68.50 26.01 2.81 0.46 1.0 0.03 1.10 0.08 15426 8300 1.84 3.43 0 0 0