Exploratory Data Visualization in R

Introduction

In this tutorial, we will be using some functions from the DataExplorer and plotly packages to help us perform an initial exploratory visual analysis of county-level data from the Eviction Lab. We’ll also do some basic plotting of our own with the ggplot2 package. Specifically, we’ll be learning how to:

Visualize missing values in the data.
Generate histograms for all your numerical variables.
Generate bar plots for all your nominal variables.
Create a line graph using ggplot.
Add some basic interactivity to your ggplots.

The Dataset

The Eviction Lab, led by Matthew Desmond, has collected data from 80 million records around the country, and you can read about the methodology here. Desmond’s team also makes data available at more micro levels (U.S. Census tracts and block groups; see https://www.census.gov/geo/reference/webatlas/tracts.html), as well as more macro levels (county, statewide).

As always, it is helpful to review the data dictionary to understand the meaning of the different variables (what each variable name corresponds to). If you want more detail about any of the variables, I encourage you to review the full Methodology Report.

Pre-Analysis Steps

Load the data

The first step, of course, is to read in the data. We’ll use the readr::read_csv() function (readr can be loaded by the tidyverse metapackage) and store the data in an object called ma_counties.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()

ma_counties <- read_csv("https://dds.rodrigozamith.com/files/evictions_ma_counties.csv")

## Rows: 238 Columns: 27

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): name, parent-location
## dbl (25): GEOID, year, population, poverty-rate, renter-occupied-households,...

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Confirm it imported correctly

The next step is to confirm that the data were imported the way we wanted them to be. Here, I would recommend that you follow three steps.

Step 1: Check the first few observations

Start by looking at the first few observations (rows) of the object with the head() function to see if anything looks odd:

head(ma_counties)

GEOID	year	name	parent-location	population	poverty-rate	renter-occupied-households	pct-renter-occupied	median-gross-rent	median-household-income	median-property-value	rent-burden	pct-white	pct-af-am	pct-hispanic	pct-am-ind	pct-asian	pct-nh-pi	pct-multiple	pct-other	eviction-filings	evictions	eviction-rate	eviction-filing-rate
25001	2000	Barnstable County	Massachusetts	222230	6.89	21035	22.18	723	45933	178800	27.7	93.38	1.73	1.35	0.52	0.61	0.02	1.56	0.83	NA	NA	NA	NA
25001	2001	Barnstable County	Massachusetts	222230	6.89	21096	22.18	723	45933	178800	27.7	93.38	1.73	1.35	0.52	0.61	0.02	1.56	0.83	NA	NA	NA	NA
25001	2002	Barnstable County	Massachusetts	222230	6.89	21157	22.18	723	45933	178800	27.7	93.38	1.73	1.35	0.52	0.61	0.02	1.56	0.83	NA	NA	NA	NA
25001	2003	Barnstable County	Massachusetts	222230	6.89	21218	22.18	723	45933	178800	27.7	93.38	1.73	1.35	0.52	0.61	0.02	1.56	0.83	767	751	3.54	3.61
25001	2004	Barnstable County	Massachusetts	222230	6.89	21279	22.18	723	45933	178800	27.7	93.38	1.73	1.35	0.52	0.61	0.02	1.56	0.83	NA	NA	NA	NA
25001	2005	Barnstable County	Massachusetts	222629	4.31	21340	18.91	1045	60096	399900	32.8	93.05	1.90	1.76	0.37	0.93	0.01	1.58	0.40	NA	NA	NA	NA

Just like before, it appears we imported the data correctly. Each observation represents a different county (name) at a different point in time (year).

Step 2: Check the structure

Next, we’ll use the glimpse() function to check all the column names and make sure they’re associated with the right data type:

glimpse(ma_counties)

## Rows: 238
## Columns: 27
## $ GEOID                        <dbl> 25001, 25001, 25001, 25001, 25001, 25001,...
## $ year                         <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006,...
## $ name                         <chr> "Barnstable County", "Barnstable County",...
## $ `parent-location`            <chr> "Massachusetts", "Massachusetts", "Massac...
## $ population                   <dbl> 222230, 222230, 222230, 222230, 222230, 2...
## $ `poverty-rate`               <dbl> 6.89, 6.89, 6.89, 6.89, 6.89, 4.31, 4.31,...
## $ `renter-occupied-households` <dbl> 21035, 21096, 21157, 21218, 21279, 21340,...
## $ `pct-renter-occupied`        <dbl> 22.18, 22.18, 22.18, 22.18, 22.18, 18.91,...
## $ `median-gross-rent`          <dbl> 723, 723, 723, 723, 723, 1045, 1045, 1045...
## $ `median-household-income`    <dbl> 45933, 45933, 45933, 45933, 45933, 60096,...
## $ `median-property-value`      <dbl> 178800, 178800, 178800, 178800, 178800, 3...
## $ `rent-burden`                <dbl> 27.7, 27.7, 27.7, 27.7, 27.7, 32.8, 32.8,...
## $ `pct-white`                  <dbl> 93.38, 93.38, 93.38, 93.38, 93.38, 93.05,...
## $ `pct-af-am`                  <dbl> 1.73, 1.73, 1.73, 1.73, 1.73, 1.90, 1.90,...
## $ `pct-hispanic`               <dbl> 1.35, 1.35, 1.35, 1.35, 1.35, 1.76, 1.76,...
## $ `pct-am-ind`                 <dbl> 0.52, 0.52, 0.52, 0.52, 0.52, 0.37, 0.37,...
## $ `pct-asian`                  <dbl> 0.61, 0.61, 0.61, 0.61, 0.61, 0.93, 0.93,...
## $ `pct-nh-pi`                  <dbl> 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01,...
## $ `pct-multiple`               <dbl> 1.56, 1.56, 1.56, 1.56, 1.56, 1.58, 1.58,...
## $ `pct-other`                  <dbl> 0.83, 0.83, 0.83, 0.83, 0.83, 0.40, 0.40,...
## $ `eviction-filings`           <dbl> NA, NA, NA, 767, NA, NA, NA, NA, NA, NA, ...
## $ evictions                    <dbl> NA, NA, NA, 751, NA, NA, NA, NA, NA, NA, ...
## $ `eviction-rate`              <dbl> NA, NA, NA, 3.54, NA, NA, NA, NA, NA, NA,...
## $ `eviction-filing-rate`       <dbl> NA, NA, NA, 3.61, NA, NA, NA, NA, NA, NA,...
## $ `low-flag`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ imputed                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ subbed                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

All of our variables appear to have the data types we’d expect for them (e.g., dbl for variables we would think would be numerical and chr for those we expect to be text).
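One detail in the glimpse() output is easy to miss: several column names are wrapped in backticks. Names that contain hyphens, such as poverty-rate, are not syntactically valid in R, so we have to quote them with backticks whenever we refer to them in code. Here is a minimal sketch using a small, made-up data frame (toy_counties and its values are assumptions for illustration, not the Eviction Lab file):

```r
# A made-up data frame; check.names = FALSE keeps the hyphenated name as-is
toy_counties <- data.frame(
  name = c("County A", "County B"),
  `poverty-rate` = c(6.89, 9.40),
  check.names = FALSE
)

# Backticks make R read poverty-rate as one name, not "poverty minus rate"
toy_counties$`poverty-rate`

# Syntactically valid names such as name need no quoting
toy_counties$name
```

This is why the filter() and aes() calls later in the tutorial write `eviction-filing-rate` with backticks but leave year and name bare.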

Step 3: Check the range of values

We’ll also want to check to see if there are any extreme values in our dataset that may have been the result of information being entered incorrectly. To do that, we’ll use the summary() function:

summary(ma_counties)

##      GEOID            year          name           parent-location
##  Min.   :25001   Min.   :2000   Length:238         Length:238
##  1st Qu.:25007   1st Qu.:2004   Class :character   Class :character
##  Median :25014   Median :2008   Mode  :character   Mode  :character
##  Mean   :25014   Mean   :2008
##  3rd Qu.:25021   3rd Qu.:2012
##  Max.   :25027   Max.   :2016
##
##    population       poverty-rate   renter-occupied-households
##  Min.   :   9520   Min.   : 3.61   Min.   :  1365
##  1st Qu.: 130168   1st Qu.: 5.62   1st Qu.: 18481
##  Median : 470496   Median : 7.40   Median : 57391
##  Mean   : 466733   Mean   : 8.16   Mean   : 69080
##  3rd Qu.: 723419   3rd Qu.: 9.45   3rd Qu.:101985
##  Max.   :1556116   Max.   :18.99   Max.   :237937
##
##  pct-renter-occupied median-gross-rent median-household-income
##  Min.   :18.91       Min.   : 499.0    Min.   :39047
##  1st Qu.:29.20       1st Qu.: 743.2    1st Qu.:49956
##  Median :33.59       Median : 853.0    Median :56842
##  Mean   :33.88       Mean   : 946.7    Mean   :59647
##  3rd Qu.:37.00       3rd Qu.:1141.0    3rd Qu.:65313
##  Max.   :66.07       Max.   :1648.0    Max.   :88262
##
##  median-property-value  rent-burden      pct-white       pct-af-am
##  Min.   : 116800       Min.   :23.80   Min.   :46.47   Min.   : 0.83
##  1st Qu.: 202700       1st Qu.:26.30   1st Qu.:78.32   1st Qu.: 2.31
##  Median : 273100       Median :29.60   Median :84.93   Median : 3.15
##  Mean   : 331264       Mean   :29.16   Mean   :82.12   Mean   : 4.97
##  3rd Qu.: 392800       3rd Qu.:31.27   3rd Qu.:89.73   3rd Qu.: 6.04
##  Max.   :1000001       Max.   :34.40   Max.   :94.38   Max.   :20.85
##
##   pct-hispanic      pct-am-ind       pct-asian        pct-nh-pi
##  Min.   : 0.630   Min.   :0.0100   Min.   : 0.310   Min.   :0.0000
##  1st Qu.: 2.580   1st Qu.:0.1100   1st Qu.: 1.250   1st Qu.:0.0100
##  Median : 4.360   Median :0.1500   Median : 1.780   Median :0.0200
##  Mean   : 6.953   Mean   :0.2992   Mean   : 3.193   Mean   :0.0213
##  3rd Qu.:10.170   3rd Qu.:0.2200   3rd Qu.: 4.430   3rd Qu.:0.0300
##  Max.   :22.730   Max.   :2.4200   Max.   :10.460   Max.   :0.0700
##
##   pct-multiple     pct-other      eviction-filings   evictions
##  Min.   :0.900   Min.   :0.0000   Min.   :   0.0   Min.   :   0.0
##  1st Qu.:1.340   1st Qu.:0.2025   1st Qu.: 356.8   1st Qu.: 166.8
##  Median :1.630   Median :0.4000   Median :1667.0   Median :1257.5
##  Mean   :1.705   Mean   :0.7345   Mean   :2124.3   Mean   :1333.1
##  3rd Qu.:1.900   3rd Qu.:1.1600   3rd Qu.:3555.5   3rd Qu.:2448.5
##  Max.   :3.180   Max.   :4.9000   Max.   :7301.0   Max.   :3589.0
##                                   NA's   :76       NA's   :76
##  eviction-rate   eviction-filing-rate    low-flag         imputed
##  Min.   :0.000   Min.   :0.000        Min.   :0.0000   Min.   :0.00000
##  1st Qu.:1.252   1st Qu.:1.745        1st Qu.:0.0000   1st Qu.:0.00000
##  Median :1.555   Median :2.695        Median :0.0000   Median :0.00000
##  Mean   :1.865   Mean   :2.941        Mean   :0.3151   Mean   :0.04622
##  3rd Qu.:2.502   3rd Qu.:3.985        3rd Qu.:1.0000   3rd Qu.:0.00000
##  Max.   :4.650   Max.   :7.470        Max.   :1.0000   Max.   :1.00000
##  NA's   :76      NA's   :76
##      subbed
##  Min.   :0
##  1st Qu.:0
##  Median :0
##  Mean   :0
##  3rd Qu.:0
##  Max.   :0
##

As with some of the data we previously looked at, there is a fair amount of missing data here for some of the variables (e.g., eviction-rate). Given that the number of NAs is pretty consistent (76), I’m guessing that data simply are not available for some counties at given years. But, it may also mean that we’ll need to exclude some counties from the analysis altogether, which would merit a note to the reader.

Visual Data Exploration

To begin visually inspecting our data, we can turn to the DataExplorer package. (You may need to install DataExplorer by going to Tools, Install Packages, and entering in that package name.)

DataExplorer helps us do things like visualize missing values and automatically generate histograms, scatter plots, boxplots, and correlation matrices for all the variables in a given dataset.

This is valuable because humans have a much easier time processing large amounts of information when it is visualized in some way. Data tables are great for looking at individual trees, but visuals help us see the forest.

Additionally, visually inspecting data is great for finding outliers (extreme observations). Sometimes the story is not in the general pattern but in the outlier. For example, you may wonder why a particular city has an exceptionally high crime rate. (However, outliers may also be data entry errors that we need to clean up.)

Let’s load DataExplorer first to get access to its functions.

library(DataExplorer)

Visualizing Missing Values

While we can use the base::summary() function to get the number of NA values for each variable, DataExplorer provides us with a plot_missing() function that visualizes that nicely for us. That function just takes a single argument: the name of the data frame you want to analyze.

In our case, we’re currently working with the ma_counties data frame, so we’ll supply that as the lone argument:

plot_missing(ma_counties)

A DataExplorer plot showing missing data

Behold, we get a simple graph that tells us that we have data for almost all of our variables. As we saw earlier, four of the variables have some missing data (actually, a fair amount of missing data!).

The fact that the values are equal (31.93%) for all four variables suggests some entries have missing data for all those variables—though it is possible that some entries have data for eviction-rate but not eviction-filing-rate.
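The percentages in that plot are simple to reproduce by hand, which can be a useful sanity check. The sketch below uses a small, made-up data frame (toy and its columns are assumptions for illustration) and base R only; it is not how plot_missing() is implemented, just the same arithmetic.

```r
# Made-up data: two columns share the same missing rows, one has none
toy <- data.frame(
  year = 2000:2003,
  evictions = c(NA, 10, NA, 12),
  eviction_rate = c(NA, 1.1, NA, 1.3)
)

# is.na() flags each cell; colMeans() turns the flags into a share per column
pct_missing <- round(colMeans(is.na(toy)) * 100, 2)
pct_missing

# The same arithmetic for our data: 76 NAs out of 238 rows
share_in_ma_counties <- round(76 / 238 * 100, 2)
share_in_ma_counties
```

The last line gives 31.93, matching the value shown for the four eviction variables.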

Digging Deeper

We could, of course, quickly examine that assumption:

ma_counties %>%
  filter(is.na(`eviction-filing-rate`) == TRUE | is.na(`eviction-rate`) == TRUE | is.na(`evictions`) == TRUE | is.na(`eviction-filings`) == TRUE) %>%
  select(year, name, `eviction-filing-rate`, `eviction-rate`, evictions, `eviction-filings`)

year	name	eviction-filing-rate	eviction-rate	evictions	eviction-filings
2000	Barnstable County	NA	NA	NA	NA
2001	Barnstable County	NA	NA	NA	NA
2002	Barnstable County	NA	NA	NA	NA
2004	Barnstable County	NA	NA	NA	NA
2005	Barnstable County	NA	NA	NA	NA
2006	Barnstable County	NA	NA	NA	NA
2007	Barnstable County	NA	NA	NA	NA
2008	Barnstable County	NA	NA	NA	NA
2009	Barnstable County	NA	NA	NA	NA
2010	Barnstable County	NA	NA	NA	NA
2011	Barnstable County	NA	NA	NA	NA
2012	Barnstable County	NA	NA	NA	NA
2013	Barnstable County	NA	NA	NA	NA
2000	Berkshire County	NA	NA	NA	NA
2001	Berkshire County	NA	NA	NA	NA
2002	Berkshire County	NA	NA	NA	NA
2003	Berkshire County	NA	NA	NA	NA
2004	Berkshire County	NA	NA	NA	NA
2005	Berkshire County	NA	NA	NA	NA
2006	Berkshire County	NA	NA	NA	NA
2000	Bristol County	NA	NA	NA	NA
2001	Bristol County	NA	NA	NA	NA
2002	Bristol County	NA	NA	NA	NA
2003	Bristol County	NA	NA	NA	NA
2004	Bristol County	NA	NA	NA	NA
2005	Bristol County	NA	NA	NA	NA
2006	Bristol County	NA	NA	NA	NA
2007	Bristol County	NA	NA	NA	NA
2000	Dukes County	NA	NA	NA	NA
2001	Dukes County	NA	NA	NA	NA
2002	Dukes County	NA	NA	NA	NA
2003	Dukes County	NA	NA	NA	NA
2004	Dukes County	NA	NA	NA	NA
2005	Dukes County	NA	NA	NA	NA
2006	Dukes County	NA	NA	NA	NA
2007	Dukes County	NA	NA	NA	NA
2000	Essex County	NA	NA	NA	NA
2001	Essex County	NA	NA	NA	NA
2003	Essex County	NA	NA	NA	NA
2004	Essex County	NA	NA	NA	NA
2005	Essex County	NA	NA	NA	NA
2006	Essex County	NA	NA	NA	NA
2007	Essex County	NA	NA	NA	NA
2008	Essex County	NA	NA	NA	NA
2009	Essex County	NA	NA	NA	NA
2000	Franklin County	NA	NA	NA	NA
2001	Franklin County	NA	NA	NA	NA
2000	Hampden County	NA	NA	NA	NA
2000	Hampshire County	NA	NA	NA	NA
2000	Middlesex County	NA	NA	NA	NA
2001	Middlesex County	NA	NA	NA	NA
2002	Middlesex County	NA	NA	NA	NA
2003	Middlesex County	NA	NA	NA	NA
2004	Middlesex County	NA	NA	NA	NA
2000	Nantucket County	NA	NA	NA	NA
2001	Nantucket County	NA	NA	NA	NA
2002	Nantucket County	NA	NA	NA	NA
2003	Nantucket County	NA	NA	NA	NA
2004	Nantucket County	NA	NA	NA	NA
2000	Norfolk County	NA	NA	NA	NA
2001	Norfolk County	NA	NA	NA	NA
2000	Plymouth County	NA	NA	NA	NA
2001	Plymouth County	NA	NA	NA	NA
2002	Plymouth County	NA	NA	NA	NA
2003	Plymouth County	NA	NA	NA	NA
2004	Plymouth County	NA	NA	NA	NA
2005	Plymouth County	NA	NA	NA	NA
2006	Plymouth County	NA	NA	NA	NA
2007	Plymouth County	NA	NA	NA	NA
2000	Suffolk County	NA	NA	NA	NA
2003	Suffolk County	NA	NA	NA	NA
2004	Suffolk County	NA	NA	NA	NA
2005	Suffolk County	NA	NA	NA	NA
2006	Suffolk County	NA	NA	NA	NA
2007	Suffolk County	NA	NA	NA	NA
2000	Worcester County	NA	NA	NA	NA

We use the vertical bar (|) in our filter to express a logical “OR” (not to be confused with the %>% pipe operator). That is, we’re saying: if eviction-filing-rate is NA, or if eviction-rate is NA, and so on, then include that observation in our filtered data frame.
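To see the difference between “OR” and “AND” in a filter, here is a minimal sketch on a made-up tibble (toy, a, and b are hypothetical names used only for illustration):

```r
library(dplyr)

toy <- tibble(
  id = 1:4,
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 6, 7)
)

# OR (|): keep a row when a is missing, b is missing, or both
either_missing <- toy %>%
  filter(is.na(a) | is.na(b))

# AND (&): keep a row only when a and b are both missing
both_missing <- toy %>%
  filter(is.na(a) & is.na(b))

either_missing$id  # rows 1, 2, and 4
both_missing$id    # row 2 only
```

Note that is.na(a) already returns TRUE or FALSE, so writing is.na(a) == TRUE gives the same result; the explicit comparison just makes the intent more visible.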

Our assumption is correct: the missing data align across those four variables. But that’s a whole lot of data that’s missing. We can quickly examine which counties are even usable:

ma_counties %>%
  select(year, name, `eviction-filing-rate`, `eviction-rate`, evictions, `eviction-filings`) %>%
  filter(complete.cases(.) == FALSE) %>%
  count(name) %>%
  arrange(desc(n))

name	n
Barnstable County	13
Essex County	9
Bristol County	8
Dukes County	8
Plymouth County	8
Berkshire County	7
Suffolk County	6
Middlesex County	5
Nantucket County	5
Franklin County	2
Norfolk County	2
Hampden County	1
Hampshire County	1
Worcester County	1

filter(complete.cases(.) == FALSE) accomplishes the equivalent of the longer filter() line in the previous code chunk. Specifically, base::complete.cases() tests whether each observation has any missing data, outputting TRUE if there are no missing values and FALSE if there are one or more. That function expects you to specify a data frame as the argument; we use . to signal that it should use the piped output from the previous step (the select() statement). Additionally, because we dropped several variables (e.g., population) by not selecting them, the complete.cases() function will only test the few variables we selected.
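A small sketch may make the behavior of complete.cases() more concrete. The data frame below is made up for illustration (toy and its columns are assumptions), and the example uses base R only:

```r
toy <- data.frame(
  name = c("A", "B", "C"),
  evictions = c(10, NA, 30),
  filings = c(12, 15, NA)
)

# One value per row: TRUE only when every column in that row is filled in
all_columns <- complete.cases(toy)
all_columns   # TRUE FALSE FALSE

# Restricting the columns first changes what counts as "complete"
two_columns <- complete.cases(toy[, c("name", "evictions")])
two_columns   # TRUE FALSE TRUE
```

Row C is incomplete in the first check but complete in the second, because the column holding its NA (filings) was dropped. That is the same effect the select() step has in the chunk above.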

It seems that some counties (like Barnstable County and Essex County) have a lot of missing data. Thus, we may not be able to make comparisons for some of the years across all counties.

This doesn’t mean these data are bad but it does mean that we’ll need to be careful about what we can say from our analysis of these data.

Generating Histograms

We may also want to check the frequency of values for the many variables in our dataset. This helps us identify outliers and better understand the distribution of values. A histogram is well suited to that task.

Here, we can use DataExplorer’s plot_histogram() function. That function requires a single argument (although there are some additional optional ones): the data frame from which to plot. The function will then go through that data frame and figure out which variables are sensible to plot on a histogram.

plot_histogram(ma_counties)

A DataExplorer histogram of all the variables (Part 1)

A DataExplorer histogram of all the variables (Part 2)

Notice that the plot_histogram() function automatically renamed our variables. For example, eviction-filings now shows up as eviction.filings. Our original data were not modified, but if we wanted to pipe that data further, we would need to use the new variable names.
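The renaming we see here looks like what base R’s make.names() function does when it converts arbitrary strings into syntactically valid names. Whether plot_histogram() calls make.names() internally is an assumption on my part; the sketch below only illustrates the same style of renaming so you can predict what the new names will be.

```r
original <- c("eviction-filings", "pct-nh-pi", "year")

# make.names() replaces characters that are invalid in R names with a period
renamed <- make.names(original)
renamed
```

Hyphens become periods, while names that are already valid (like year) are left alone.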

The function produced a histogram for almost all of our variables. Some of those variables are kind of pointless (like GEOID) but we’re just exploring and can quickly ignore them. For some variables, like pct.nh.pi, the X axis isn’t very helpful—but we do see that those values tend to be very small across entries.

We’re generating those histograms from all the observations in our dataset, though, which includes a lot of semi-repetition (we wouldn’t expect the percentage of Hispanic residents to change tremendously from year to year for each county).

What if we wanted to generate these histograms, but only for a single year (2014)?

We would just have to use our filter() function once again, and pipe that output (a data frame) to plot_histogram(). Here’s how we could do that:

ma_counties %>%
  filter(year==2014) %>%
  plot_histogram()

A DataExplorer histogram limited to a single year (Part 1)

A DataExplorer histogram limited to a single year (Part 2)

Immediately, we see, for example, that almost all counties have fewer than 40% of their households living in rental units (pct.renter.occupied). One county, however, has more than 60%. Maybe there’s something interesting about that county, and we can start digging into it through our exploratory data analysis.
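Once a histogram points us to an unusual value, a short filter can tell us which observation is behind it. The following is a minimal sketch on made-up values (toy and the county names in it are hypothetical); with the real data you would pipe ma_counties through the same steps.

```r
library(dplyr)

toy <- tibble(
  year = c(2013, 2014, 2014, 2014),
  name = c("County A", "County A", "County B", "County C"),
  `pct-renter-occupied` = c(31.0, 32.5, 64.2, 28.9)
)

# Keep only 2014 rows where more than 60% of households rent
high_renter <- toy %>%
  filter(year == 2014, `pct-renter-occupied` > 60) %>%
  select(name, `pct-renter-occupied`)

high_renter
```

Separating conditions with a comma inside filter() works like a logical “AND”, so both conditions must hold for a row to be kept.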

Generating Bar Plots

DataExplorer also provides a function, plot_bar(), that will plot the frequencies of discrete variables (those stored as chr or factor types).

It works in a similar fashion to plot_histogram(). However, if we only have one chr-type variable of interest (name), we can just call it using our regular object$variable nomenclature:

plot_bar(ma_counties$name)

A DataExplorer bar chart of the name variable

This doesn’t tell us anything too useful in this case, except that each county name comes up an equal number of times in our dataset.

However, if you were interested in which names (string-type) came up most often in a dataset about campaign contributions, for example, this would be a very useful tool for visually exploring things.
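If you want the numbers behind a bar plot like that, a frequency table gives you the same information in text form. Here is a minimal base R sketch using a made-up vector of donor surnames (donors is hypothetical and stands in for a name column from a contributions dataset):

```r
donors <- c("Smith", "Lee", "Smith", "Garcia", "Smith", "Lee")

# table() counts each distinct value; sort() puts the most frequent first
name_counts <- sort(table(donors), decreasing = TRUE)
name_counts
```

Here Smith appears three times, Lee twice, and Garcia once, which is the ordering a bar plot of the same data would make visible at a glance.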

Generating Your Own Plots

The DataExplorer functions we’ve looked at so far can be called “wrapper” functions because they’re primarily geared toward expediting common uses of other functions. Specifically, DataExplorer makes extensive use of the ggplot2 visualization package, wrapping a bunch of ggplot2’s functions into a single one like plot_histogram().

ggplot2 is quite powerful and can get pretty complicated, and we’ll cover some of its more advanced features later in the book. For now, we just care about producing some basic plots with little care for visual appeal (it’s just for exploration, after all), so we’ll keep things simple.

First, load the ggplot2 package (if you loaded tidyverse previously, you don’t need to load ggplot2 as it is part of the tidyverse metapackage):

library(ggplot2)

ggplot2 works by layering information. For example, we might first set up a basic plot layer that defines the aesthetics; then add a layer with points; then add a layer with lines connecting the points; and finally add a text layer with a title. Those four layers, together, would comprise a line graph.

Creating a Line Graph

Let’s illustrate that by producing a graphic that addresses the following question: What is the trajectory of eviction filing rates in Hampshire County across the years in our dataset?

As a reminder, here are the variables in our data frame:

glimpse(ma_counties)

## Rows: 238
## Columns: 27
## $ GEOID                        <dbl> 25001, 25001, 25001, 25001, 25001, 25001,...
## $ year                         <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006,...
## $ name                         <chr> "Barnstable County", "Barnstable County",...
## $ `parent-location`            <chr> "Massachusetts", "Massachusetts", "Massac...
## $ population                   <dbl> 222230, 222230, 222230, 222230, 222230, 2...
## $ `poverty-rate`               <dbl> 6.89, 6.89, 6.89, 6.89, 6.89, 4.31, 4.31,...
## $ `renter-occupied-households` <dbl> 21035, 21096, 21157, 21218, 21279, 21340,...
## $ `pct-renter-occupied`        <dbl> 22.18, 22.18, 22.18, 22.18, 22.18, 18.91,...
## $ `median-gross-rent`          <dbl> 723, 723, 723, 723, 723, 1045, 1045, 1045...
## $ `median-household-income`    <dbl> 45933, 45933, 45933, 45933, 45933, 60096,...
## $ `median-property-value`      <dbl> 178800, 178800, 178800, 178800, 178800, 3...
## $ `rent-burden`                <dbl> 27.7, 27.7, 27.7, 27.7, 27.7, 32.8, 32.8,...
## $ `pct-white`                  <dbl> 93.38, 93.38, 93.38, 93.38, 93.38, 93.05,...
## $ `pct-af-am`                  <dbl> 1.73, 1.73, 1.73, 1.73, 1.73, 1.90, 1.90,...
## $ `pct-hispanic`               <dbl> 1.35, 1.35, 1.35, 1.35, 1.35, 1.76, 1.76,...
## $ `pct-am-ind`                 <dbl> 0.52, 0.52, 0.52, 0.52, 0.52, 0.37, 0.37,...
## $ `pct-asian`                  <dbl> 0.61, 0.61, 0.61, 0.61, 0.61, 0.93, 0.93,...
## $ `pct-nh-pi`                  <dbl> 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01,...
## $ `pct-multiple`               <dbl> 1.56, 1.56, 1.56, 1.56, 1.56, 1.58, 1.58,...
## $ `pct-other`                  <dbl> 0.83, 0.83, 0.83, 0.83, 0.83, 0.40, 0.40,...
## $ `eviction-filings`           <dbl> NA, NA, NA, 767, NA, NA, NA, NA, NA, NA, ...
## $ evictions                    <dbl> NA, NA, NA, 751, NA, NA, NA, NA, NA, NA, ...
## $ `eviction-rate`              <dbl> NA, NA, NA, 3.54, NA, NA, NA, NA, NA, NA,...
## $ `eviction-filing-rate`       <dbl> NA, NA, NA, 3.61, NA, NA, NA, NA, NA, NA,...
## $ `low-flag`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ imputed                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ subbed                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Selecting the data

Let’s start by producing a data frame that only includes information from Hampshire County. We can do that by piping our data to a filter() statement that only includes the desired data:

ma_counties %>%
  filter(name=="Hampshire County")

GEOID	year	name	parent-location	population	poverty-rate	renter-occupied-households	pct-renter-occupied	median-gross-rent	median-household-income	median-property-value	rent-burden	pct-white	pct-af-am	pct-hispanic	pct-am-ind	pct-asian	pct-nh-pi	pct-multiple	pct-other	eviction-filings	evictions	eviction-rate	eviction-filing-rate	low-flag	imputed
25015	2000	Hampshire County	Massachusetts	152251	9.40	19621	35.05	631	46098	142400	26.3	89.54	1.80	3.42	0.16	3.39	0.04	1.46	0.19	NA	NA	NA	NA	0	0
25015	2001	Hampshire County	Massachusetts	152251	9.40	19629	35.05	631	46098	142400	26.3	89.54	1.80	3.42	0.16	3.39	0.04	1.46	0.19	113	102	0.52	0.58	1	0
25015	2002	Hampshire County	Massachusetts	152251	9.40	19637	35.05	631	46098	142400	26.3	89.54	1.80	3.42	0.16	3.39	0.04	1.46	0.19	98	79	0.40	0.50	1	0
25015	2003	Hampshire County	Massachusetts	152251	9.40	19645	35.05	631	46098	142400	26.3	89.54	1.80	3.42	0.16	3.39	0.04	1.46	0.19	121	97	0.49	0.62	1	1
25015	2004	Hampshire County	Massachusetts	152251	9.40	19653	35.05	631	46098	142400	26.3	89.54	1.80	3.42	0.16	3.39	0.04	1.46	0.19	249	197	1.00	1.27	0	1
25015	2005	Hampshire County	Massachusetts	155160	5.28	19661	32.08	847	57293	252500	32.3	87.92	2.08	4.36	0.15	3.67	0.01	1.58	0.24	343	247	1.26	1.74	0	0
25015	2006	Hampshire County	Massachusetts	155160	5.28	19669	32.08	847	57293	252500	32.3	87.92	2.08	4.36	0.15	3.67	0.01	1.58	0.24	285	259	1.32	1.45	0	0
25015	2007	Hampshire County	Massachusetts	155160	5.28	19677	32.08	847	57293	252500	32.3	87.92	2.08	4.36	0.15	3.67	0.01	1.58	0.24	362	266	1.35	1.84	0	0
25015	2008	Hampshire County	Massachusetts	155160	5.28	19685	32.08	847	57293	252500	32.3	87.92	2.08	4.36	0.15	3.67	0.01	1.58	0.24	462	258	1.31	2.35	0	0
25015	2009	Hampshire County	Massachusetts	155160	5.28	19693	32.08	847	57293	252500	32.3	87.92	2.08	4.36	0.15	3.67	0.01	1.58	0.24	513	270	1.37	2.60	0	0
25015	2010	Hampshire County	Massachusetts	158080	5.99	19701	33.56	906	61264	263500	31.1	86.19	2.24	4.72	0.15	4.51	0.03	1.95	0.22	478	252	1.28	2.43	0	0
25015	2011	Hampshire County	Massachusetts	160759	6.17	19951	34.16	965	61368	265400	32.6	84.93	2.60	5.15	0.12	5.27	0.04	1.82	0.08	583	266	1.33	2.92	0	0
25015	2012	Hampshire County	Massachusetts	160759	6.17	20202	34.16	965	61368	265400	32.6	84.93	2.60	5.15	0.12	5.27	0.04	1.82	0.08	601	259	1.28	2.98	0	0
25015	2013	Hampshire County	Massachusetts	160759	6.17	20452	34.16	965	61368	265400	32.6	84.93	2.60	5.15	0.12	5.27	0.04	1.82	0.08	524	171	0.84	2.56	0	0
25015	2014	Hampshire County	Massachusetts	160759	6.17	20702	34.16	965	61368	265400	32.6	84.93	2.60	5.15	0.12	5.27	0.04	1.82	0.08	523	166	0.80	2.53	0	0
25015	2015	Hampshire County	Massachusetts	160759	6.17	20953	34.16	965	61368	265400	32.6	84.93	2.60	5.15	0.12	5.27	0.04	1.82	0.08	491	147	0.70	2.34	0	0
25015	2016	Hampshire County	Massachusetts	160759	6.17	21203	34.16	965	61368	265400	32.6	84.93	2.60	5.15	0.12	5.27	0.04	1.82	0.08	479	122	0.58	2.26	0	1

Adding the base layer

Once we have just the data we want to visualize, we’ll proceed to creating our visualization. With ggplot2, this begins by calling the ggplot() function and specifying the dataset you want to use (the first argument, named data) and the key aesthetic variables you’d like to use by default on subsequent layers (which we supply via the aes() function as the second argument, named mapping).

Common top-level aesthetic mappings that we can set within the aes() function include:

x: the variable that goes on the X axis
y: the variable that goes on the Y axis (optional)
group: any grouping variable to signal a connection between data points (optional)
color: any variable for color-coding data points’ outside edges (optional)
fill: any variable for color-coding data points’ inside fills (optional)

In this case, we’ll create a basic base layer that only maps the x and y variables (the years and the eviction filing rates). Let’s pipe our information to the ggplot() function (which will automatically fill in the information about the data frame object) and fill in the desired aesthetics:

ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`))

A line graph with just a base layer

Our plot now has scaled values for the X axis and Y axis (look at the labels placed on both axes), but there’s no data plotted on it just yet.

Adding points

To add some points on that plot, we’ll add another layer by putting a + sign at the end of the previous layer. (Note that we’re no longer piping the information. We are adding layers to our ggplot.)

Then, we’ll add in points by simply applying the geom_point() function as our second layer.

ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`)) +
    geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

A line graph with a second layer that includes points

Note: R tells us that it removed a row because we don’t have data for eviction-filing-rate in the year 2000. If we had removed observations with NA values, that warning would not have been given.
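If you would rather avoid that warning, one option is to drop the rows with missing values before plotting. The sketch below uses a small tibble holding the first four Hampshire County values shown earlier (the object names toy and plot_data are my own and only for illustration):

```r
library(dplyr)
library(ggplot2)

toy <- tibble(
  year = 2000:2003,
  `eviction-filing-rate` = c(NA, 0.58, 0.50, 0.62)
)

# Keep only the rows where the variable we plot is present
plot_data <- toy %>%
  filter(!is.na(`eviction-filing-rate`))

# With no NA values left, geom_point() has nothing to remove or warn about
p <- ggplot(plot_data, aes(x = year, y = `eviction-filing-rate`)) +
  geom_point()
p
```

An alternative is to leave the data alone and write geom_point(na.rm = TRUE), which removes the missing values silently. For exploration, keeping the warning visible can be a helpful reminder that some observations are missing.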

Now we have points! Next up, we’ll add a line to connect our points.

Adding a line

We can place a line to connect our points by adding a third layer. Similar to our previous layer, we’ll just apply the geom_line() function to that new layer.

ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`)) +
    geom_point() +
    geom_line()

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 row(s) containing missing values (geom_path).

A line graph with a third layer that includes a line

Note: We could do without the geom_point() layer in creating a line graph, but it’s often helpful to include some marker for each data point.
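The same layering approach extends to the optional aesthetics and the title layer mentioned earlier. As a sketch under assumptions (the toy data frame, its county names, and its values are made up for illustration), here is a line graph that uses group and color to draw one line per county and adds a title with labs():

```r
library(ggplot2)

toy <- data.frame(
  year = rep(2010:2012, times = 2),
  name = rep(c("County A", "County B"), each = 3),
  rate = c(2.4, 2.9, 3.0, 1.1, 1.3, 1.2)
)

# group tells geom_line() which points belong together; color tells them apart
p <- ggplot(toy, aes(x = year, y = rate, group = name, color = name)) +
  geom_point() +
  geom_line() +
  labs(title = "Eviction filing rate by county (made-up values)")
p
```

With the real data, you could skip the filter() step (or filter to a few counties) and map name to group and color in the same way to compare trajectories.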

Creating other charts

We can replicate our earlier charts using ggplot2 functions like geom_histogram() and geom_bar().
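As a rough sketch of what that looks like, the code below builds one histogram and one bar plot from a made-up data frame (toy and its values are assumptions for illustration). Unlike the DataExplorer functions, each plot here covers a single variable that we choose ourselves.

```r
library(ggplot2)

toy <- data.frame(
  name = c("County A", "County A", "County B", "County B", "County B"),
  rate = c(1.2, 1.5, 2.8, 3.1, 2.9)
)

# Histogram: a numerical variable on the X axis; ggplot2 counts values per bin
hist_plot <- ggplot(toy, aes(x = rate)) +
  geom_histogram(bins = 5)

# Bar plot: a discrete variable on the X axis; ggplot2 counts rows per category
bar_plot <- ggplot(toy, aes(x = name)) +
  geom_bar()

hist_plot
bar_plot
```

Neither plot needs a y aesthetic, because both geoms compute the counts for us.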

We’ll cover ggplot2’s more advanced features later in the book, but here’s a cheatsheet if you already want to do more with it. For now, quick and dirty is all we need.

Adding interactivity

The nice thing about producing our own plots using ggplot2 is that we can easily turn them into interactive plots by passing them to the ggplotly() function from the plotly package, provided by the website Plot.ly. (You may need to install the package first.)

As usual, we begin by loading the package:

library(plotly)

##
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
##
##     last_plot

## The following object is masked from 'package:stats':
##
##     filter

## The following object is masked from 'package:graphics':
##
##     layout

The plotly package has some more advanced features, but it can usually take a given ggplot2 object and automatically figure out which aspects of the plot could be enhanced through interactive elements.

The plotly package is a work in progress and is written by different authors than ggplot2. Thus, some advanced ggplot2 features, like custom annotations, work poorly with plotly at the time of writing. For exploratory graphics, this isn’t usually an issue, but it can be problematic if you want to produce production-level ggplot2 graphics.

Applying ggplotly()

Let’s recreate the line graph from above, but make it so that we can get precise values just by hovering over each point. First, we’ll store our ggplot in an object called g. Then, we will supply that object as the lone argument for ggplotly():

g <- ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`)) +
    geom_point() +
    geom_line()

ggplotly(g)

A screenshot of the interactive plot

The above is just a screenshot of the interactive version that you should get when running the code.
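ggplotly() also accepts a few optional arguments. One of them is tooltip, which controls which aesthetic mappings appear when you hover over a point. Here is a minimal sketch using a small data frame with four of the Hampshire County values shown earlier (the object names toy and interactive, and the shortened column name rate, are my own choices for illustration):

```r
library(ggplot2)
library(plotly)

toy <- data.frame(
  year = 2001:2004,
  rate = c(0.58, 0.50, 0.62, 1.27)
)

g <- ggplot(toy, aes(x = year, y = rate)) +
  geom_point() +
  geom_line()

# Show only the x and y values in the hover box
interactive <- ggplotly(g, tooltip = c("x", "y"))
interactive
```

For a plot this simple the default tooltip is already fine; the argument becomes more useful once a plot maps several variables and the hover box starts to feel crowded.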

By using these exploratory visualization tactics, you should be able to start identifying even more potential issues with your data, questions that you can explore in greater depth with follow-up analysis, and potential trends that you might have missed if you were just looking at a big table packed with numbers.
