Now that we have learned some of the basics of R, let’s try loading a small, real-world dataset. In this tutorial, you will be learning:
How to load data from a CSV file into a data frame
How to quickly check for errors in the importing of data
How to select information from a variable in a data frame
We’ll be primarily working with data frames over the course of the book. (Variations of this include data tables and tibbles. If you see those terms, they generally refer to the idea of a data frame. Although there are some technical differences, that won’t be an issue in this book.)
Data frames organize information into rows and columns, much like a spreadsheet you’d find in Microsoft Excel or Google Sheets. Typically, each column will refer to a different variable (e.g., name, age, major) and each row will refer to a different observation (e.g., a student).
We can create a data frame within R at any point but now would be a good time to practice bringing a dataset into R. In order to do that, we’ll load a CSV file (comma-separated values) using the readr::read_csv()
function.
We can get access to the read_csv()
function by loading the metapackage called tidyverse
(which loads readr
alongside some other highly useful packages) by entering the following code:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Don’t be scared by the lengthy output resulting from loading the package. It is just telling us that the tidyverse
package has loaded several sub-packages (e.g., ggplot2
and tibble
). The conflicts part just warns us that some tidyverse
functions will now override some previously loaded functions of the same name. For example, if we use the filter()
function now, it will come from the dplyr
package and not the stats
package (which is loaded by default). To use the stats
version of the function, we’d write: stats::filter()
.
Now that we’ve loaded readr
, let’s open our first CSV file.
We can load data remotely (i.e., we’re downloading it from a website) or locally (i.e., we’ve saved it on our computer). For most of our exercises, we’ll load data remotely.
Data journalists will typically keep a local copy of a dataset they use because you never know if someone will take it down or modify it without your knowledge. For the purposes of the tutorials in this book, we will be loading the data remotely.
As an illustration, we’ll be loading data about the operating expenses and revenues for five collegiate soccer programs in Massachusetts, which I obtained from the U.S. Department of Education’s Equity in Athletics Data Analysis program.
We will use the readr::read_csv()
function to load a CSV file from this URL and store it in the object I’m calling soccer_expenditures
.
Do note that we are supplying that lone argument as a string, and we thus need to place it within quotation marks ("
). Without the quotation marks, R would look for an object called https ... .csv
within our current R environment — which we do not have.
soccer_expenditures <- read_csv("https://dds.rodrigozamith.com/files/college_soccer_expenditures_2018.csv")
The message underneath the code tells us that it imported eight variables (columns), with the variable name to the left of the equal sign and the data type for that column to the right of the equal sign.
Alternatively, if we had saved that dataset on our computer, we could use the readr::read_csv()
function to load a CSV file named college_soccer_expenditures_2018.csv
(stored in the data/
subfolder of my working directory) and store it in the object I’m calling soccer_expenditures
.
soccer_expenditures <- read_csv("data/college_soccer_expenditures_2018.csv")
## Rows: 6 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): institution_name
## dbl (7): total_undergraduates, men_number_participants_in_team, women_number...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you get an error saying that R can’t find the data, it’s probably because you didn’t set your working directory to the right place (e.g., your project directory).
You can often think of each variable (column) in the data frame as a different vector of information (information about that variable for each of the observations (rows) in our dataset).
We can look at the data in the data frame by simply typing in the object’s name:
soccer_expenditures
institution_name | total_undergraduates | men_number_participants_in_team | women_number_participants_in_team | men_operating_expenses_per_participant | women_operating_expenses_per_participant | men_total_revenue | women_total_revenue |
---|---|---|---|---|---|---|---|
Boston College | 9526 | 30 | 25 | 6434 | 8233 | 1349479 | 1786830 |
Boston University | 16906 | 34 | 30 | 6647 | 8085 | 1352594 | 1711306 |
Northeastern University | 13760 | 28 | 25 | 8378 | 6368 | 1257790 | 1336416 |
University of Massachusetts-Amherst | 21706 | 27 | 26 | 5198 | 5965 | 712371 | 1142737 |
University of Massachusetts-Boston | 9897 | 26 | 27 | 1459 | 1865 | 137753 | 171560 |
University of Massachusetts-Lowell | 10585 | 33 | 26 | 4638 | 5863 | 841686 | 844020 |
If we were in RStudio, an object called soccer_expenditures
would also appear under your Environment tab when we created it while using the read_csv()
function. You could just click on that object to see the data within RStudio’s GUI, which can make it a little easier to navigate.
Congratulations! You have created a data frame using information imported from a CSV file!
The next step is to confirm the data were imported the way we wanted it to. Here, I would recommend that you follow three steps.
Start by checking the first few observations of the object to check if there’s anything odd using the head()
function, which shows just the first six observations in the dataset. (In this case, we only have six schools so it is effectively showing us the entire dataset.)
head(soccer_expenditures)
institution_name | total_undergraduates | men_number_participants_in_team | women_number_participants_in_team | men_operating_expenses_per_participant | women_operating_expenses_per_participant | men_total_revenue | women_total_revenue |
---|---|---|---|---|---|---|---|
Boston College | 9526 | 30 | 25 | 6434 | 8233 | 1349479 | 1786830 |
Boston University | 16906 | 34 | 30 | 6647 | 8085 | 1352594 | 1711306 |
Northeastern University | 13760 | 28 | 25 | 8378 | 6368 | 1257790 | 1336416 |
University of Massachusetts-Amherst | 21706 | 27 | 26 | 5198 | 5965 | 712371 | 1142737 |
University of Massachusetts-Boston | 9897 | 26 | 27 | 1459 | 1865 | 137753 | 171560 |
University of Massachusetts-Lowell | 10585 | 33 | 26 | 4638 | 5863 | 841686 | 844020 |
At first glance, the data seem to have been imported cleanly. Each row represents a different institution (school
).
Next, we’ll use the glimpse()
function to check all the column names and make sure they’re associated with the right data type.
glimpse(soccer_expenditures)
## Rows: 6
## Columns: 8
## $ institution_name <chr> "Boston College", "Boston Uni...
## $ total_undergraduates <dbl> 9526, 16906, 13760, 21706, 98...
## $ men_number_participants_in_team <dbl> 30, 34, 28, 27, 26, 33
## $ women_number_participants_in_team <dbl> 25, 30, 25, 26, 27, 26
## $ men_operating_expenses_per_participant <dbl> 6434, 6647, 8378, 5198, 1459,...
## $ women_operating_expenses_per_participant <dbl> 8233, 8085, 6368, 5965, 1865,...
## $ men_total_revenue <dbl> 1349479, 1352594, 1257790, 71...
## $ women_total_revenue <dbl> 1786830, 1711306, 1336416, 11...
The pillar::glimpse()
function, which is loaded with tidyverse
, is effectively a nicer version of the str()
function.
The first couple of lines tell us that the object (a data frame) has 6 observations (rows) and 8 variables (columns). Then, we get the specification of each variable in the dataset (the lines with $
at the beginning), including the data type for the variable and a glimpse at the first few values. (The dbl
type is roughly equivalent to the num
data type we previously encountered.)
All of our variables appear to have the data types we’d expect for them (e.g., dbl
for variables we would think would be numerical (e.g., total_undergraduates
) and chr
for those we expect to be text (e.g., institution_name
)).
We’ll also want to check to see if there are any extreme values in our dataset that may have been the result of information being entered incorrectly. To do that, we’ll use the summary()
function, which provides us with some summary statistics about each variable. Here is how we would apply that function to our data frame:
summary(soccer_expenditures)
## institution_name total_undergraduates men_number_participants_in_team
## Length:6 Min. : 9526 Min. :26.00
## Class :character 1st Qu.:10069 1st Qu.:27.25
## Mode :character Median :12172 Median :29.00
## Mean :13730 Mean :29.67
## 3rd Qu.:16120 3rd Qu.:32.25
## Max. :21706 Max. :34.00
## women_number_participants_in_team men_operating_expenses_per_participant
## Min. :25.00 Min. :1459
## 1st Qu.:25.25 1st Qu.:4778
## Median :26.00 Median :5816
## Mean :26.50 Mean :5459
## 3rd Qu.:26.75 3rd Qu.:6594
## Max. :30.00 Max. :8378
## women_operating_expenses_per_participant men_total_revenue women_total_revenue
## Min. :1865 Min. : 137753 Min. : 171560
## 1st Qu.:5888 1st Qu.: 744700 1st Qu.: 918699
## Median :6166 Median :1049738 Median :1239576
## Mean :6063 Mean : 941946 Mean :1165478
## 3rd Qu.:7656 3rd Qu.:1326557 3rd Qu.:1617584
## Max. :8233 Max. :1352594 Max. :1786830
Glancing at that output, we see a pretty big range for some of our variables. This makes sense: A larger school will have a higher number of total_undergraduates
and generate more revenue (e.g., women_total_revenue
) as a result of the larger enrollment and the fact the school may participate in a larger collegiate sports conference. However, none of those numbers look to be outside of the expected range: we don’t see any negative revenues or revenue worth billions of dollars. Other variables, like the size of the team (e.g., men_number_participants_in_team
) have a narrower range in line with the expected size of a full team roster.
In short, I can feel reasonably confident that there weren’t any errors resulting from the importing process, and that the data aren’t messy at all in this instance.
Oftentimes, we don’t need to work with all the data in a dataset, only portions that are of interest to us.
To view the observations related to a variable in a data frame, we access it with the following syntax: object$variable
.
For example, here is how we can access the values in the total_undergraduates
variable in the soccer_expenditures
object:
soccer_expenditures$total_undergraduates
## [1] 9526 16906 13760 21706 9897 10585
The result is a vector with the values from the total_undergraduates
variable in the soccer_expenditures
object.
Let’s try taking the mean of the variable women_total_revenue
. Recall that mean()
requires us to feed it a vector with numbers from which the mean will be taken. So, we can combine what we’ve learned thusly:
mean(soccer_expenditures$women_total_revenue)
## [1] 1165478
The result is a vector with a single element, 1165478
. That’s the mean of all the total revenue values for women’s sports in our dataset.