Loading Data Into R

Introduction

Now that we have learned some of the basics of R, let’s try loading a small, real-world dataset. In this tutorial, you will be learning:

  • How to load data from a CSV file into a data frame

  • How to quickly check for errors in the importing of data

  • How to select information from a variable in a data frame

Meet the Data Frame

We’ll be primarily working with data frames over the course of the book. (Variations of this include data tables and tibbles. If you see those terms, they generally refer to the idea of a data frame. Although there are some technical differences, that won’t be an issue in this book.)

Data frames organize information into rows and columns, much like a spreadsheet you’d find in Microsoft Excel or Google Sheets. Typically, each column will refer to a different variable (e.g., name, age, major) and each row will refer to a different observation (e.g., a student).

We can create a data frame within R at any point but now would be a good time to practice bringing a dataset into R. In order to do that, we’ll load a CSV file (comma-separated values) using the readr::read_csv() function.

We can get access to the read_csv() function by loading the metapackage called tidyverse (which loads readr alongside some other highly useful packages) by entering the following code:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Don’t be scared by the lengthy output resulting from loading the package. It is just telling us that the tidyverse package has loaded several sub-packages (e.g., ggplot2 and tibble). The conflicts part just warns us that some tidyverse functions will now override some previously loaded functions of the same name. For example, if we use the filter() function now, it will come from the dplyr package and not the stats package (which is loaded by default). To use the stats version of the function, we’d write: stats::filter().

Loading Data

Now that we’ve loaded readr, let’s open our first CSV file.

We can load data remotely (i.e., we’re downloading it from a website) or locally (i.e., we’ve saved it on our computer). For most of our exercises, we’ll load data remotely.

Data journalists will typically keep a local copy of a dataset they use because you never know if someone will take it down or modify it without your knowledge. For the purposes of the tutorials in this book, we will be loading the data remotely.

Loading data remotely

As an illustration, we’ll be loading data about the operating expenses and revenues for five collegiate soccer programs in Massachusetts, which I obtained from the U.S. Department of Education’s Equity in Athletics Data Analysis program.

We will use the readr::read_csv() function to load a CSV file from this URL and store it in the object I’m calling soccer_expenditures.

Do note that we are supplying that lone argument as a string, and we thus need to place it within quotation marks ("). Without the quotation marks, R would look for an object called https ...​ .csv within our current R environment — which we do not have.

soccer_expenditures <- read_csv("https://dds.rodrigozamith.com/files/college_soccer_expenditures_2018.csv")

The message underneath the code tells us that it imported eight variables (columns), with the variable name to the left of the equal sign and the data type for that column to the right of the equal sign.

Loading data locally

Alternatively, if we had saved that dataset on our computer, we could use the readr::read_csv() function to load a CSV file named college_soccer_expenditures_2018.csv (stored in the data/ subfolder of my working directory) and store it in the object I’m calling soccer_expenditures.

soccer_expenditures <- read_csv("data/college_soccer_expenditures_2018.csv")
## Rows: 6 Columns: 8

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): institution_name
## dbl (7): total_undergraduates, men_number_participants_in_team, women_number...

##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you get an error saying that R can’t find the data, it’s probably because you didn’t set your working directory to the right place (e.g., your project directory).

Viewing a Data Frame

You can often think of each variable (column) in the data frame as a different vector of information (information about that variable for each of the observations (rows) in our dataset).

We can look at the data in the data frame by simply typing in the object’s name:

soccer_expenditures
institution_name total_undergraduates men_number_participants_in_team women_number_participants_in_team men_operating_expenses_per_participant women_operating_expenses_per_participant men_total_revenue women_total_revenue
Boston College 9526 30 25 6434 8233 1349479 1786830
Boston University 16906 34 30 6647 8085 1352594 1711306
Northeastern University 13760 28 25 8378 6368 1257790 1336416
University of Massachusetts-Amherst 21706 27 26 5198 5965 712371 1142737
University of Massachusetts-Boston 9897 26 27 1459 1865 137753 171560
University of Massachusetts-Lowell 10585 33 26 4638 5863 841686 844020

If we were in RStudio, an object called soccer_expenditures would also appear under your Environment tab when we created it while using the read_csv() function. You could just click on that object to see the data within RStudio’s GUI, which can make it a little easier to navigate.

Congratulations! You have created a data frame using information imported from a CSV file!

Double-Checking the Import

The next step is to confirm the data were imported the way we wanted it to. Here, I would recommend that you follow three steps.

Step 1: Check the first few observations

Start by checking the first few observations of the object to check if there’s anything odd using the head() function, which shows just the first six observations in the dataset. (In this case, we only have six schools so it is effectively showing us the entire dataset.)

head(soccer_expenditures)
institution_name total_undergraduates men_number_participants_in_team women_number_participants_in_team men_operating_expenses_per_participant women_operating_expenses_per_participant men_total_revenue women_total_revenue
Boston College 9526 30 25 6434 8233 1349479 1786830
Boston University 16906 34 30 6647 8085 1352594 1711306
Northeastern University 13760 28 25 8378 6368 1257790 1336416
University of Massachusetts-Amherst 21706 27 26 5198 5965 712371 1142737
University of Massachusetts-Boston 9897 26 27 1459 1865 137753 171560
University of Massachusetts-Lowell 10585 33 26 4638 5863 841686 844020

At first glance, the data seem to have been imported cleanly. Each row represents a different institution (school).

Step 2: Check the structure

Next, we’ll use the glimpse() function to check all the column names and make sure they’re associated with the right data type.

glimpse(soccer_expenditures)
## Rows: 6
## Columns: 8
## $ institution_name                         <chr> "Boston College", "Boston Uni...
## $ total_undergraduates                     <dbl> 9526, 16906, 13760, 21706, 98...
## $ men_number_participants_in_team          <dbl> 30, 34, 28, 27, 26, 33
## $ women_number_participants_in_team        <dbl> 25, 30, 25, 26, 27, 26
## $ men_operating_expenses_per_participant   <dbl> 6434, 6647, 8378, 5198, 1459,...
## $ women_operating_expenses_per_participant <dbl> 8233, 8085, 6368, 5965, 1865,...
## $ men_total_revenue                        <dbl> 1349479, 1352594, 1257790, 71...
## $ women_total_revenue                      <dbl> 1786830, 1711306, 1336416, 11...

The pillar::glimpse() function, which is loaded with tidyverse, is effectively a nicer version of the str() function.

The first couple of lines tell us that the object (a data frame) has 6 observations (rows) and 8 variables (columns). Then, we get the specification of each variable in the dataset (the lines with $ at the beginning), including the data type for the variable and a glimpse at the first few values. (The dbl type is roughly equivalent to the num data type we previously encountered.)

All of our variables appear to have the data types we’d expect for them (e.g., dbl for variables we would think would be numerical (e.g., total_undergraduates) and chr for those we expect to be text (e.g., institution_name)).

Step 3: Check the range of values

We’ll also want to check to see if there are any extreme values in our dataset that may have been the result of information being entered incorrectly. To do that, we’ll use the summary() function, which provides us with some summary statistics about each variable. Here is how we would apply that function to our data frame:

summary(soccer_expenditures)
##  institution_name   total_undergraduates men_number_participants_in_team
##  Length:6           Min.   : 9526        Min.   :26.00
##  Class :character   1st Qu.:10069        1st Qu.:27.25
##  Mode  :character   Median :12172        Median :29.00
##                     Mean   :13730        Mean   :29.67
##                     3rd Qu.:16120        3rd Qu.:32.25
##                     Max.   :21706        Max.   :34.00
##  women_number_participants_in_team men_operating_expenses_per_participant
##  Min.   :25.00                     Min.   :1459
##  1st Qu.:25.25                     1st Qu.:4778
##  Median :26.00                     Median :5816
##  Mean   :26.50                     Mean   :5459
##  3rd Qu.:26.75                     3rd Qu.:6594
##  Max.   :30.00                     Max.   :8378
##  women_operating_expenses_per_participant men_total_revenue women_total_revenue
##  Min.   :1865                             Min.   : 137753   Min.   : 171560
##  1st Qu.:5888                             1st Qu.: 744700   1st Qu.: 918699
##  Median :6166                             Median :1049738   Median :1239576
##  Mean   :6063                             Mean   : 941946   Mean   :1165478
##  3rd Qu.:7656                             3rd Qu.:1326557   3rd Qu.:1617584
##  Max.   :8233                             Max.   :1352594   Max.   :1786830

Glancing at that output, we see a pretty big range for some of our variables. This makes sense: A larger school will have a higher number of total_undergraduates and generate more revenue (e.g., women_total_revenue) as a result of the larger enrollment and the fact the school may participate in a larger collegiate sports conference. However, none of those numbers look to be outside of the expected range: we don’t see any negative revenues or revenue worth billions of dollars. Other variables, like the size of the team (e.g., men_number_participants_in_team) have a narrower range in line with the expected size of a full team roster.

In short, I can feel reasonably confident that there weren’t any errors resulting from the importing process, and that the data aren’t messy at all in this instance.

Working with Variables

Oftentimes, we don’t need to work with all the data in a dataset, only portions that are of interest to us.

To view the observations related to a variable in a data frame, we access it with the following syntax: object$variable.

For example, here is how we can access the values in the total_undergraduates variable in the soccer_expenditures object:

soccer_expenditures$total_undergraduates
## [1]  9526 16906 13760 21706  9897 10585

The result is a vector with the values from the total_undergraduates variable in the soccer_expenditures object.

Combining variables and functions

Let’s try taking the mean of the variable women_total_revenue. Recall that mean() requires us to feed it a vector with numbers from which the mean will be taken. So, we can combine what we’ve learned thusly:

mean(soccer_expenditures$women_total_revenue)
## [1] 1165478

The result is a vector with a single element, 1165478. That’s the mean of all the total revenue values for women’s sports in our dataset.