Working with Multiple Datasets

There are occasions when data journalists can write a compelling news story using a single dataset. This might be because they have a very rich data source, or because they are just writing a very simple story.

There are many times, though, when you will probably want to combine different datasets to help you tackle more meaningful issues, and tackle issues more meaningfully.

Many of the datasets you will come across are collected for very narrow purposes, or to address a very specific set of questions that the data collector had in mind. As journalists, we often write for broader audiences, and take an interest in broad implications. So, we might look at one of those datasets and decide it is too limited to be useful.

Instead of just dismissing those data, ask yourself if there might be a way to fill in the gaps with another dataset.

Benefits and Challenges of Combining Datasets

There are several benefits to combining datasets. For one, it allows you to paint a fuller picture of some phenomenon and consider potential connections that others may be missing.

For example, journalists are often interested in the relationship between the characteristics of a community and some challenge faced by that community. They might be interested in whether arrests are localized in neighborhoods with certain racial and ethnic groups, if predatory payday lending locations are concentrated in poorer areas, or if employment rates (or other proxies for community stability) correspond with student performance on standardized tests (a crude proxy for quality of education).

A single dataset is unlikely to be sufficient to tackle any of those interests. However, data for arrests, predatory payday locations, and student performance can be paired with U.S. Census Bureau data that describes the characteristics of communities. Pairing just two datasets can allow a journalist to tackle far more interesting questions than simply listing a ranking of the best schools or creating maps with payday lending locations or where arrests occurred.

However, combining datasets can be challenging, if only because the datasets require a common identifier in order to be merged. For example, that dataset about arrests needs to include at least one geographic identifier (e.g., a ZIP code) that is also present in the dataset about neighborhood demographics. That common identifier allows us to link the observations in the two datasets, ensuring that the correct observations from one dataset are connected to the corresponding observations in the other.
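To make that concrete, here is a minimal sketch of such a merge in Python with pandas. All of the column names and figures below are invented for illustration; they do not come from any real arrest or demographic dataset.

```python
import pandas as pd

# Hypothetical arrest records, one row per arrest, keyed by ZIP code
arrests = pd.DataFrame({
    "zip": ["19104", "19104", "19139"],
    "charge": ["theft", "assault", "theft"],
})

# Hypothetical demographic data, one row per ZIP code
demographics = pd.DataFrame({
    "zip": ["19104", "19139"],
    "median_income": [24000, 31000],
})

# Count arrests per ZIP, then merge on the shared identifier
arrest_counts = arrests.groupby("zip").size().reset_index(name="arrests")
merged = arrest_counts.merge(demographics, on="zip", how="left")
print(merged)
```

The `how="left"` argument keeps every ZIP code that appears in the arrest counts, even when the demographic table has no matching row, which makes gaps in the match visible rather than silently dropping observations.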

Sometimes, two datasets will lack a common identifier but contain enough information to allow us to aggregate the data to reach correspondence. For example, we might have one dataset that contains county-level data and another that contains state-level data. Since counties exist within state borders, we can aggregate the county-level data and create a common identifier (state).
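That aggregation step can be sketched the same way. Again, every name and number here is invented; the point is only the pattern of rolling county-level rows up to the state level so a shared key exists.

```python
import pandas as pd

# Hypothetical county-level data that records each county's state
counties = pd.DataFrame({
    "state": ["PA", "PA", "NJ"],
    "county": ["Philadelphia", "Allegheny", "Camden"],
    "population": [1_600_000, 1_250_000, 510_000],
})

# Hypothetical state-level data
states = pd.DataFrame({
    "state": ["PA", "NJ"],
    "unemployment_rate": [4.2, 4.8],
})

# Aggregate the county data up to the state level to create a common key
state_totals = counties.groupby("state", as_index=False)["population"].sum()
merged = state_totals.merge(states, on="state")
print(merged)
```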

However, there are times when you’ll come across datasets that include similar identifiers that are nevertheless incompatible. For example, you may come across one dataset with ZIP codes and another with county names. While both are geographic boundaries, a single ZIP code can span two different counties. Thus, only a rough approximation is possible, and that approximation will require a fair bit of work (and manual recoding of the data).
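One common workaround is a crosswalk table that records what share of each ZIP code falls within each county, so that a ZIP-level count can be apportioned across counties. The sketch below is a toy version of that idea; the crosswalk weights, place names, and loan counts are entirely invented.

```python
import pandas as pd

# Hypothetical ZIP-level counts (e.g., payday lending locations or loans)
zips = pd.DataFrame({
    "zip": ["19104", "08002"],
    "loans": [40, 10],
})

# Hypothetical ZIP-to-county crosswalk; 'share' is the (invented) fraction
# of each ZIP code's addresses that fall inside each county
crosswalk = pd.DataFrame({
    "zip": ["19104", "08002", "08002"],
    "county": ["Philadelphia", "Camden", "Burlington"],
    "share": [1.0, 0.7, 0.3],
})

# Apportion each ZIP's count across counties by the weight, then sum
merged = zips.merge(crosswalk, on="zip")
merged["loans_alloc"] = merged["loans"] * merged["share"]
by_county = merged.groupby("county", as_index=False)["loans_alloc"].sum()
print(by_county)
```

The result is an estimate, not a true county count, which is exactly the kind of approximation (and its limits) worth disclosing in the resulting story.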

Moreover, because your two datasets of interest will often have been created by different parties, there is a good chance they will not include a common identifier. In those instances, you will be unable to merge the datasets, but you can still incorporate them into a story in a complementary manner. Put another way, you can be creative about highlighting information from both datasets and explaining what the potential connections between them might be, even if it is impossible to link them directly.

Example: BBC’s Cost of Commuting

Journalists can get quite creative in how they combine data. Here’s a simple example from the BBC, where the journalist was interested in measuring the cost of commuting for full-time workers in Great Britain.

There was no single dataset that included that cost. However, the journalist was able to use data from the Campaign for Better Transport, an advocacy group, to identify the most commonly used commuter services in six major cities. They then manually obtained the cost of a season ticket for each of those routes using a calculator from the National Rail website. The journalist then combined that information with data from the UK’s Office for National Statistics covering the median earnings by place of work. Finally, the journalist calculated the distance between the origin and destination for each of those routes using a service from Google, and used the resulting information to calculate a cost per mile traveled.
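Once those pieces are assembled, the arithmetic at the end of the process is straightforward. Here is a toy sketch of that final step; every figure below is invented and none of them are the BBC’s actual numbers.

```python
# Toy sketch of the final calculation; all figures are invented
annual_season_ticket = 3000.0   # hypothetical ticket cost for one route, in GBP
median_net_pay = 21600.0        # hypothetical annual net pay at the destination
one_way_miles = 25.0            # hypothetical route distance
working_days = 232              # hypothetical commuting days per year

annual_miles = one_way_miles * 2 * working_days
cost_per_mile = annual_season_ticket / annual_miles
share_of_pay = annual_season_ticket / median_net_pay

print(round(cost_per_mile, 2), round(share_of_pay, 2))
```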

The result is an interesting, if short, investigation that sheds light on the high cost of getting to and from work for many Britons. And therein lies the story: The cost of commuting accounts for one-tenth of the average Briton’s net pay, and those costs vary considerably depending on where a Briton lives.

But the only way the journalist could fully report that story was to combine information from multiple data sources. The BBC has kindly put the resulting dataset online, allowing you to get a sense of how they combined the data (adding variables) along the way.

To be clear, getting information from multiple datasets is not a prerequisite for producing a good data story. However, it can certainly lead to a far more interesting one. No matter how rich your dataset is, ask yourself whether there are opportunities to combine it with another dataset to give you even more material for producing an informative and interesting story.