Evaluating Data Sources

Introduction

Because data journalists so often rely on data from third parties, they need to take care to ensure the data are coming from a reliable source. After all, there are many unscrupulous people out there who want to take advantage of the mythology around data, using shoddy methods or the selective release of information to advance their own agendas. Additionally, even well-meaning people may be poor custodians of data: they may compute values incorrectly, merge datasets improperly, or write a scraper that is riddled with errors.

Regardless, if a data journalist is not careful, they will end up repeating (and perhaps reinforcing) those mistakes. Not only is it embarrassing to let someone take advantage of you as a data journalist, but it is also dangerous, since journalism helps inform public opinion.

Thus, journalists learn early on that they should approach every claim from a source (or dataset) with skepticism. As the old newsroom saying goes, “If your mother says she loves you, check it out.” (Although you will want to be careful with that one because it could lead to a lifetime of emotional trauma.)

Dimensions for Evaluating Data Sources

When you come across a dataset that looks interesting, you will want to evaluate its source and then perform some basic sanity checks to make sure it is trustworthy.

A good way to quickly perform that assessment is to evaluate the data and its source across five dimensions: verifiability, internal consistency, recency, ambiguity, and reputation.

To help you remember these five dimensions, you can use the acronym VIRAR. That’s Portuguese for spinning, or turning around. If you didn’t know that, now you do!

Verifiability

Let’s start with verifiability. When you come across a dataset, you will want to ask yourself: Can I find the methodology for how these data were collected?

For example, if you come across a survey that does not reveal details like how many people were contacted, how many of those people responded, or the approach the researchers used to identify those people, then you will want to be cautious about using those data.

Alternatively, let’s say a dataset aggregates data from multiple sources of information. Is it clear where those data came from? Is there sufficient information to enable you to access the original data yourself to verify that the data were merged correctly (or that the computed values are correct)?
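When a source does document its inputs, that check can often be done mechanically. Below is a minimal sketch of re-deriving a published computed value from the original sources, assuming you can obtain them; the file names, the county_id merge key, and the rate calculation are all hypothetical stand-ins for whatever your dataset actually contains.

```python
# A sketch of verifying a merged dataset against its original sources.
# All file names, columns, and the merge key are hypothetical.
import pandas as pd

published = pd.read_csv("aggregated_release.csv")  # the dataset as released
source_a = pd.read_csv("source_a.csv")             # first original source
source_b = pd.read_csv("source_b.csv")             # second original source

# Redo the merge yourself on the documented key...
rebuilt = source_a.merge(source_b, on="county_id", how="left")

# ...recompute the derived column the way the documentation describes...
rebuilt["rate"] = rebuilt["cases"] / rebuilt["population"]

# ...then line the two versions up and count any disagreements.
check = rebuilt.merge(published, on="county_id", suffixes=("_rebuilt", "_published"))
mismatches = (check["rate_rebuilt"] - check["rate_published"]).abs() > 1e-9
print(mismatches.sum(), "rows disagree")
```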

In general, well-documented data sources tend to be those generated (and checked) with the most rigor, so thorough documentation is a very positive indicator. (However, it is just that: an indicator. Don’t let the allure of technical language and equations substitute for passing the eye test.)

Internal Consistency

When evaluating internal consistency, you will want to take a look at a copy of the dataset and ask yourself: Are there any clear errors in the data?

For example, are there repeated variations of values within a single category? If we are interested in the variable “gender” and have a category for “female,” do we have instances in the dataset where the value is the string “female” some of the time, “fe male” at other times, and “fmale” on several occasions?

Similarly, do you see a lot of missing values in the data that are not explained anywhere? Or, do you see labels and codes in the data that don’t seem to correspond to any of the documentation for that data?

Finally, are there any unexplained outliers in the dataset? For example, if one of your variables is class size and most classes range from 8 to 60, do you see a few classes in the hundreds? Data entry errors are not uncommon with large datasets — especially if they had to be digitized from handwritten forms and materials — but lots of errors should give you pause.
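These checks are straightforward to automate. Here is a minimal sketch in Python using pandas; the file name and the column names (gender and class_size) are hypothetical stand-ins, so adapt them to whatever your dataset actually contains.

```python
# Basic internal-consistency checks with pandas. The file name and
# column names are hypothetical stand-ins.
import pandas as pd

df = pd.read_csv("classes.csv")

# 1. Variant spellings within a single category: inconsistent values
#    like "female", "fe male", and "fmale" each appear as separate rows.
print(df["gender"].value_counts(dropna=False))

# 2. Unexplained missing values: count the empty cells in every column.
print(df.isna().sum())

# 3. Unexplained outliers: if most classes range from 8 to 60, flag
#    anything outside that range for manual review.
print(df[(df["class_size"] < 8) | (df["class_size"] > 60)])
```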

Generally, these kinds of issues are signs of sloppiness, which a data journalist should treat as a red flag. However, again, don’t rely solely on this indicator. It is possible that an archivist intentionally opted not to clean any of the data so that it exactly replicates what was in the original database (or the handwritten documents that were subsequently digitized). As such, the data source may be trustworthy but its data may simply require some cleaning before they can be properly analyzed.

Recency

Recency is most relevant when you are trying to use data to illustrate something about the current moment, which is a common feature of many journalistic stories. With that dimension, you will want to ask yourself: Are the data recent enough to be a good approximation of the present?

After all, old data may no longer reflect reality.

Consider political polls during elections. In late June 2015, Jeb Bush led polling for the Republican nomination in the 2016 presidential election, and neither Ted Cruz nor Donald Trump was among the top six contenders. Within a month, Trump had become the clear front-runner. Eight years earlier, Hillary Clinton held a 27-point lead in national polls over Barack Obama, who became the eventual nominee.

While political races are especially volatile, public opinion on other issues can change quickly as well. If you had used poll data on same-sex marriage from mid-2009, you would have found that just 40 percent of Americans believed those marriages should be valid. Just two years later, 53 percent held that belief. As of 2019, 61 percent of Americans supported same-sex marriage.

Some datasets, especially those produced by private sources, may not become public until an embargo period has passed, or until the data have been analyzed. This can take months or longer. Put another way, you may find that a new public release of data actually includes information that is much older. As such, you will want to be sure you can tell when the data were collected, and how you should contextualize them when speaking about the present.

Ambiguity

The dimension of ambiguity is tied closely to that of verifiability. Here, you will want to ask yourself: Is it clear how the source computed certain variables or reached a particular conclusion?

A good data source should be able to offer you most if not all of the information you need to understand key aspects of the data. For example, if you are drawing on a dataset about people’s media use habits, you should be able to read the exact question respondents answered (if it was based on a survey) or have clear information about how device use was captured (if it was based on software analytics).

This is important because researchers have found that even small changes in question phrasing and terminology can yield very different results, as can software design decisions for capturing the same information (e.g., time spent with an app). If an organization does not make that information available upon request, you should be careful with how you use the data — or even question whether you should use it at all.

Additionally, it is common for organizations to compute new variables based on some data they collected, but only release the computed variables. For example, an organization might aggregate several states into the “South” region. You should be able to find out exactly which states comprise that “South” region.
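If the source does disclose its definition and you can get the underlying state-level data, recomputing the aggregate is a quick check. The sketch below uses an illustrative (not authoritative) list of states and made-up respondent counts.

```python
# Recomputing a released "South" total from state-level data.
# The state list and numbers are illustrative, not a real definition.
import pandas as pd

SOUTH = {"TX", "FL", "GA", "NC", "VA", "TN", "AL", "SC",
         "LA", "KY", "MS", "AR", "WV"}  # hypothetical definition

states = pd.DataFrame({
    "state": ["TX", "FL", "NY", "GA"],
    "respondents": [410, 380, 520, 190],
})

# Sum the states the definition covers and compare against the
# figure the organization released for its "South" region.
south_total = states.loc[states["state"].isin(SOUTH), "respondents"].sum()
print(south_total)  # 980 in this toy example
```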

Finally, researchers may “weight” the data in some way to account for deficiencies. For example, if they find they under-sampled (did not include enough) young Latino males, the researchers might weight the responses from that group more heavily to better approximate the general population’s average. Ideally, the dataset you have access to will also contain the unweighted data. If it doesn’t, just be sure it’s clear how they weighted the data (and that the weighting procedure makes sense).
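To see why the weighting procedure matters, here is a toy illustration with made-up numbers: a single under-sampled group gets a weight of 2.0, and the estimate shifts accordingly.

```python
# A toy illustration of survey weighting; all numbers are made up.
# Suppose one group is 6% of the population but only 3% of respondents,
# so its answers receive a weight of 0.06 / 0.03 = 2.0.
import numpy as np

responses = np.array([1, 0, 1, 1, 0, 1])  # e.g., 1 = holds some opinion
weights = np.array([2.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # first respondent is in the under-sampled group

print(round(responses.mean(), 3))                        # unweighted: 0.667
print(round(np.average(responses, weights=weights), 3))  # weighted: 0.714
```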

Reputation

The last dimension is reputation, and it is also probably the one a journalist is most likely to rely on, especially if they are on a tight deadline. Here, you will want to ask yourself: Is the organization that collected these data reputable?

Just like you would for any human source, you will want to do some basic research beforehand to ascertain the potential motivations and expertise of the organization. It is helpful to start with a simple web search for that organization to get a sense of how often it is mentioned in news reports (and elsewhere), as well as the kinds of criticisms that have been lodged against it.

There are many organizations that regularly put out high-quality information, from non-profit organizations like the Pew Research Center to government agencies like the Bureau of Labor Statistics. Reputable organizations will usually have earned trust over time and through demonstrated rigor, and it is helpful to keep your own diary/database of the reputable organizations that operate within your area(s) of interest/reporting beat.

If you cannot find much information about that organization, you will certainly want to be on alert. If you find an organization is known for engaging in some kind of advocacy, you will want to take a close look at their data collection methods and how they may have impacted the data. However, don’t simply exclude a data source just because it has not yet built up its reputation — we often see open data-minded individuals and groups release very useful information on platforms like GitHub.

Data Sources With Issues

It is important to note that each of these dimensions should be considered as part of a whole. Put another way, don’t rely on a single indicator when assessing the trustworthiness of a data source or dataset. Instead, the combination of all five dimensions will yield a far more holistic and helpful picture.

For example, just because a data source has a particular motivation or partisan affiliation does not mean that you should not use it. Again, let me underscore that: Just because an organization has an interest does not mean its data are untrustworthy. Sometimes, highly motivated sources employ sound data-gathering strategies (i.e., the data are valid and reliable) but simply offer colored interpretations of their data analysis. Similarly, just because you find some data entry errors does not mean the data are worthless.

All this means is that you will want to be extra diligent in ensuring that the data that a source collected are likely to offer a good approximation of the “truth” you hope to get at with your reporting. If the data and source are clearly problematic, don’t use the data. But if you simply have some reservations, don’t immediately discount the dataset. Instead, evaluate it closely and then contextualize the limitations in your story.