Finding Data Online

Introduction

Data journalists rarely work with data they have collected themselves. Rather, the data usually come from third parties, like government agencies and nonprofits.

But how do they find those data?

Data can be obtained from two main types of sources: private sources and public sources. It is important to separate these two because they require different considerations in terms of how to access the data.

Private Data Sources

Private data sources typically comprise commercial companies and interest groups, and they produce large amounts of data. Sometimes, the data are released for free. However, private sources will often try to monetize the data in different ways, or to publish them selectively to promote a particular interest.

Examples of Private Data Sources

For example, Yahoo! released a massive dataset in 2017 that covered how 20 million of its users interacted with news items on its website over the course of a month. This is far more data than most journalists would ever want to work with, but it presented an opportunity for enterprising media critics to dig deep into how people consume news, at least on Yahoo!

On a much smaller scale, the Pew Research Center’s Project for Excellence in Journalism typically releases the data from the surveys it conducts on how people consume news and what news content they consume. These data are usually released after an embargo period, meaning that the organization will put out some analytic reports before it makes the underlying data public. Nevertheless, an enterprising data journalist could use those data to explore how news reading habits have been changing over the past few years across different groups of people.

A third example comes by way of the NFL (or really just about any major sports league). Leagues typically make an array of statistics available online, and sometimes for download. These range from team-level statistics, like the number of games the New England Patriots won last season, to player-level statistics, like the number of yards per carry former Minnesota Golden Gopher running back Laurence Maroney had in 2008.

There are also some websites that have popped up to make data from private organizations more accessible. For example, Sports Reference has several sub-sites that allow users to easily export data for sports like football and hockey. This makes it much easier to access data than the websites of most sports leagues, whose data must often be scraped off the website by a computer program.

Caveats About Private Data Sources

Private sources are not legally required to release any data. While they sometimes release data as an act of generosity, they typically do so to either fulfill some mission or to advance a particular agenda — or, if they are monetizing those data, simply to generate revenue.

For example, one nonpartisan interest group may have a mission to inform the public about the issues, attitudes, and trends shaping America and the world. That group’s release of public opinion data may simply be a part of that nonpartisan mission. In contrast, an organization affiliated with a political group may have an agenda it seeks to promote, like opening up more areas of federal land to drilling or mandating stricter air quality requirements. Such groups may bypass scientific rigor in order to get a desired result, or simply choose to selectively release the data that help them make a point.

Just because data are coming from a private source does not mean that they are necessarily problematic, or that they should be dismissed. However, they should be considered more carefully if there’s any hint that such data are being used to promote a particular agenda.

Public Data Sources

Public data sources comprise public bodies, such as federal, state, and local government agencies. They also include intergovernmental bodies like the United Nations. These are major producers of data, and they operate under a different legal regime because of their mandates to inform democratic decision-making. As a result, their data tend to be both highly useful and far more accessible to journalists.

Examples of Public Data Sources

For example, the town of Amherst collects and disseminates information pertaining to noise complaints. A journalist could therefore look up where and when every recorded noise complaint took place. As a service to the community, Amherst maps these data. While the town does not make those data available for download on its website, a journalist may request them from the public body.

Another example comes by way of Massachusetts’ Department of Elementary and Secondary Education. That department has information about the pay of all teachers working for a state-funded school, and makes readily available aggregate statistics by district for comparison. A journalist could couple those data with educational performance indicators to evaluate the relationship between teacher pay and student achievement in the state.

A third example comes from the Department of Homeland Security’s Yearbook of Immigration Statistics, which collects annual data on people coming in and out of the United States. This can serve as a useful data source for assessing the number of people coming to the United States each year, and from where. Put another way, that data source can help bring some important context to the conversations about U.S. immigration policy that seem to come up regularly in the news.

Caveats About Public Data Sources

In contrast to private data sources, public data sources in the United States are often legally required to provide public access to the data they collect. In many jurisdictions (e.g., Massachusetts and the federal United States), there is a presumption of openness — that is, data are presumed to be public and should be made easily accessible. The logic behind this is that greater transparency should make corruption more difficult, and that a better-informed populace can demand better actions from their elected leaders.

To be clear, there are limits to that openness. For example, private personnel information (like the health records of public employees) and information that may endanger national security are typically excluded from public access because the potential harm is believed to exceed the potential benefit in most cases. Nevertheless, such exclusions can be challenged in court — something that cannot be done when it comes to data from private sources.

Finding the Data

Although the internet is home to a substantial amount of data from both private and public sources, it is not always easy to come by a dataset of potential interest.

Public data tend to be the easiest ones to find precisely because they are so often used (and thus linked to from many places). A good starting point is to look for a data portal, which is typically an aggregator or directory of public data. There are data portals at the federal level, and some data portals at the state level.

For example, Data.gov is a good resource for identifying potential datasets at the federal level. It lets you start with a topic or agency, and provides a list of several popular government datasets. You will often be given direct links to multiple file formats for a desired dataset, as well as to documentation that describes the information contained in the dataset. Similarly, the Massachusetts Document Repository is an initiative by the Commonwealth of Massachusetts that tries to connect citizens with information (including datasets) generated by public bodies in that state. Like Data.gov, the repository allows you to search by state agency or by topic, and links directly to the datasets.

In fact, there is even a data portal for data portals, which can be a useful guide for quickly finding relevant data portals from around the world. Indeed, international data portals — like those hosted by the United Nations and the World Bank — can be very useful to offer international points of comparison.

It is important to note that while those data portals are great starting points, they are not a complete catalog of the public information that is out there. Some information — especially datasets that are accessed infrequently — may exist on an agency’s website but not appear on any catalog. Moreover, the information in those data portals is often out of date. For example, they may list reports from two years ago despite the fact that new data are available. As such, these portals are most useful for helping you identify the bodies (i.e., agencies) that collect certain kinds of information — and you should then visit the websites of those bodies in order to seek out the most up-to-date information (or another dataset that may not have made it onto the portal).

Data-Seeking Techniques

Because data from private sources are so diffuse, it is not surprising that there isn’t a comparable portal for finding them. Instead, journalists have to rely on traditional reporting techniques to discover information from those sources.

Social media, message boards, and the like are great places to ask questions about where to find data. It is especially fruitful if you can find out where enthusiasts and people interested in your topic like to hang out. For example, a forum on BoardGameGeek would probably be a helpful place to ask for data on the popularity of certain types of board games. On Twitter, the #ddj hashtag helps to organize conversations around data journalism.

Mailing lists geared at data journalists can also be a good option. For example, NICAR (National Institute for Computer-Assisted Reporting) has a mailing list for its members to share tips and data sources. This can be especially useful if a fellow journalist has already gone through the process of making a formal public data request and can thus share those data directly with you (potentially saving you months of time waiting for an agency to fulfill the request). Similarly, you could sign up for e-mail newsletters led by data journalists, such as the weekly Data Is Plural newsletter that includes five interesting or relevant datasets in light of recent news.

You could also seek out an expert in the topical area you are reporting on. For example, a political scientist studying campaign messaging may be able to point you to multiple datasets pertaining to political ad spending — including data they have collected themselves. You could also seek out news reports and specialty websites that try to illustrate issues through data (e.g., Our World in Data) and then look for the data sources they list. Even if the original content is older, it is possible that the source provides regularly updated data.

Finally, a very handy skill to work on is your Google-Fu, or the art of using the advanced features of powerful online search engines. Search engines like Google and DuckDuckGo have an advanced mode that allows you to use operators to create complex search queries. They also allow you to restrict your searches to certain filetypes, like PDF documents or Excel spreadsheets. One of their most useful features is the ability to restrict searches to a particular domain name or top-level domain, so you are only searching links from a particular website (or kind of website).

As an example, I can use my Google-Fu to quickly track down the most recent data on student tuition charges at UMass. I can use DuckDuckGo to construct a query that reads site:umass.edu student charges filetype:pdf. This tells DuckDuckGo to only show me results from the UMass domain (site:umass.edu) that contain the keywords student charges and are PDF documents (filetype:pdf).
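The same query can also be assembled programmatically, which can be handy when you want to run one search pattern across several websites or filetypes. The following is only a minimal sketch in Python: the helper names build_query and search_url are hypothetical, and the code assumes that DuckDuckGo accepts the query text in a q parameter on its home page URL. It only constructs the query and the link; it does not contact the search engine.

```python
from urllib.parse import urlencode


def build_query(keywords, site=None, filetype=None):
    """Combine keywords with optional site: and filetype: operators.

    Hypothetical helper for illustration; it only builds a string.
    """
    parts = []
    if site:
        parts.append(f"site:{site}")          # restrict results to one domain
    parts.append(keywords)                     # plain keywords
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict results to one file type
    return " ".join(parts)


def search_url(query, base="https://duckduckgo.com/"):
    """Return a link that opens the query in a browser.

    Assumes the search engine reads the query from a `q` URL parameter.
    """
    return base + "?" + urlencode({"q": query})


query = build_query("student charges", site="umass.edu", filetype="pdf")
print(query)
print(search_url(query))
```

Swapping in a different site value (for example, a top-level domain such as gov) or a different filetype (such as xlsx) reuses the same pattern for other searches.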

Google has also created a tool for finding datasets that is aptly titled Dataset Search. It applies Google’s impressive web spidering technologies to a data-focused application, and produces an interface that is fairly accessible and frequently updated via its algorithms. The downside of this tool is that it is not curated, and can thus overwhelm you with information.

If you don’t know what to search for, use general terms to find an organization that might collect data on your topic of choice. Once you have found that organization, either look around its website for sections with labels like “data” or conduct a search for that term that is restricted to the organization’s domain.

If you can’t find that section, you may want to just call someone at that organization and ask if they can point you in the right direction. If you’re not sure who to call, try to look up someone with the title of “public information officer” or “public records custodian,” or someone who appears to be in charge of records. Failing that, look for a media contact, or someone in a department whose work seems close to the information you’re interested in.

Concluding Thoughts

Keep in mind that all of these techniques are only helpful if the data have been made available online, and a lot of data are collected but never posted online. If you can’t find the data online, you may need to make an open records request to a public body or ask a private source for the information and hope for their goodwill.

The key is to not give up just because you can’t quickly find the data online. They might be out there, but not in an easy-to-find place.