There are two main types of data sources that data journalists should be aware of: primary and secondary.
You can think of primary data sources as those that originated the data. Primary sources are thus the ultimate source of any recorded observation. They are also typically the ones able to offer data as close to their original (or “raw”) form as possible — that is, prior to any aggregation or modification. Primary data may include all of the responses and variables from a survey of depression among college students, or the observations from water quality sensors strategically placed around a region.
You can think of secondary data sources as those that have been aggregated or otherwise modified by some third party (not the originators of the data). This may include, for example, a single dataset produced by a non-profit environmental group that combines information from multiple government datasets. It may also include a “cleaned” version of a dataset produced by someone else.
Data journalists will often begin by looking for the original (“raw”) data and obtaining it from the primary source, as this strategy offers several advantages.
By working with the original data, you can dig into all potential aspects of those data and not be restricted to what a third party believes is most interesting (e.g., their decision to aggregate data to the state level). Like journalists, the authors of studies, white papers, and reports can only discuss some aspects of the data due to time and space limitations, and they might not focus on the relationships that might be most interesting to the citizens in your community. Similarly, some agencies will choose to only post a portion of their data online in order to (a) reduce the size of the dataset that they have to host and (b) make those data more manageable for ‘regular’ citizens. However, they may be able to provide the complete dataset upon request.
Additionally, by using a primary data source, you avoid inheriting a third party’s analytical mistakes. Put another way, you can run your own calculations on the primary data and thus have greater confidence that everything is in order, from the quality of the data in the dataset to the way a mathematical operation was executed.
There are also certain advantages to working with secondary data that can make them an alluring option for data journalists.
Doing your own data analysis might be really time-consuming, so a data journalist may want to let someone else do that heavy lifting. Additionally, the third party may be better at analyzing data or have certain subject expertise that a data journalist lacks. For example, a third party may be more capable at detecting and correcting issues in the original data and combining it with outside data to produce a more robust and accurate dataset. In those cases, it might be perfectly sensible to rely on secondary data sources to produce accurate and insightful journalism.
A data journalist simply must be certain that they trust the source of the secondary data and recognize that this option may limit their ability to find stories in data and tell them through data. Put another way, a lot of useful information can get lost through the filter of a third party.
Data journalists can also collect data themselves, or rely on a third party to collect the data on their behalf. Indeed, news organizations have won awards for their original data collection efforts, as with the Washington Post’s database on fatal shootings involving police.
Collecting one’s own data has several advantages. The main advantage is that the journalist can choose exactly which variables to collect data for and from whom to collect them. Data journalists will often find that the data they are most interested in have not yet been collected by anyone else. Put another way, the data will not exist unless the journalist collects it themselves. Additionally, data journalists can get the most recent data possible if they do their own data collection. If they rely on existing data, journalists may have to work with information that is months if not years old.
Media organizations may also wish to collect their own data but delegate the task to a trusted third party that is experienced in collecting data. This is often the case with polling data, but it can also involve other tasks. Letting a third party take charge of data collection can yield better data since research consultants tend to have the skills and background necessary to increase the likelihood the data will be representative and robust.
Journalists will generally rely on information collected by some third party, like the U.S. Census Bureau, a sports league, an interest group, or a polling firm. That’s because journalists don’t usually have the time, money, or expertise to collect their own data. Indeed, such data are sometimes free and highly useful, as with much of the data collected by local, state, and federal governments. Put another way, different types of data sources can all be useful for producing data journalism — and much of the data journalism we see today is the result of observations collected by third parties.
Additionally, it is worth noting that it is not the type of data source that differentiates data journalism from journalism with statistics. Instead, the differentiation comes from what the journalist does with the data. Data journalism involves an original analysis of a dataset — whether collected by the journalist, provided in a “raw” form by a first party, or even aggregated to some extent by a third party — while journalism with statistics involves picking out interesting elements from some third party’s data analysis (e.g., a striking quantitative finding from their report).