What is ‘data journalism’?
There is no single, universal definition of “data journalism.” However, one way to conceptualize it is as a process for telling journalistic stories through numbers or finding journalistic stories in them, in the service of and with a commitment to truth.
That process usually produces a journalistic product (e.g., a news story or a video) that has a central thesis (or purpose) that is primarily attributed to (or fleshed out by) quantified information; involves at least some original data analysis by the item’s author(s); and, typically, includes some visual representation of data.
In short, data journalism can involve either digging deep into numbers to find phenomena that journalists might have otherwise missed or using quantified information to tell a news story. (The best data journalism involves both.) It distinguishes itself as a form of journalism by emphasizing quantification and a rigorous, systematic, and analytic approach to the process of doing journalism.
Even though ‘data journalism’ has become a buzzword in recent years, it is not a brand-new practice. Scholars have traced the origins of data journalism in the U.S. to the late stages of the so-called “Progressive Era” at the turn of the 20th century. During that period, there was an emphasis on using social science to address different social ills, and early examples of journalism informed by the best available statistics began to appear.
This emphasis on using social science in journalism emerged again in the 1950s and 1960s, when journalism reformers began turning toward the fields of political science and sociology for inspiration on how to ‘measure’ reality — and thus produce a better journalistic representation of it.
The clearest roots of contemporary data journalism can be traced to the idea of ‘precision journalism,’ which was formalized in a 1973 book by Philip Meyer. True to its moment, precision journalism emphasized the use of social scientific methods in journalism. It rejected the typical journalistic practice of just asking a question of the first few people a journalist might encounter. Instead, it advocated for rigorously using systematic approaches to quantify phenomena into data and make assertions and contentions based on the evidence collected.
Precision journalism was thus presented as an alternative way of doing journalism, one that could bring journalism closer to the so-called “truth” by reducing different kinds of measurement bias that are common in ‘typical’ journalism, such as the tendency to generalize from non-representative sources and incidents.
Although data journalism is often presented as a more trustworthy form of journalism, it is important to approach it with a healthy degree of skepticism — as you should for any form of journalism. Like all journalism (and any human recounting of events), data journalism by its very nature introduces varying degrees of interpretation. This makes it biased, like all other forms of journalism.
To put it in simple terms, data journalism is inherently biased. Those biases can originate in at least two crucial stages.
The first stage is the gathering of data. Regardless of whether a journalist collects their own data for a story or relies on data collected by a third party (e.g., a pollster), a human being still had to come up with the instrument and process for collecting those data. For example, a pollster still has to select a sampling frame and write the questions used in the poll, and those decisions will invariably be shaped by their (and others’) biases. Even data collected by a non-human, such as a motion sensor, could be biased because an engineer still has to develop a method for capturing data, and that method is likely imperfect under real-world conditions. (In this case, bias may arise from disproportionate inaccuracy under specific conditions, such as difficulties in detecting someone’s presence if they have darker-colored skin.)
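To make the sampling-frame point concrete, here is a minimal sketch in Python. Every number in it is invented for illustration, and the group names (`landline`, `mobile_only`) and the helper `support_share` are hypothetical. It assumes a pollster whose sampling frame reaches only households with landlines and compares what that frame would show against the full population.

```python
# Hypothetical population, invented purely for illustration.
# "supporters" is the number of people in each group who back a policy.
population = [
    {"group": "landline", "size": 600, "supporters": 420},
    {"group": "mobile_only", "size": 400, "supporters": 120},
]


def support_share(groups):
    """Share of people who support the policy across the given groups."""
    total_people = sum(g["size"] for g in groups)
    total_supporters = sum(g["supporters"] for g in groups)
    return total_supporters / total_people


# What we would find if we could ask everyone.
true_share = support_share(population)

# What a landline-only sampling frame can ever reach.
landline_frame = [g for g in population if g["group"] == "landline"]
frame_share = support_share(landline_frame)

print(f"Support in the full population: {true_share:.0%}")
print(f"Support within the landline-only frame: {frame_share:.0%}")
```

With these made-up figures, support is 54% in the full population but 70% within the landline-only frame. No one falsified anything. The gap comes entirely from the decision about whom the poll could reach.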
The second stage is the interpretation of the data. Journalism, like any form of human communication, necessarily involves some degree of interpretation. The most common form of interpretation in journalism is in the selection of what is newsworthy. Put another way, even a highly descriptive example of data journalism will likely require the selection of some information over others — such as selecting which date range is most newsworthy when reporting figures, which statistics to use to describe the data, and which elements of the data to emphasize. Moreover, the best journalism goes beyond description. It helps audiences make sense of the world. That sense-making process is inherently rife with interpretation.
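A short sketch in Python can illustrate how much these choices matter. The monthly counts below are invented for illustration, and the helper `percent_change` is hypothetical. The sketch uses only Python's built-in `statistics` module to show how the same small dataset supports different headlines depending on which statistic and which date range a journalist selects.

```python
from statistics import mean, median

# Hypothetical monthly incident counts, invented for illustration only.
monthly_counts = {
    "2023-01": 40,
    "2023-02": 42,
    "2023-03": 90,  # a one-month spike
    "2023-04": 44,
    "2023-05": 41,
    "2023-06": 43,
}


def percent_change(start_month, end_month, counts=monthly_counts):
    """Percent change in the count between two months."""
    start, end = counts[start_month], counts[end_month]
    return (end - start) / start * 100


values = list(monthly_counts.values())

# Choice 1: which statistic describes a "typical" month?
typical_by_mean = mean(values)      # pulled upward by the spike
typical_by_median = median(values)  # largely unaffected by the spike

# Choice 2: which date range is reported?
jan_to_mar = percent_change("2023-01", "2023-03")
mar_to_jun = percent_change("2023-03", "2023-06")
jan_to_jun = percent_change("2023-01", "2023-06")

print(f"Mean: {typical_by_mean}, median: {typical_by_median}")
print(f"Jan to Mar: {jan_to_mar:+.1f}%")
print(f"Mar to Jun: {mar_to_jun:+.1f}%")
print(f"Jan to Jun: {jan_to_jun:+.1f}%")
```

In this made-up example, the mean (50) and the median (42.5) give different impressions of a typical month. The January-to-March window shows a 125% increase, the March-to-June window shows a sharp decrease, and the full January-to-June window shows a modest 7.5% increase. Each figure is arithmetically correct, and choosing among them is an act of interpretation.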
All of this is not to say that data journalism is bad or untrustworthy. On the contrary, data journalism is a highly useful addition to journalism and does excel in certain areas. However, it is critical that you do not buy into the mythology of quantification and assume that something must be true because there are ‘hard numbers’ attached to it. In fact, the fetishization of quantification has resulted in some bad actors promoting disinformation by attaching it to falsified quantifications, junk science, and egregious misuse of data.
Good data journalism does tend to involve a greater degree of transparency than ‘typical’ journalism. Put another way, it is typically easier for data journalists to ‘show their work.’ Often, though not always, data journalists can put the entire dataset they used, or at least the most relevant portions of it, online for readers, viewers, and listeners to scrutinize. Even better, the data might be visualized in some interactive fashion that allows audiences to explore them and reach their own conclusions, should they want to. Ideally, all of the data would also be documented in some fashion, with clear descriptions of what is represented in the dataset and how observations were collected and quantified.
Transparency is also particularly important in the current moment. First, trust in journalism (as an institution) is at fairly low levels in many parts of the world, and certainly in the United States. In lieu of simply trusting journalists, audiences are now more likely to demand to ‘see the work’ before accepting an assertion. Second, news brands are less powerful than before because audiences are now quite likely to encounter journalism via news aggregators and social media. Audiences thus pay far more attention to the story than its originator. Consequently, the ‘trust reservoir’ some organizations have built via their brands over years is less salient today.
In both instances, transparency becomes a way of acknowledging the audience’s concerns and demonstrating that the journalist stands behind their work. The onus is thus on the journalist to show where a fact came from and what the basis for an assertion was. This, in turn, may generate greater trust in the journalist’s work.
Moreover, transparency can be useful for engaging with audiences by inviting them to review data and analyses, correct errors, and contribute ideas (or even additional data). Data journalists will sometimes include interactive tools to help audiences explore the data in ways that go far beyond the scope of the original article.
Even though variations of data journalism have been around for some time, several recent trends have made it particularly popular.
First, there is increasing pressure on local, state, and federal officials to be more transparent, and in particular to release more of the data collected by different public agencies. Consequently, those agencies have begun digitizing more of the documents they produce and the data they collect. This sometimes results in agencies posting relevant data on their websites, which gives journalists a fountain of source material they can use for a potential story.
Second, personal computers have become incredibly powerful and are now ubiquitous in newsrooms. This means that a journalist working on a typical laptop can crunch large amounts of data and look for interesting phenomena, which is something that required specialized hardware not long ago. Similarly, citizens have access to powerful computers themselves, from their own laptops to the phones in their pockets, which are now capable of rendering highly appealing and interactive data visualizations. Audiences love data visualizations for their aesthetic potential and interactivity, which has created substantial demand for data journalism.
Third, individuals with technical aptitude and statistical know-how have found more of a home in newsrooms. They are now seen as having the ability to be journalists in their own right, rather than simple technical support for ‘real’ journalists. As such, it is now more common to see well-resourced journalistic outlets have dedicated data journalism teams that include members with backgrounds in information technology, user-centered design, and computer science. Moreover, the software for collecting, analyzing, and visualizing data has been simplified and made more accessible. Consequently, journalists who lack a background in statistics or computer programming now have a much easier time producing data journalism than in the past.
Although I’ve been using the term ‘data journalism’ throughout this chapter, please keep in mind that this book is titled Data-Driven Storytelling. This emphasizes the role of stories, which help give life to data and get people to care about them. Throughout this book, you will be challenged to do more than just put words to numbers. Your goal will be to tell a compelling story through them.