One of the key skills journalists hone over time is the art (and science) of the interview. Interviews with human beings, like the person-on-the-street who witnessed a fire or a scientific expert on vaccines, are a staple of most journalistic stories. Interviewing takes practice, but it helps journalists unearth information they wouldn’t otherwise know and it allows them to challenge their own assumptions.
“Interviewing” data is similar in many ways to interviewing a human being, but it does add a few wrinkles to the process. It begins with thorough preparation, proceeds to generating thoughtful questions, and concludes with seeking out corroborating evidence.
The first step in preparing to interview your data is to understand the data’s ‘expertise.’ Put another way, what kinds of questions are your data able to answer?
In preparing for a human interview, you’d want to familiarize yourself with the interviewee’s biography and recent work. In this case, you’ll want to start by exploring the dataset’s supporting documentation, including its codebook and any methodological details that might be offered.
The codebook is useful for showing you what kind of information the dataset contains. For example, a dataset about annual state standardized high school test results (like the MCAS) might include variables like year, school, district, prop_white_nonhisp, avg_eng, avg_math, achievement_pct_eng, and achievement_pct_math. While the meaning of some of these variables (e.g., school and district) is pretty evident, since they likely refer to the name of the school and the school district, the others are far less clear. However, the codebook may specify that achievement_pct_math refers to the school’s achievement percentile in Math.
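One way to put a codebook to work is to compare it against the columns that actually appear in the data file. The sketch below is a minimal illustration in Python, not a prescribed workflow. The codebook entries are assumptions written for this example (only the description of achievement_pct_math comes from the discussion above), and the header row is made up.

```python
import csv
import io

# Hypothetical codebook: variable name -> description.
# Only the achievement_pct_math description comes from the text;
# the others are assumed meanings for illustration.
CODEBOOK = {
    "year": "Year of the test results (assumed)",
    "school": "Name of the school (likely)",
    "district": "Name of the school district (likely)",
    "prop_white_nonhisp": "Proportion of non-Hispanic white students (assumed)",
    "avg_eng": "Average English score (assumed)",
    "avg_math": "Average math score (assumed)",
    "achievement_pct_eng": "Achievement percentile in English (assumed)",
    "achievement_pct_math": "The school's achievement percentile in Math",
}


def compare_with_codebook(header, codebook):
    """Return (columns not in the codebook, codebook variables not in the file)."""
    undocumented = [name for name in header if name not in codebook]
    missing = [name for name in codebook if name not in header]
    return undocumented, missing


# Made-up header row standing in for the first line of a real CSV file.
sample_csv = "year,school,district,avg_eng,avg_math,achievement_pct_math,notes\n"
header = next(csv.reader(io.StringIO(sample_csv)))

undocumented, missing = compare_with_codebook(header, CODEBOOK)
print("In the file but not in the codebook:", undocumented)
print("In the codebook but not in the file:", missing)
```

Any column that shows up in the first list is a question for the data source, and any variable in the second list is something the documentation promises but the file does not deliver.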
While that additional detail is very helpful, we may still be unclear on what an “achievement percentile” actually covers, or how it was calculated. This is where a methodological supplement would prove helpful. Through that supplement, we may learn that the measure is a scaled score used to compare an individual school against all other public schools in the state.
Not all datasets come with a separate codebook or a detailed methodology, but many do. It is thus worth exploring the website of the data source to look for that supporting documentation. Put another way, don’t just stop looking once you’ve found the data file — get a sense of what other information about it is available, and be sure to keep a copy of those related documents.
When you feel you’ve done your background research for the data, you’ll want to make your introductions. This is a crucial point because your expectations of the data may not be matched by the reality of what they can offer. In fact, seasoned data journalists will often jump straight to the introductions, and only take a step back to prepare if they’re satisfied with the first impression.
The process begins by opening the dataset and glancing through it. Oftentimes, spreadsheet software works best for this initial step because it makes it easier to visually skim through data, quickly sort the values of different variables, and apply simple filters. For example, a quick glance can give you a sense of the different values for the year variable in that standardized test results dataset, which in turn gives you a better sense of the time span covered by those data.
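If you prefer code to a spreadsheet for this first glance, the same check on the year variable might look something like the sketch below. It assumes a CSV file with a year column, and the rows are made up purely for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; a real file would be opened with open().
sample_csv = """year,school,district,avg_math
2019,Adams High,District X,500
2019,Baker High,District Y,512
2020,Adams High,District X,503
2022,Adams High,District X,507
2022,Baker High,District Y,515
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many rows exist for each distinct year.
year_counts = Counter(int(row["year"]) for row in rows)

first_year, last_year = min(year_counts), max(year_counts)

# Years inside the overall span that have no rows at all.
gap_years = [y for y in range(first_year, last_year + 1) if y not in year_counts]

print("Rows per year:", dict(sorted(year_counts.items())))
print("Time span:", first_year, "to", last_year)
print("Years with no rows:", gap_years)
```

A gap in the middle of the span is worth a follow-up question to the data source before you report any trend across those years.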
This is also a good opportunity to look for obvious flaws in the data, such as blank rows, evidence of a lot of missing information, and maybe even unexpected information for observations you are familiar with (e.g., a much higher number of enrolled students for a local school). For example, you may want to apply a simple filter to that local school to ensure that you have information about it for all of the years covered by the dataset. You can also take a quick break from the dataset itself and use this time to search for news stories (or reports) that have made use of those data, paying attention to the caveats they reported. (These are indicative of shortcomings in the data that others have found through their own exploration, which is useful context for you.)
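The checks described above (blank rows, missing information, and coverage for a school you know well) can also be sketched in a few lines of Python. The rows and the school name "Adams High" are made up for illustration, and a real dataset may mark missing values differently (for example with "NA" or "-99"), so treat this as a starting point rather than a complete audit.

```python
import csv
import io

# Made-up rows for illustration only; a real file would be opened with open().
sample_csv = """year,school,district,avg_eng,avg_math
2020,Adams High,District X,498,503
2021,Adams High,District X,,507
,,,,
2020,Baker High,District Y,509,511
2021,Baker High,District Y,510,514
2022,Baker High,District Y,512,
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))


def is_blank(row):
    """True when every field in the row is empty or whitespace."""
    return not any((value or "").strip() for value in row.values())


# File line numbers of fully blank rows (the header is line 1).
blank_lines = [n for n, row in enumerate(rows, start=2) if is_blank(row)]
data_rows = [row for row in rows if not is_blank(row)]

# Number of empty cells per column among the non-blank rows.
missing_counts = {
    column: sum(1 for row in data_rows if not (row[column] or "").strip())
    for column in rows[0]
}

# Years present in the dataset that have no row for a school you know well.
all_years = {row["year"] for row in data_rows}
local_years = {row["year"] for row in data_rows if row["school"] == "Adams High"}
missing_years = sorted(all_years - local_years)

print("Blank rows at file lines:", blank_lines)
print("Empty cells per column:", missing_counts)
print("Years missing for Adams High:", missing_years)
```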
It is important to note that datasets, especially those produced by government agencies such as a state’s Department of Education, are often produced and shared in order to fulfill a legislative mandate or to help public officials make educated decisions. Put another way, those datasets are rarely produced with journalists in mind, and are thus not designed to make the journalist’s job easy. The data may therefore use confusing terms of art, be organized in a seemingly haphazard way, and have some “messy” elements that need cleaning up for the purposes of producing data journalism.
Additionally, it is crucial to never assume that the data are complete and mistake-free, no matter what the data source is. Always approach a dataset with a skeptical eye and assume there may be some important mistakes and caveats — which is what you are looking to unearth during this initial meeting.
When you have a good sense of what the dataset covers, you are ready to start asking questions of it. Unlike human interviews, which tend to predominantly involve open-ended questions (e.g., “What do you think about the argument that standardized testing in Massachusetts isn’t fair to some racial groups?”), the process of interviewing data requires the interviewer to focus on closed-ended questions that feature quantifiable measures.
What this means is that you’re typically asking questions that offer you a ‘yes’ or ‘no’ answer about some relationship between variables, or offer you a list of values as a response. In fact, much of data journalism is anchored to a few basic question types:
What are the best/highest or worst/lowest data points (or groups)? (For example: “Which school (school) had the highest average math score (avg_math)?”)
What is the average for the data, or how do averages compare across groups? (For example: “Does District X (district) have a higher average English score (avg_eng) than District Y (district)?”)
What is the trend for the overall data (or specific data points/groups)? (For example: “Has District X (district) shown an increase or decrease in its achievement percentile in English (achievement_pct_eng) over the past five years (year)?”)
What is the relationship between variables within this context? (For example: “Is there a relationship between the proportion of non-Hispanic Whites (prop_white_nonhisp) and the achievement percentile in Math (achievement_pct_math)?”)
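To make the first three question types concrete, here is a minimal Python sketch that asks each of them of a tiny, made-up set of rows. The school names, districts, and scores are invented for illustration. Note that averaging school-level averages weights every school equally regardless of enrollment, which is a simplification you would want to flag in a story.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only.
rows = [
    {"year": 2021, "school": "Adams High", "district": "District X",
     "avg_eng": 498, "avg_math": 503, "achievement_pct_eng": 41},
    {"year": 2022, "school": "Adams High", "district": "District X",
     "avg_eng": 502, "avg_math": 507, "achievement_pct_eng": 44},
    {"year": 2021, "school": "Clark High", "district": "District X",
     "avg_eng": 490, "avg_math": 495, "achievement_pct_eng": 35},
    {"year": 2022, "school": "Clark High", "district": "District X",
     "avg_eng": 494, "avg_math": 499, "achievement_pct_eng": 38},
    {"year": 2021, "school": "Baker High", "district": "District Y",
     "avg_eng": 510, "avg_math": 515, "achievement_pct_eng": 60},
    {"year": 2022, "school": "Baker High", "district": "District Y",
     "avg_eng": 508, "avg_math": 512, "achievement_pct_eng": 58},
]

# 1. Highest/lowest: which school had the highest avg_math in the latest year?
latest_year = max(row["year"] for row in rows)
latest_rows = [row for row in rows if row["year"] == latest_year]
top_school = max(latest_rows, key=lambda row: row["avg_math"])

# 2. Comparing averages: unweighted mean of avg_eng across school rows, by district.
eng_by_district = defaultdict(list)
for row in rows:
    eng_by_district[row["district"]].append(row["avg_eng"])
district_avg = {name: mean(scores) for name, scores in eng_by_district.items()}
x_beats_y = district_avg["District X"] > district_avg["District Y"]

# 3. Trend: District X's mean achievement_pct_eng for each year, and the change
#    between the first and last year in the data.
pct_by_year = defaultdict(list)
for row in rows:
    if row["district"] == "District X":
        pct_by_year[row["year"]].append(row["achievement_pct_eng"])
trend = {year: mean(values) for year, values in sorted(pct_by_year.items())}
change = trend[max(trend)] - trend[min(trend)]

print("Top math school in", latest_year, "is", top_school["school"])
print("Average English score by district:", district_avg)
print("District X higher than District Y?", x_beats_y)
print("District X English percentile by year:", trend, "change:", change)
```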
It can be very tempting to ask an interesting and important question like, “Does the racial make-up of a school impact its test scores?”
However, to truly address that question, the analyst would need to control for different factors (take into account a range of other variables that might impact a relationship) in order to isolate the effect of race on test scores. Such analyses are often more complicated than what typical data journalists are comfortable doing because they involve an understanding of (a) more sophisticated statistical approaches and (b) the potential factors that should be accounted for in such an analysis.
In such cases, it is best to either partner with a researcher who has both the methodological and subject expertise (e.g., a public policy analyst who focuses on education) or to simplify your analysis by asking questions about bivariate (two-variable) relationships and contextualizing the shortcomings of the analysis within the story.
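One common way to summarize a two-variable relationship is the Pearson correlation coefficient, which ranges from -1 to 1. It is offered here as one option rather than the required approach, and the values below are made up for illustration. A strong correlation does not show that one variable causes the other, and it controls for nothing else.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / sqrt(var_x * var_y)


# Made-up school-level values for illustration only.
prop_white_nonhisp = [0.20, 0.35, 0.50, 0.65, 0.80]
achievement_pct_math = [30, 45, 40, 60, 70]

r = pearson_r(prop_white_nonhisp, achievement_pct_math)
print(f"Pearson r = {r:.2f} (describes association only, not cause)")
```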
While this may seem like a rather basic and limiting set of questions, it proves surprisingly versatile. For example, these questions can be used to examine if a local government initiative from last year has proven successful; if a pedestrian death at an intersection was an isolated incident or is indicative of a broader issue there; if some ‘conventional wisdom’ is actually supported by the available data; if disparities exist (and are improving or worsening) among different groups; and so on.
As you prepare your questions, you’ll already start to notice that the interviewee (the dataset) doesn’t have the expertise (variables) needed to answer some important questions. As you encounter these shortcomings, create a separate list of the expertise (variables) you wish you had access to. This will help guide you in your search for complementary datasets that you could potentially merge in order to ask questions that are not answerable by a single dataset alone — like the potential relationship between the median income of a community (gathered by the U.S. Census Bureau) and the average math score of its schools (gathered by the Department of Education).
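Merging two datasets comes down to finding a key they share. The sketch below assumes, purely for illustration, that both datasets can be matched on a district name and that the income figures have already been mapped to districts. The rows, the median_income field, and that matching are all hypothetical, and real merges often require cleaning the key first.

```python
# Made-up rows for illustration only.
school_rows = [
    {"district": "District X", "school": "Adams High", "avg_math": 503},
    {"district": "District Y", "school": "Baker High", "avg_math": 515},
    {"district": "District Z", "school": "Clark High", "avg_math": 497},
]
income_rows = [
    {"district": "District X", "median_income": 61000},
    {"district": "District Y", "median_income": 88000},
]

# Build a lookup from the shared key to the value we want to add.
income_by_district = {row["district"]: row["median_income"] for row in income_rows}

merged, unmatched = [], []
for row in school_rows:
    income = income_by_district.get(row["district"])
    if income is None:
        # Keep track of rows with no match instead of silently dropping them.
        unmatched.append(row["school"])
    else:
        merged.append({**row, "median_income": income})

print("Merged rows:", merged)
print("Schools with no income match:", unmatched)
```

Keeping a list of unmatched rows matters because a merge that quietly drops schools can change the answers to every question you ask afterward.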
Similarly, as you begin asking your questions, the answers you get from the dataset may spark new questions. For example, you may find that schools in the western portion of Massachusetts have worse test scores than those in the eastern portion of the state, which might then raise new questions about local school funding and state aid to different districts that you could explore. This process is similar to the art (and science) of follow-up questions in interviews with humans, but the nice part is that datasets are far more patient than a human being. Put another way, the interview can last as long as you like and you can ask many different variations of similar questions in order to get at the most interesting and insightful answer.
You will also find that your interview raises questions involving things that are either not easily quantified or that have not yet been quantified in a systematic manner. Put another way, your interview with a dataset may raise questions better answered by a human source. Be sure to write down those questions, too, and prepare yourself to pick up that phone or send off that e-mail for an interview request.