A census refers to a count (or dataset) that includes observations about all entities that a data analyst is interested in. For example, if I had a complete log of every time one of my students opened a file on Moodle — and I do, thanks to our digital enclosures — then I would have a census of those data. Every possible interaction with that file has been logged, so any general observations I make from those data would be applicable to all of my students.
In an ideal world, data analysts would always have access to a census of the data they are interested in. Unfortunately, this is often not the case, as researchers tend to lack the resources necessary to gather information from the entire population they are interested in. For example, it costs more than $10 billion for the United States to conduct its decennial census of all people living in the country. That is far more than any group of researchers (or data journalists) could possibly afford if they wanted to conduct a poll of attitudes toward the President of the United States.
To overcome that practical challenge, analysts often draw on a sample of the population. A sample simply refers to a relatively small slice of the population the analyst is interested in learning about. The analyst would then gather information from those entities (e.g., people) and project the results of their analysis to the entire population. This act of projection is called “inference” or “generalization.”
Whether a dataset includes a census or a sample of the population it covers (e.g., people, schools, golden doodles) is one of that dataset’s most important features. That’s because if the dataset is based on a sample that is too small or biased in some way, the data captured within it may not accurately represent the phenomena the dataset purports to capture.
For example, if I only had those Moodle data about my few Honors students, then that might give me a fairly inaccurate sense of when, and how often, all of my students interacted with that file. Alternatively, if those data only captured interactions between 9 a.m. and 5 p.m., then that might not give me an accurate sense of the interactions over the course of the entire day. Put another way, these two different slices of the data have certain biases that impact my ability to generalize from them to the broader population.
There are two main types of sampling strategies that data journalists should be aware of: probability sampling strategies and non-probability sampling strategies.
Probability sampling involves some process or procedure that ensures that every constituent of the population you are interested in has a known, non-zero (and often equal) probability of being chosen. In contrast, non-probability sampling offers no such guarantee.
Let’s start with some examples of non-probability sampling. These include convenience samples, snowball samples, purposive samples, and quota samples.
With a convenience sample, members of the population are selected based on ease of access. For example, to learn about college students at UMass, I may draw on information from students in my class because they are accessible to me — not because they are representative of all undergraduate students at UMass. The obvious limitation here is that this convenience sample likely has some biases (including a high likelihood that its members are Journalism majors).
With a snowball sample, members of the population are selected through referrals from another person. For example, HIV-positive males may refer other HIV-positive males they know, who over time come to comprise the sample. The obvious limitation here is that people tend to know (and feel comfortable referring) people who are like them, which introduces some biases (e.g., similarity in attitudes).
With a purposive sample, members of the population are selected because of some characteristic they have that the researcher deems appropriate. For example, the researcher may only be interested in the students who are in the top 5% of 10 local high schools. There is also a deviant case strategy, where members are recruited based on the extent of their difference from the norm. Purposive samples can be very useful in specific contexts, but they simply can’t be generalized to the entire population (e.g., all students at those high schools).
With a quota sample, some quota is established beforehand and people are added in a non-probabilistic manner until that threshold is met. For example, the researcher may decide that 60 percent of the people in a sample of 100 grocery shoppers must be under the age of 22, and the researcher will select the first 60 people they come across at the grocery store who fit that criterion. The fact that they relied on the first 60 people can introduce some biases (e.g., an overrepresentation of early-morning shoppers).
Now, let’s consider the alternative: probability sampling. Probability sampling strategies include simple random samples, systematic random samples, and stratified samples.
It is crucial to understand that the use of the word “random” does not imply haphazard selection. In most instances — but not all — random sampling is superior to non-random sampling if the goal is to generalize from the observations.
With a simple random sample, members of the population are just randomly selected from a larger population. For example, I might put the names of every one of my students in a hat and draw half the names. Because every student’s name was in there, and there is only one slip of paper per student, everyone had an equal chance of being chosen.
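As a minimal sketch of that drawing-from-a-hat process (using a hypothetical roster of names that I made up for illustration), Python’s built-in random module can do the drawing for us:

```python
import random

# Hypothetical class roster (the slips of paper in the hat)
students = ["Ana", "Ben", "Carla", "Deepa", "Eli", "Farah", "Gus", "Hana"]

# Draw half the names without replacement; random.sample() gives every
# student the same chance of being selected.
sample = random.sample(students, k=len(students) // 2)
print(sample)
```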
With a systematic random sample, members of a population are selected following some systematic pattern. For example, the researcher might have a list of names, pick a random starting point, and then select every Xth name after that. Again, everyone has an equal chance of being picked so long as the starting point is chosen at random and the interval is fixed.
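A similarly minimal sketch of systematic sampling, assuming a hypothetical ordered roster and an interval of every fourth name:

```python
import random

# Hypothetical ordered roster of 20 names
roster = [f"Student {i}" for i in range(1, 21)]

interval = 4                          # take every 4th name
start = random.randrange(interval)    # random starting point within the first interval
sample = roster[start::interval]      # every 4th name from that point onward
print(sample)
```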
With a stratified sample, the population is divided into subgroups and then random selection is applied to each subgroup. For example, the researcher may decide that 60 percent of the people in a sample of 100 grocery shoppers must be under the age of 22, and the researcher will then randomly select 60 people from among all grocery shoppers who were under the age of 22.
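Here is a sketch of that stratified approach, using a made-up population of shoppers with assigned ages (none of this reflects real data):

```python
import random

# Hypothetical population of grocery shoppers: (name, age), with ages 16-80
shoppers = [(f"Shopper {i}", 16 + i % 65) for i in range(2000)]

# Divide the population into the two strata described above
under_22 = [s for s in shoppers if s[1] < 22]
at_least_22 = [s for s in shoppers if s[1] >= 22]

# Randomly select within each stratum to meet the 60/40 quota
sample = random.sample(under_22, 60) + random.sample(at_least_22, 40)
print(len(sample), "shoppers selected")
```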
These strategies have two things in common: First, every element has a known non-zero probability of being sampled. Second, they involve random selection at some point.
It is important to understand that no one sampling strategy is inherently better than another. However, if a sample is non-probabilistic, then we should not attempt to generalize from it to a broader population because it is likely to be biased in ways we cannot quantify. We should only describe the sample we have obtained.
If you have a probabilistic sample and want to generalize from it, then it is important that the sample represents the population being generalized to.
For example, if you are aiming to generalize to the U.S. population, then your sample should have the same characteristics as the U.S. population. In contrast, if you are aiming to generalize to adult females living in Western Massachusetts, then that is what the sample should consist of.
Every sample contains some amount of sampling error, which refers to the difference between a statistic and its ’true’ value. Put another way, if you were to gather information from two different samples, chances are you would get different results. Ideally, those differences should be small (and have insignificant practical effects). However, the differences can also be quite large — especially if the sample was poorly constructed, whether in size or recruitment process.
It is possible to estimate the sampling error using statistics in order to provide a measure of the reliability of the findings. For example, when researchers report survey results (e.g., presidential polls), you will often see a statistic labeled “margin of error.” This is intended to describe the degree of certainty associated with some finding.
Let’s illustrate this by considering the following statement: “We can state with 95% confidence that 82% of people believe Prof. Zamith is a magnificent soccer player, with a margin of error of 3%.”
We can interpret that statement as follows: if we sampled the same population 100 times, then in roughly 95 of those 100 samples, between 79% and 85% of respondents would say they believe Prof. Zamith is a magnificent soccer player. (Unfortunately, these people are quite wrong.)
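To see where such numbers come from, here is a minimal sketch of the standard margin-of-error calculation for a proportion. The sample size of 630 is a hypothetical value (the example above does not state one); it is roughly what would produce a 3% margin of error at a 95% confidence level, assuming a simple random sample:

```python
import math

p = 0.82   # observed proportion (82% agree)
n = 630    # hypothetical sample size (not given in the example above)
z = 1.96   # z-score associated with a 95% confidence level

# Margin of error for a proportion: z * sqrt(p * (1 - p) / n)
moe = z * math.sqrt(p * (1 - p) / n)

print(f"Margin of error: {moe:.1%}")                  # ~3.0%
print(f"Interval: {p - moe:.1%} to {p + moe:.1%}")    # ~79.0% to 85.0%
```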
In practice, data journalists rarely describe things in that manner. Instead, they just look for a reasonable confidence level (e.g., above 90%, though higher is better) and report the margin of error, ideally contextualizing that latter figure.
For example, if a poll with a 4% margin of error finds that 48% of people believe that Prof. Zamith is the best soccer player of all time while just 46% of people believe the same for Pelé, the journalist should note that neither player is a clear favorite. That’s because my ’true’ percentage might actually be as low as 44% while Pelé’s might be as high as 50%, due to the estimated sampling error.
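A rough way to check for that kind of overlap is sketched below. This is a simplification that applies the same margin of error to both figures and ignores the additional error on the difference between them:

```python
def interval(share, moe):
    """Return the (low, high) range implied by a margin of error."""
    return share - moe, share + moe

moe = 0.04                      # 4% margin of error
zamith = interval(0.48, moe)    # (0.44, 0.52)
pele = interval(0.46, moe)      # (0.42, 0.50)

# If the two ranges overlap, the poll cannot distinguish the two figures,
# so the result should be reported as a statistical tie.
overlap = zamith[0] <= pele[1] and pele[0] <= zamith[1]
print("Statistical tie" if overlap else "Clear leader")
```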
This important context is something that journalists without a background in data analysis often miss. For example, political journalists will often report poll figures claiming Candidate A is ‘ahead’ of Candidate B by 2 percentage points (48% vs. 46%) and fail to note that the poll had a margin of error of 4%. A more sophisticated journalist would frame the story as the candidates being in a statistical tie, or as a race that is neck and neck.
Many of the lamentations about the polling ‘failures’ of 2016 (and, to a lesser degree, 2020) failed to note that many polls had Donald Trump (and other ‘underdogs’) in statistical ties. This means that while polls tended to show Hillary Clinton (and other candidates) ahead — which understandably inspired confidence that those candidates were more likely to win their elections — the end result was actually within the parameters of many estimates. Put another way, that result was always a possibility (and not at all a negligible one) given the way things were measured.
In light of this, we can argue that the ‘failure’ was less statistical in nature and more in terms of how journalists (and pollsters) contextualized the information — or failed to, in this case — for their audiences.
In order to properly contextualize a margin of error, it is also important to understand a few caveats about it.
First, the margin of error is always an estimated statistic. It is a product of mathematical assumptions and does not reflect the quality of the sampling strategy (i.e., you can always calculate a technically valid but meaningless statistic about a non-representative sample). The key thing to understand, then, is that a poll that only surveys people by calling their landlines (i.e., excluding cell phones) might have a low calculated margin of error but still miss the ‘true’ value for the general population by a wide mark because it systematically omitted members of that population (e.g., college students, who typically only have cell phones).
Second, surveys that have really wide margins of error are usually under-powered, meaning that the sample is too small to produce precise estimates of the population of interest. In order to make the margin of error smaller, the researchers would need to survey more people (which adds to their costs). Journalistic outlets like FiveThirtyEight will often rate a poll based on their assessment of its methodology (i.e., the sampling strategy and the way questions were asked) as well as its sample size.
Third, the margin of error should be understood relative to the population size when that margin is reported as an absolute figure (rather than a percentage). For example, the Bureau of Labor Statistics might note that their unemployment figures have a margin of error of 110,000 people. That may seem large, but relative to the population they’re generalizing to (job-seeking adults, a group that exceeds 150 million people), it works out to less than a tenth of one percent, which is a pretty small margin of error.
There are tools that allow data journalists to estimate the sample size needed in order to have a certain margin of error (within a specified confidence level). Unless you are collecting your own data, this is likely to be more educational than practical, though. Thus, as long as you understand how to interpret sampling error, its limitations, and how to report the statistic for a general audience, you are in good shape.
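As a sketch of what such a tool does under the hood, the calculation below assumes a simple random sample from a large population and uses the most conservative proportion (50%); the function name and defaults are illustrative, not part of any particular library:

```python
import math

def required_sample_size(moe, z=1.96, p=0.5):
    """Estimate the sample size needed for a given margin of error.

    Assumes a simple random sample from a large population; p = 0.5 yields
    the most conservative (largest) estimate for a proportion.
    """
    return math.ceil((z ** 2) * p * (1 - p) / moe ** 2)

print(required_sample_size(0.03))   # ~1,068 respondents for a 3% margin of error
print(required_sample_size(0.04))   # ~601 respondents for a 4% margin of error
```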
When it comes to sampling error, a census is always better than a sample for the simple reason that a census has no sampling error. Put another way, if you have everything, there is no need to estimate anything (i.e., make an inference). You can just describe aspects of the dataset and you will be speaking to the entire population.
(This observation only applies to a true census — that is, one where you have access to data about the entire population. The U.S. Census, for example, is not a true census because at least some people don’t respond to it. Instead, the U.S. Census Bureau estimates responses for non-respondents, which introduces some degree of error.)
While a census is better than a sample when it comes to sampling error, it is important to note that a good sample can be more useful in a practical sense. That’s because a census can be too big and cumbersome for the typical analyst to analyze — and, in some cases, be too large for their off-the-shelf computer to process. In that case, a proper sample might tell you just as much and be easier to work with.
There are entire classes dedicated to sampling theory, and some data journalists (and other journalists who often work with studies, such as science journalists) do end up taking those. However, for now, just remember a few key things.
First, samples can be probabilistic or non-probabilistic. The type of sample affects how inferences to the population should be made.
Second, the sample should always be representative of the population it is trying to generalize to. A bad sample can’t be saved by any sort of statistical wizardry.
Third, every sample will have some sampling error, which is indicative of how likely it is that a reported statistic reflects its ‘true’ value. Keeping that sampling error low helps ensure a finding is reliable, but the estimate of that error (e.g., the margin of error) comes with its own limitations.
Fourth, if you have a census, you don’t have to worry about sampling error, and descriptive statistics are usually the tools of choice in those cases.