Working With Messy Data

Introduction

Many datasets that might be of interest to a data journalist come “messy” and thus need some “cleaning” before they can be used in a data analysis.

While the term “cleaning” may sound nefarious here, it is really not. It is part of the behind-the-scenes grunt work that data journalists spend long hours doing before they can get to the more enjoyable data analysis stage and produce spiffy visualizations.

Common Issues in Messy Datasets

“Messy” data simply refer to datasets that contain a variety of correctable errors. Such errors may include inconsistent formatting, variations in the spelling of a distinct unit, and missing values, among other minor issues.

For example, a data journalist may come across a column on revenue and find its values displayed in a number of different ways (some using thousands separators while others do not) or in different currencies (e.g., U.S. Dollars, Euros, Romanian Lei) that need to be converted into a single currency (e.g., U.S. Dollars) in order to permit an apples-to-apples comparison.
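
To make this concrete, the short sketch below shows one way to standardize such a column in R. The column names, the example values, and the exchange rates are all made up for illustration; in practice, the rates would need to come from a reliable source for the relevant time period.

    # Hypothetical revenue values with mixed formatting and mixed currencies
    revenue <- data.frame(
      amount   = c("1,200,000", "950000", "2,500,000"),
      currency = c("USD", "EUR", "RON"),
      stringsAsFactors = FALSE
    )

    # Strip the thousands separators so the text can be read as numbers
    revenue$amount_num <- as.numeric(gsub(",", "", revenue$amount))

    # Convert everything to U.S. Dollars using made-up exchange rates
    rates <- c(USD = 1, EUR = 1.08, RON = 0.22)
    revenue$amount_usd <- revenue$amount_num * unname(rates[revenue$currency])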

A dataset may also have what should be a distinct unit (e.g., United States) be referred to in different ways, such as U.S., U.S.A., US, USA, United States of America, America, and, of course, United States. Furthermore, many data analysis tools are case-sensitive, meaning that they will treat USA (all uppercase) differently from usa (all lowercase).
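
As a rough illustration, the R sketch below standardizes a hypothetical country column by trimming whitespace, normalizing the case, and recoding known variants to a single canonical label; the variants listed are just examples, and a real dataset may contain others.

    # Hypothetical column with several spellings of the same country
    country <- c("U.S.", "USA", "usa", "United States of America", "United States")

    # Normalize case and whitespace first
    country <- toupper(trimws(country))

    # Recode the known variants to one canonical value
    variants <- c("U.S.", "U.S.A.", "US", "USA", "UNITED STATES OF AMERICA", "AMERICA")
    country[country %in% variants] <- "UNITED STATES"

    table(country)  # every row now counts toward the same category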

A data journalist may also find themselves with a dataset that has some missing values (i.e., blank cells). Missing values can be especially problematic, and how to deal with them will depend on the type of data that’s missing, the amount of data that’s missing, and your statistical skills. A variable with a few missing values can often be cleaned (or the missing observations noted), but one with many missing values must sometimes be omitted. Additionally, missing data can sometimes be represented in different ways within the same dataset (e.g., as an empty value, as NA, or as NULL) — an inconsistency that needs correction.
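
A common first step, sketched below in R, is to tell the import function which strings should be treated as missing and then count how many values are missing in each column. The file name and the particular placeholder strings are assumptions about what the dataset might look like.

    # Treat empty cells, "NA", and "NULL" as missing when reading the file
    # ("revenue.csv" is a hypothetical file name)
    df <- read.csv("revenue.csv", na.strings = c("", "NA", "NULL"))

    # Count the missing values in each column to see how big the problem is
    colSums(is.na(df))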

Value of Clean Data

It is important to have a clean dataset to work with because messy data can skew one’s analysis in important ways.

If a dataset has instances where the name of a country is misspelled, a data journalist may be unable to correctly map the dataset in an automated fashion. This often happens when data journalists use services like Google Maps to automatically guess a location based on some string (e.g., Washington D.C.); if a location’s name is misspelled, Google Maps may place the point in a completely different part of the world.

If there are variations in the spelling of some entry, a journalist will be unable to get the correct sum of all instances of that entry. Instead, they might get three different sums that should be part of a whole (e.g., Massachusetts, Mass., and MA). If those three sums are below some threshold, they may be grouped into an “Other” category instead of standing out as a single category, as they should.
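
The R sketch below, using made-up figures, shows how a single total can get split across three labels and how recoding the variants restores the correct sum.

    payments <- data.frame(
      state  = c("Massachusetts", "Mass.", "MA", "Vermont"),
      amount = c(100, 250, 75, 300),
      stringsAsFactors = FALSE
    )

    # Before cleaning: three separate (and misleadingly small) sums
    aggregate(amount ~ state, data = payments, FUN = sum)

    # After recoding the variants, the sums combine as they should
    payments$state[payments$state %in% c("Mass.", "MA")] <- "Massachusetts"
    aggregate(amount ~ state, data = payments, FUN = sum)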

Missing values can be especially problematic if they are not randomly distributed. For example, let’s say a journalist obtained a dataset that included information about the average height of male and female basketball players and, for the purposes of simplicity, decided to exclude players that did not have their height listed. If more data were missing among tall female players than among any other subgroup, the analysis would indicate a bigger discrepancy between male and female heights than is actually the case.
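
A toy simulation in R along these lines illustrates the point; every number in it is invented purely for demonstration. Because the missing values are concentrated among tall female players, dropping them pulls the female average down and exaggerates the gap.

    set.seed(1)
    male   <- rnorm(100, mean = 200, sd = 8)   # heights in cm (made-up values)
    female <- rnorm(100, mean = 185, sd = 8)

    # Suppose heights are disproportionately missing for the tallest female players
    female_observed <- female[female < 190]

    mean(male) - mean(female)           # true gap (roughly 15 cm)
    mean(male) - mean(female_observed)  # gap after dropping missing values (larger)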

How to Clean Messy Datasets

There are many specialized tools for identifying common errors in datasets. For example, OpenRefine includes advanced features like clustering algorithms and regular expressions that make it easy to find similar variations of a distinct entry (United States and Untied States) and make corrections en masse. If you are working with a large dataset (e.g., thousands of rows, or larger), it may make sense to start with an advanced tool.
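
OpenRefine itself is a point-and-click tool, but the underlying idea of flagging near-duplicate entries can be sketched in a few lines of R using edit distance. This is only a rough stand-in for OpenRefine’s clustering features, and the entries are hypothetical.

    entries <- c("United States", "Untied States", "United  States", "Canada")

    # Pairwise edit (Levenshtein) distances between the entries
    d <- adist(entries)
    rownames(d) <- colnames(d) <- entries

    # Small distances between different strings suggest likely variants of the same entry
    d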

If you have a small dataset, however, it might be wiser to simply use spreadsheet software like Google Sheets or LibreOffice Calc to eyeball the data and perform filtering operations to identify problems. Then, you can make simple corrections by editing individual cells or by finding and replacing values across the spreadsheet. Software like R can also be useful for performing simple “sanity checks,” such as checking the maximum and minimum values of a numerical variable or the number of unique values within a string variable.
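
For example, a few one-line checks in R (shown below with hypothetical column names) can quickly reveal impossible values or unexpected spellings.

    # Assuming a data frame named df with a numeric 'revenue' column
    # and a character 'country' column (both names are hypothetical)
    range(df$revenue, na.rm = TRUE)   # minimum and maximum: are any values impossible?
    summary(df$revenue)               # quartiles, mean, and the number of NAs

    unique(df$country)                # how many distinct spellings are there?
    table(df$country)                 # how often does each spelling appear?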

It is important to note that not all errors can be cleaned. Sometimes, data may have simply been collected in a poor fashion, and there is just no remedy for that on your end. However, a good portion of the data you will encounter today — especially data collected more recently and through an entirely digital workflow — will not need any cleaning. Whenever that is the case, be sure to thank the Data Gods!

Ultimately, it is worth keeping in mind that just because your dataset is messy, it does not necessarily mean the data in it are incorrect or problematic. It is certainly something to consider in your evaluation of the quality of the dataset and its source, but it may simply be the result of an organization having limited resources or working off messy original data of their own (e.g., handwritten traffic tickets). If you decide that you trust the data and the data source, just be sure to inspect it to determine if it needs some cleaning — and, if so, to clean it as best you can before moving forward with an analysis.