3. Data Cleaning

Key Questions & Concepts

What is data cleaning?
How can it benefit your work?
Can you avoid it?
How should you approach the task?

Introduction: Why is Data Cleaning Important for Hotspotting?

Data cleaning sounds like a chore, one you might like to avoid. Unlike having to clean your room before you can go out to play, however, cleaning and playing with your data really are closely connected. Data cleaning is an essential part of your analysis, not just a clerical task.

If your data haven’t been cleaned, your results can’t be trusted; you can’t reliably tell the difference between insight and anomaly. Working with hundreds of thousands of hospital claims records, you are guaranteed to discover errors, and guaranteed to miss many more. In this module we’ll discuss where these endemic data errors come from and why knowing their sources can be valuable in keeping you from wasting your time or introducing new errors of your own. We will discuss the types of data error you are likely to encounter and how to detect them. We’ll talk about how to document and correct data errors and how to decide when your data are clean enough.

A fundamental procedure in data cleaning is partitioning data quality concerns into categories so that you can efficiently apply consistent solutions to the largest possible class of problems at a single time (what Hadley Wickham has called the “split-apply-combine strategy”1). To that end, after discussing more universal data cleaning structures and strategies in section 3.1 and techniques and rules in section 3.2, sections 3.3 and 3.4 will deal with the special cases of unifying conflicting, duplicate, or unlinked records that refer to the same patient (probabilistic linkage) and of standardizing and locating patient addresses (geocoding). Geocoding and linkage are such common classes of problem that specialized software tools and techniques have grown up around them. We’ll return to geocoding tools in Module 8, when we discuss spatial analysis.
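To make the split-apply-combine idea concrete, here is a minimal sketch in Python. The records, field names, and fixes are illustrative assumptions, not fields from any actual claims extract: each record is assigned to one data-quality class (split), a single consistent correction is applied per class (apply), and the groups are reassembled (combine).

```python
from collections import defaultdict
from datetime import datetime

def valid_date(s):
    """True if s parses as a YYYY-MM-DD calendar date."""
    try:
        datetime.strptime(s, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical claims records with typical problems.
records = [
    {"id": 1, "dob": "1980-02-30", "zip": "08103"},  # impossible date
    {"id": 2, "dob": "1975-06-14", "zip": "8103"},   # ZIP lost its leading zero
    {"id": 3, "dob": "1975-06-14", "zip": "08103"},  # clean
]

def classify(rec):
    """Split: assign each record to one data-quality class."""
    if not valid_date(rec["dob"]):
        return "bad_dob"
    if len(rec["zip"]) != 5:
        return "bad_zip"
    return "clean"

groups = defaultdict(list)
for rec in records:
    groups[classify(rec)].append(rec)

# Apply: one consistent solution per class of problem.
def pad_zip(rec):
    rec["zip"] = rec["zip"].zfill(5)  # restore leading zeros
    return rec

def flag_dob(rec):
    rec["dob_flag"] = "invalid"  # document the error rather than guess a fix
    return rec

fixes = {"bad_zip": pad_zip, "bad_dob": flag_dob, "clean": lambda r: r}

# Combine: reassemble the cleaned dataset.
cleaned = [fixes[cls](rec) for cls, grp in groups.items() for rec in grp]
```

The payoff of partitioning first is that each fix is written once and applied uniformly, instead of being re-invented record by record.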

What is Data Cleaning?

What we call “data cleaning” is, in other cultural and disciplinary contexts, called data “cleansing”, “preparation”, “standardization”, “validation”, or “quality control”. Data cleaning is editing the data you have to make them sufficiently accurate and well structured to support the analysis you want to perform. The challenge is to increase the proportion of your dataset that fulfills your requirements without losing information, while minimizing the effort and time spent on errors, attributes, or records that have no significant impact on your analysis.
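That “proportion of your dataset that fulfills your requirements” can itself be measured, which gives you a way to track progress as you clean. A minimal sketch, with assumed field names and requirements (a non-empty birthdate and a five-digit ZIP):

```python
# Each requirement maps a field to a predicate it must satisfy.
requirements = {
    "dob": lambda v: bool(v),
    "zip": lambda v: len(v) == 5 and v.isdigit(),
}

# Hypothetical records; two of the three fail a requirement.
records = [
    {"dob": "1975-06-14", "zip": "08103"},
    {"dob": "",           "zip": "08103"},  # missing birthdate
    {"dob": "1980-01-01", "zip": "8103"},   # malformed ZIP
]

def passes(rec):
    """True if the record satisfies every requirement."""
    return all(check(rec.get(field, "")) for field, check in requirements.items())

clean_fraction = sum(passes(r) for r in records) / len(records)
```

Recomputing `clean_fraction` after each cleaning pass tells you whether further effort is still paying off, which is the practical trade-off the next paragraph describes.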

All of which is to say, data cleaning is a practical task. It is a trade-off.

It does aim to make the data a “truer” representation of the patients and events you are interested in, but the end result is not anything perfectly clean. At several points in the process, you will encounter the limits of what you have the time to fix or the ability to know with the data you have on hand. Even in the best circumstances, undecidable anomalies and undetected errors will remain. This limitation is inherent in simplifying people into a database. It is the price we pay for the clarity that allows us analytic insight.

Data cleaning is also an unparalleled opportunity to get to know the data systems, people, and institutions of your community. While we strive to make it quicker and more efficient, there is no substitute for having some portion of your hotspotting team dive into the granular details of the messy data while still keeping an eye on the dataset as a whole. Data cleaning is an iterative process of investigation, pattern detection, trial solutions, and errors that lead to a deeper understanding of the data. While some amount of trial and error can be eliminated by forethought (and this toolkit) and the input of subject matter experts, getting to know your dataset is a valuable result, not just a side effect of the cleaning process.

Just as we look to outliers for clues to systemic health system failure, we can look to data errors to grasp the limits of our analysis and of a hospital system’s understanding of its patients. Even if the whole process of cleaning could be automated, we would want someone to go behind the scenes to know what mess was being cleaned up.

How to Clean Data

A formal framework for understanding data cleaning is hard to find. Those that do exist tend to be abstract, often academic. Most also seem to come from a “quality management” background, assuming that the reader is a data producer or a manager of industrial data production processes. In contrast, working with claims data, we are attempting to bend an existing data production system to meet secondary needs. We can do little, if anything, to control the production process. For the best chance of success we need to understand the structure, influences, and sources of the errors that we will mitigate after the fact. With that in mind, we attempt to synthesize the available theories and add to them our own experience with person-level record wrangling to yield… 3.1 A Framework for Data Cleaning: Structures & Strategies.

With an understanding of the conceptual structure of and approach to data cleaning, we can then move into the practical methods and heuristics in section 3.2 Data Cleaning Techniques. As always, the specific steps you will use need to be adapted to the peculiarities of your dataset, but the techniques discussed should give you a strong place to start.

While all of the same general data cleaning concepts apply to data about patient identities, the attempt to link and reconcile disconnected, conflicting records that may or may not represent a single individual calls for specialized tools and domain-specific knowledge: about names, birthdates, social security numbers, and family relationships, not to mention the various motives a person in complex and vulnerable circumstances might have for obscuring these facts. To that end, we will demonstrate one possible linkage tool, discussing both our general approach and some specific techniques and population-specific caveats that are avenues of ongoing research at the coalition, in section… 3.3 Linking Patient Records.
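The core idea behind probabilistic linkage can be sketched in a few lines. This toy version, with weights, fields, and threshold chosen purely for illustration (not the coalition's production settings), scores a candidate pair by weighted agreement across identifying fields, using fuzzy comparison for names so that “Jon” and “John” still contribute evidence of a match:

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    """Fuzzy string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted agreement across identifying fields (weights are assumptions)."""
    score = 0.0
    score += 0.5 * name_sim(rec_a["name"], rec_b["name"])
    score += 0.3 * (rec_a["dob"] == rec_b["dob"])
    score += 0.2 * (rec_a["ssn4"] == rec_b["ssn4"])  # last four digits only
    return score

# Two hypothetical registrations that likely describe one person.
a = {"name": "Jon Smith",  "dob": "1962-03-07", "ssn4": "1234"}
b = {"name": "John Smith", "dob": "1962-03-07", "ssn4": "1234"}

LINK_THRESHOLD = 0.85  # assumed cutoff; in practice, tuned against hand-reviewed pairs
is_match = match_score(a, b) >= LINK_THRESHOLD
```

Real linkage tools elaborate on exactly this structure: better field comparators, weights estimated from the data rather than assumed, and a middle band of scores routed to human review instead of a single hard threshold.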


  1. Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1). http://www.jstatsoft.org/v40/i01/paper.