3.1 Data Cleaning Framework

  1. Definitions
  2. What are clean data like?
  3. Why are data dirty?
  4. What are the steps of data cleaning?

Cleaner language for cleaner data…

One aim of data cleaning is to make data more precisely descriptive and better able to communicate to multiple users. That effort can itself be difficult to describe without first defining some specialized terms. Many of the terms in the glossary below are used in the definitions of other terms (with the exception of “data”, which occurs too frequently to call out).

data: information that has been recorded
error: anything wrong with data
dataset: a particular collection of data
data system: anything outside of a dataset that creates or shapes the data
entity: any general category of noun that you might wish to represent in a dataset
attribute: any adjective or descriptive quality of an entity that you might wish to represent in a dataset
schema: the set of choices made about how to represent entities of interest in a database
variable: a named data element used for storing the particular values of an attribute of an entity
table: a logical description of a collection of similarly structured data in which, generally, entities are represented by rows and attributes are represented by columns containing variables
data element: an individual piece or unit of data
metadata: data about data, including relationships among data elements and relationships between data and the world outside the dataset
database: a software system for storing data and metadata in an organized way
relationship: the logical connections between data elements
flat data: data for which different types of entities are stored in a single table
relational data: data for which relationships are encoded in the structure of the database, and different types of entities (e.g., patients and visits) are stored in separate tables
instance: a particular example of a data element
id field: a variable used to identify or refer to an instance of an entity in a dataset
unique id: a data value that is associated with only one instance of a given entity
value: the particular data stored in an instance of a variable
data quality flag: a variable used to explicitly indicate the presence, suspicion, and/or type of an error in a given record or variable
intrinsic evidence: information about the presence or absence of an error that can be gleaned using only the dataset containing the error
extrinsic evidence: information about the presence or absence of an error that can be gleaned only by looking outside of a given dataset
normalize: to reduce redundancy in a database by storing distinct types of entities in distinct tables and clearly defining the relationships between them (sketched below)
data-type: the definition of the kinds of values a variable can store
wide / narrow: a table is “wide” when it has a large number of variables or columns; a variable is “wide” when its data-type allows for the storage of much more information than the variable requires. “Narrow” tables store few attributes in many rows.
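
To make “flat data”, “relational data”, and “normalize” concrete, here is a minimal sketch in Python with pandas; the patients-and-visits table and its column names are hypothetical:

    import pandas as pd

    # A hypothetical flat table: patient attributes repeat on every
    # visit row, so two entity types (patients, visits) share one table.
    flat = pd.DataFrame({
        "patient_id":   [1, 1, 2],
        "patient_name": ["Ada", "Ada", "Grace"],
        "visit_date":   ["2021-01-05", "2021-02-10", "2021-01-07"],
    })

    # Normalizing: store each entity type in its own table and encode
    # the relationship through the id field, removing the redundancy.
    patients = flat[["patient_id", "patient_name"]].drop_duplicates()
    visits = flat[["patient_id", "visit_date"]]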

With some basic data and metadata terminology defined, we can describe what we mean by “clean” data.

Clean data display the qualities of…

Data Cleanliness Typology

Extrinsic Qualities
Accountability: The data creation path can be traced.
Reliability: The data are consistently generated in a robust way.
Completeness: The dataset includes everything it should.
Reality: The dataset excludes anything fake.
Veracity: The information in the dataset is true.

Intrinsic Qualities
Uniqueness: Unique entities in the data occur only once.
Logical Consistency: The data do not contradict each other.
Possibility: No variables have values that are impossible.
Precision: Values have as much detail as is available.
Format Consistency: Variables of the same type follow the same pattern.
Format Descriptiveness: It is clear what each variable describes.

No real dataset is completely clean, and even if one were, we could never know it. Clean data qualities like Reality, Uniqueness, and Veracity can mostly be tested only against extrinsic evidence, and we can never exhaust the external (and often inaccessible) sources of evidence that could undermine the cleanliness of the data. On the other hand, this very real limit on what we can know about the data does not prevent us from productively interrogating a dataset for errors.
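
Several of the intrinsic qualities above, for instance, can be interrogated with nothing but the dataset itself. A minimal sketch in Python with pandas, using hypothetical column names and an assumed validity rule for age:

    import pandas as pd

    # Hypothetical records with deliberately suspicious values.
    df = pd.DataFrame({
        "patient_id": [1, 2, 2, 3],
        "age":        [34, 210, 210, -1],
        "visit_date": ["2021-01-05", "01/07/2021", "2021-02-10", "2021-03-02"],
    })

    # Uniqueness: unique entities should occur only once.
    duplicated = df[df.duplicated(subset="patient_id", keep=False)]

    # Possibility: no variable should hold an impossible value
    # (the 0-120 range for age is an assumed rule, not a universal one).
    impossible_age = ~df["age"].between(0, 120)

    # Format Consistency: variables of the same type should follow the
    # same pattern (here, ISO 8601 dates).
    bad_date = ~df["visit_date"].str.fullmatch(r"\d{4}-\d{2}-\d{2}")

    # Record findings in data quality flags rather than silently
    # changing any values.
    df["age_quality_flag"] = impossible_age
    df["date_quality_flag"] = bad_date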

Before you begin interrogating the data, a sympathetic understanding of how errors are generated can help you anticipate and make sense of the errors you will discover.

Dirty data are created by…

Sources of Data Error

Technical
Malfunction: The data system didn’t work as designed; the technology failed.
System Limitation: The design or standard operating procedure of the data production system distorted the data.

Clerical
Input Mistake: Data entry created errors.
Editing Mistake: Data modification created errors.

Semantic
Manipulation: Data were intentionally distorted by a participant in the data production process.
Misperception: A participant in the data production process misread, misheard, or otherwise misperceived the information to be recorded.
Misunderstanding: Participants in the data production process did not know (or know how) to correctly fulfill the intent of their data production role.

With a sense of the variety, origin, and inevitability of data errors, you can begin to clean up errors without an unreasonable expectation of data perfection.  In its ideal form, the process of data cleaning follows the same steps for each error.

Data cleaning cycles through these steps…

Data Cleaning Workflow

Identify: What is the data error you see? Is it really an error at all?
Quantify: How common and extensive is the error? Is it pervasive enough (in the entire dataset or among a critical subset of records) to warrant remediation? Whether or not the error pervades the dataset, is it easy enough, or even possible, to fix with a reasonable amount of time and effort?
Document Error: Characterize the error, its suspected causes, and its prevalence.
Decide: Choose an appropriate response to the error. Sometimes the response will be simply to document the error.
Document Solution: Describe your chosen method of addressing the error. If this includes writing or modifying a script or code (e.g., a database query, a regular expression pattern, an R script, a mathematical formula, or a logical decision protocol), include that code (a sketch of such a documented step follows this table). Include not just what you will do but why: if you later need to modify your solution, inferring the intent of prior changes made to the data may only be possible if you have written it down.
Resolve: Apply your chosen response to the error.
Confirm: Check the data to confirm that your solution performed as you expected. If not, roll back the change and repeat the necessary prior steps. Did you:

  Mis-specify the problem? Return to Identify.
  Misunderstand the data? Return to Quantify.
  Choose the wrong solution? Return to Decide.
  Mis-implement the solution? Return to Document Solution.

Document Results: When you are satisfied with the resolution of a given error, document the results of the change if there is any relevant information not contained in your description of the problem and your chosen solution method.
Repeat: Repeat this process for each data error you identify.
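
Because the Document Solution step asks you to keep the code itself alongside the reasoning, here is a minimal sketch of one documented, confirmable cleaning step in Python with pandas; the phone-number variable and the ten-digit format rule are hypothetical:

    import pandas as pd

    # Identify / Quantify: a hypothetical phone variable entered in
    # mixed formats, some with punctuation and some as bare digits.
    df = pd.DataFrame({"phone": ["(555) 123-4567", "555-123-4568", "5551234569"]})
    mixed = ~df["phone"].str.fullmatch(r"\d{10}")
    print(f"{mixed.mean():.0%} of records deviate from the target format")

    # Decide / Document Solution: strip non-digit characters so the
    # variable follows one pattern; keep the raw values for roll-back.
    df["phone_raw"] = df["phone"]
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

    # Confirm: every cleaned value should now be exactly ten digits;
    # if not, restore phone_raw and revisit the prior steps.
    assert df["phone"].str.fullmatch(r"\d{10}").all()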


We stress repeated moments of documentation in the process outlined above for three reasons.

  1. First, documentation is the step most often skipped, since it doesn’t, by itself, get you any closer to using the data. As we stated at the beginning of this module, data cleaning is a practical task and documentation can seem like an academic exercise of describing rather than doing. However, in any iterative process of exploration like data cleaning, an investment in documentation early on will quickly pay off.
  2. Second, over-cleaning your data in a way that removes meaningful variation is a real risk¹. The only way to recognize and recover from over-cleaning, especially if you don’t discover it immediately, is to rely on thorough documentation of what you did and why.
  3. Third, at some point you will have to repeat the cleaning process on a new dataset, or a different person will contribute to your work and need to understand what you did, or you will have to pause the cleaning process and then pick up where you left off. In all of those scenarios, you will be glad to have the documentation. You will be able to proceed more quickly, with greater fidelity, and have some chance of improving your process with each iteration.


Footnotes


  1. Randall, Sean M., Anna M. Ferrante, James H. Boyd, and James B. Semmens. 2013. “The Effect of Data Cleaning on Record Linkage Quality.” BMC Medical Informatics and Decision Making 13 (64). http://www.biomedcentral.com/1472-6947/13/64.