Each term in the tables below is presented in summary form, followed by details, examples, and synonyms where available.
Cleaner language for cleaner data…
One aim of data cleaning is making the data more precisely descriptive, better able to communicate to multiple users. That effort can itself be very difficult to describe without defining some specialized terms. Terms in this table are bolded when they are used in the definition of another term in the table (with the exception of “data”, which occurs too frequently).
|data||information that has been recorded|
|error||anything wrong with data|
Details: anything that makes the data less true and useful than it could be. Data errors may or may not be intentional, avoidable, consequential, fixable, or detectable.
Synonyms: anomaly, dirtiness, data quality issue
|dataset||a particular collection of data|
Synonyms: data dump, data file(s), data store, data extract
|data system||anything outside of a dataset which creates or shapes the data|
Details: This includes the patients represented in the data as well as the technology used to store it, the individuals and devices that record and modify it, and the institutions that use it.
|entity||any general category of noun that you might wish to represent in a dataset|
Synonyms: object, table (in a well-designed relational database), node (in graph data)
|attribute||any adjective or descriptive quality of an entity that you might wish to represent in a dataset|
Details: In more generalized terms, and in some more recent types of database designed for greater flexibility, all information, including entities, is represented as attribute triplets (subject-predicate-object). This approach is common for large-volume, internet-accessible databases, but not (yet?) in most healthcare applications.
Synonyms: quality, feature, facet
|schema||the set of choices made about how to represent entities of interest in a database|
Details: what to include or exclude, what to group together or split apart, and how data elements should be named and related. The schema sets a limit on how accurately a database can represent the attributes and entities it is designed to record.
Synonyms: database schema, data scheme, database structure
|variable||a named data element used for storing the particular values of an attribute of an entity|
Synonyms: column (in a table), field
|table||a logical description of a collection of similarly structured data in which, generally, entities are represented by rows and attributes are represented by columns containing variables|
Details: A database also gives a table a name and other more or less explicit structural rules that control what can be entered in it and how that data will be formatted and stored.
|data element||an individual piece or unit of data|
Details: A “logical” element is a descriptive or abstract element, or type of element, that is or could be part of a database schema. A “practical” element is an actual part of a database. Entities and attributes are logical elements; tables and columns are one way of practically representing those elements in a database. Variables and their values are also practical elements. A logical element may describe the relevant features of something in the world, like a patient, more or less well. A practical element, like a string variable called “SURNAME” that can hold 20 characters, may represent a logical attribute like a name more or less well (e.g., the name may be longer than 20 characters or may contain accented characters that the variable cannot store).
|metadata||data about data, including relationships among data elements and relationships between data and the world outside the dataset|
|database||a software system for storing data and metadata in an organized way|
Details: A database in this sense may be either a single electronic file or a program for creating and manipulating electronic files and data.
Synonyms: DB, DBMS (Database Management System), RDBMS (Relational Database Management System, one particular kind of DBMS)
|relationship||the logical connections between data elements|
Details: Relationships can usually be translated into verb + preposition + article phrases like “belongs to a”, “has a”, “attended one or more of the”, “diagnosed each of”, “was diagnosed by this”, “was diagnosed as”, etc.
Synonyms: relation, tie (in graph data), edge (in graph data), dependency
|flat data||data for which different types of entities are stored in a single table|
Details: Storing data flatly often requires information to be repeated, as when patient encounter data repeats patient demographics on each row. This redundancy simplifies data storage, retrieval, and readability, but introduces the risk of inconsistency and complicates data cleaning and editing.
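As a minimal sketch (with hypothetical field names), plain Python is enough to show how repeated demographics in a flat table can drift into inconsistency:

```python
# Hypothetical flat encounter table: patient demographics repeat on every row.
flat_encounters = [
    {"visit_id": 1, "patient_id": 101, "surname": "Garcia", "dob": "1980-03-04"},
    {"visit_id": 2, "patient_id": 101, "surname": "Garcia", "dob": "1980-03-04"},
    {"visit_id": 3, "patient_id": 101, "surname": "Gracia", "dob": "1980-03-04"},  # typo on a later edit
]

def inconsistent_patients(rows):
    """Return patient_ids whose repeated demographics disagree across rows."""
    seen, bad = {}, set()
    for row in rows:
        demographics = (row["surname"], row["dob"])
        pid = row["patient_id"]
        if pid in seen and seen[pid] != demographics:
            bad.add(pid)
        seen.setdefault(pid, demographics)
    return bad

print(inconsistent_patients(flat_encounters))  # {101}
```

A normalized design would store the surname once, making this class of inconsistency structurally impossible.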
|relational data||data for which relationships are encoded in the structure of the database, and different types of entities (e.g., patients and visits) are stored in separate tables|
Details: Just because data are stored relationally does not mean they are cleaner or better defined than flat data. Relational storage only implies that a given dataset has less redundancy, and allows more consistent editing, than it would if it were stored in a single table.
|instance||a particular example of a data element|
Synonyms: case, token, record, row (in a relational table, a row is an instance of an entity; in a flat table, a row is an instance of one or more entities and one or more relationships), value (an instance of a variable)
|id field||a variable used to identify or refer to an instance of an entity in a dataset|
Synonyms: id, key, identifier
|unique id||a data value which is only associated with one instance of a given entity|
Synonyms: primary key, uid
|value||the particular data stored in an instance of a variable|
|data quality flag||a variable used to explicitly indicate the presence, suspicion, and/or type of an error in a given record or variable|
|intrinsic evidence||information about the presence or absence of an error that can be gleaned by using only the dataset containing the error|
|extrinsic evidence||information about the presence or absence of an error that can be gleaned only by looking outside of a given dataset|
Details: for example, by supplementing the dataset with additional subject-area knowledge, additional data collection, comparison to a different dataset or database, or information about the data creation process that is not captured in the available data or metadata
|normalize||to reduce the redundancy in a database by storing distinct types of entities in distinct tables and clearly defining the relationships between them|
Details: Normalization reduces the chances of storing contradictory information. It helps maintain the integrity of the data at the cost of requiring more forethought about the data’s schema and more complex queries to analyze and manipulate the data. Making flat data relational requires normalization, but relational data can often be further normalized. The desirability of normalization depends on the practical requirements of your data system and your intended analysis. The opposite of normalization is denormalization, or flattening.
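A minimal sketch of normalization, assuming hypothetical field names: a small flat table is split into a patients table (one row per patient) and a visits table related to it by patient_id.

```python
# Hypothetical flat data: demographics repeat on every visit row.
flat = [
    {"visit_id": 1, "patient_id": 101, "surname": "Garcia", "visit_date": "2013-01-05"},
    {"visit_id": 2, "patient_id": 101, "surname": "Garcia", "visit_date": "2013-02-10"},
    {"visit_id": 3, "patient_id": 102, "surname": "Okafor", "visit_date": "2013-02-11"},
]

patients = {}  # normalized: one entry per patient, keyed by patient_id
visits = []    # normalized: one entry per visit, related to patients by patient_id
for row in flat:
    patients.setdefault(row["patient_id"], {"surname": row["surname"]})
    visits.append({"visit_id": row["visit_id"],
                   "patient_id": row["patient_id"],
                   "visit_date": row["visit_date"]})

print(len(patients), len(visits))  # 2 3
```

Each surname now lives in exactly one place, so an edit cannot leave contradictory copies behind; the price is that reading a visit's surname now requires a join through patient_id.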
|data-type||the definition of the kinds of values a variable can store|
Details: Most databases define several data-types, and each variable must be assigned one. Data-type can affect how much information a given variable can store in each record, what kinds of operations can be performed on that data, and in what formats it will be displayed. As a result, inadvertent conversion or incorrect selection of data-type may be one source of error. For example, ICD-9-CM diagnosis codes look like decimal numbers, but they are actually 3- to 5-digit character strings in which leading and trailing zeros are meaningful. Conversion to a numeric data-type destroys the leading and trailing zeros and distorts the meaning of the data. Data-types may be more or less specific depending on the software in question. Common types include character strings, variable character strings, integers (of various sizes), fixed and floating point decimals, vectors, lists, matrices, factors, etc. On import and export, or when data-types are not explicitly defined, different software makes different default choices and assumptions about data-type.
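The ICD-9-CM example can be demonstrated directly; this sketch assumes only that the codes arrive as strings:

```python
# ICD-9-CM diagnosis codes are character strings, not numbers.
code = "250.50"              # the trailing zero is part of the code
as_number = float(code)      # inadvertent numeric conversion
print(str(as_number))        # '250.5': the trailing zero is destroyed
print(str(float("038.9")))   # '38.9': the leading zero is destroyed too
```

Round-tripping through a numeric type silently merges distinct codes, so the safe default for code-like variables is a string data-type.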
|wide / narrow||a table is “wide” when it has a large number of variables or columns; a variable is “wide” when its data-type allows for the storage of much more information than the variable requires. “Narrow” tables store few attributes in many rows.|
Details: Some software works more easily with either narrow or wide data. In most traditional databases, wide data require more memory and time to process. Practically, such tables are often also hard to read because it is physically difficult to see all of a record at one time. Flat and other less normalized tables are often wider than their more normalized counterparts.
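For illustration (with hypothetical measure names), reshaping one wide row into narrow attribute-value rows looks like this:

```python
# One wide row: each measurement is its own column.
wide_row = {"patient_id": 101, "sbp": 120, "dbp": 80, "pulse": 72}

# The narrow equivalent: few columns, one row per measurement.
narrow = [
    {"patient_id": wide_row["patient_id"], "measure": k, "value": v}
    for k, v in wide_row.items()
    if k != "patient_id"
]
print(len(narrow))  # 3
```

The narrow form trades columns for rows: new measures need no schema change, at the cost of more rows and a pivot step before analysis.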
With some basic data and metadata terminology defined, we can describe what we mean by “clean” data.
Clean data display the qualities of…
|Data Cleanliness Typology|
|Accountability||The data creation path can be traced.|
Details: You know who is responsible for creating and maintaining the data and whom to contact if you have questions. In practice, this also implies that you trust, until you have evidence to the contrary, that the data were created free from intentional or easily avoidable bias. This trust in the good faith of the data production process also implies that you have a mental model of how the data are created, one sufficiently detailed to let you imagine how errors may have been unintentionally introduced and therefore how their impact may be mitigated without introducing new errors.
Example: Emergency Department patient intake data is entered into a different database than inpatient data by frontline clerical staff at 3 sites. Staff at only one of these sites is regularly fluent in Spanish. Clinical staff may revise demographic details, but they rarely do so since these changes get overwritten by the front desk system on subsequent visits. The ED has its own database administrator. You have her email. Data are later reviewed in the billing and coding department.
|Reliability||The data are consistently generated in a robust way.|
Details: Unstated units, definitions, and processes are stable from one received dataset to the next. Changes in layout or logical structure are minimal, or at least their timing and scope of impact are documented.
Example: In 2012, patient gender was filled in by clerical staff as a text field. Starting in June 2013, gender was coded as multiple choice from a patient-filled form, leading to some null and “other” values. Your contact told you about the change when you got a new extract of the data.
|Completeness||The dataset includes everything it should.|
Details: All relevant, real people, places, and things and their attributes are included in the data.
Example: A new emergency room was opened in mid-December 2013. The two weeks of data are nonetheless included in your 2013 dataset. Empty fields represent inapplicable, unknown, or unknowable information, not errors of omission or lost data.
|Reality||The dataset excludes anything fake.|
Details: No fake (dummy, test, synthetic, placeholder, or erroneously generated) people, places, things, or attributes are included in the dataset. No data refer to something that did not happen or does not exist.
Example: Before the new emergency room opened in mid-December 2013, newly hired staff were trained in data entry using the live system, in some cases making updates to real patient records, in others entering test data. All of these test records were scrubbed and the changes were rolled back before you received the data.
|Veracity||The information in the dataset is true.|
Details: Data measure what they are intended to measure with negligible error. Measures are equal to or close to their true value. Veracity also covers correct spelling and use of the correct terminology.
Example: Patients who are listed as homeless are in fact homeless at the time of intake (not merely unconscious, without identification, or “looking homeless”).
|Uniqueness||Unique entities in the data occur only once.|
Details: No unconsolidated duplicate people, places, things, or events are included in the data under different “unique” identifier values. Each unique object occurs in only one record.
Example: Occasionally, patients concerned about their privacy and wary of the law enforcement personnel often present in the emergency room give nicknames, resulting in the creation of new patient records. After probabilistic matching, most of these duplicates are consolidated.
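Real record linkage is probabilistic and considerably more involved; the minimal sketch below (hypothetical field names) only flags exact duplicates after normalizing case and whitespace, and deliberately shows the kind of nickname such a naive approach misses:

```python
# Minimal duplicate-detection sketch; real deduplication needs fuzzy matching.
records = [
    {"id": 1, "name": "Robert Jones ", "dob": "1975-07-01"},
    {"id": 2, "name": "robert jones",  "dob": "1975-07-01"},
    {"id": 3, "name": "Bobby Jones",   "dob": "1975-07-01"},  # nickname: not caught here
]

def duplicate_groups(rows):
    """Group record ids that share a normalized (name, dob) key."""
    groups = {}
    for r in rows:
        key = (" ".join(r["name"].lower().split()), r["dob"])
        groups.setdefault(key, []).append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

print(duplicate_groups(records))  # [[1, 2]]
```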
|Logical Consistency||The data do not contradict each other.|
Details: Where values or variables overlap, depend on, or necessarily imply each other in what they refer to or describe, they don’t contradict. The measures add up.
Example: Patients listed as giving birth in the Emergency Department are also listed as female and not 80 years old. Patients listed as Dead On Arrival do not also have subsequent visits to the emergency room.
|Possibility||No variables have values that are impossible.|
Details: The values of variables do not contradict known rules not otherwise captured in the database.
Example: No patients are 200 years old.
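Both Logical Consistency and Possibility lend themselves to explicit rule checks. A sketch, with hypothetical field names and thresholds:

```python
# Hypothetical rule checks; each returned flag names a suspected error.
def check(record):
    flags = []
    if record.get("age", 0) > 130:                    # Possibility: impossible value
        flags.append("impossible_age")
    if record.get("gave_birth") and record.get("sex") != "F":
        flags.append("birth_sex_conflict")            # Logical Consistency
    if record.get("disposition") == "DOA" and record.get("later_visits", 0) > 0:
        flags.append("doa_with_later_visits")         # Logical Consistency
    return flags

print(check({"age": 200, "sex": "M", "gave_birth": True}))
# ['impossible_age', 'birth_sex_conflict']
```

Returning named flags rather than deleting records keeps the evidence of the error available for the documentation steps described later in this module.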
|Precision||Values have as much detail as is available.|
Details: Values have enough detail for your analysis, or at least as much as was originally entered. They have not been truncated, rounded, or abbreviated. A given variable does not contain more or less information than it should.
Example: Patient names are separated into distinct variables for first name, last name part 1, last name part 2, middle name, and name suffix. Admissions have both date and time. Time is recorded in 24-hour format or includes an AM/PM designation.
|Format Consistency||Variables of the same type follow the same pattern.|
Details: Variables expressing the same type of values, or measured in the same or comparable units, should follow the same pattern in all records.
Example: Financial values are all positive numbers. Social Security numbers are all listed as 9-character text strings with no dashes. If different hospitals use different classifications of race and ethnicity, the definitions have been harmonized to the extent possible without obscuring real underlying distinctions.
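The Social Security number pattern can be enforced with a small normalization function; the 9-digit, no-dash target format is taken from the example above:

```python
import re

def normalize_ssn(raw):
    """Reduce an SSN to a 9-character digit string; None means 'flag for review'."""
    digits = re.sub(r"\D", "", str(raw))
    return digits if len(digits) == 9 else None

print(normalize_ssn("123-45-6789"))  # '123456789'
print(normalize_ssn("12345"))        # None (too short to repair silently)
```

Returning None for unrepairable values, instead of guessing, leaves the decision to the Decide step of the cleaning workflow.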
|Format Descriptiveness||It is clear what each variable describes.|
Details: The data, documentation, variable name, and/or database structure make it clear what is being described by a given variable.
Example: Variables have human-readable names like “surname” and “diagnosis1”. A variable received as “DD” has been renamed “discharge_disposition”, and the codes “1” through “9”, “x”, and “V” it contains are all clearly defined in a linked lookup table within your database. Any undecidable ambiguity remaining in these definitions is itself noted in the lookup table. If definitions are known to have changed over time, these changes are noted in the lookup table with dates, and the variable has been recoded to reduce confusion as much as possible.
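A linked lookup table can be sketched as a simple mapping; the code meanings below are invented for illustration, standing in for whatever the real data dictionary defines:

```python
# Hypothetical lookup table for the renamed "discharge_disposition" variable.
discharge_disposition_codes = {
    "1": "discharged home",
    "2": "transferred to another facility",
    "x": "left against medical advice",
    "V": "meaning ambiguous; see note in data dictionary",
}

def describe_disposition(code):
    """Translate a raw code, making undocumented codes visible rather than silent."""
    return discharge_disposition_codes.get(code, "undocumented code")

print(describe_disposition("1"))   # 'discharged home'
print(describe_disposition("7"))   # 'undocumented code'
```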
No real dataset is completely clean. Even if it were, we would never know it. Clean data qualities like Reality, Uniqueness, and Veracity are mostly susceptible to extrinsic evidence, and we can never exhaust the sources of external (and often inaccessible) evidence which could have the power to undermine the cleanliness of the data. On the other hand, this very real limit to what we can know about the data does not prevent us from productively interrogating a dataset for errors.
Before you begin interrogating the data, a sympathetic understanding of how errors are generated can help you anticipate and make sense of the errors you will discover.
Dirty data is created by…
|Sources of Data Error|
|technical||Malfunction||The data system didn’t work as designed. The technology failed.|
Example: computer crashes, input device errors, computer memory errors, data transmission errors
|System Limitation||The design or standard operating procedure of the data production system distorted the data.|
Example: the data storage system inserts default values where an attribute should be left blank; measurement units are unspecified though multiple options are plausible (days? hours? dollars? cents?); data format is not restricted (both “12/31/2008” and “31/12/2008” are allowed, so what does “12/12/2008” mean?); data format is inappropriately restricted (the “given_name” field allows 18 characters, but the 21-character “Anandavenkatakrishnan” is a real name); lack of transaction control (simultaneous edits of patient records by multiple users are poorly resolved by the database); data model errors (the database was designed without the concept of patients “under observation” in mind, so instead they show up as being discharged and readmitted one minute later, but lacking some of the data that would normally attend a new admission); inconsistent timestamps for events crossing calendar boundaries (some events are recorded based on when data entry begins, others when it ends, which causes ambiguity for events entered around midnight, January 1st, or the end of a month or quarter)
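The unrestricted date format problem is easy to demonstrate: a string like "05/06/2008" parses successfully under both the month-first and day-first conventions, yielding two different dates, and nothing in the string itself can disambiguate them.

```python
from datetime import datetime

s = "05/06/2008"                                 # June 5th or May 6th?
month_first = datetime.strptime(s, "%m/%d/%Y")   # May 6, 2008
day_first = datetime.strptime(s, "%d/%m/%Y")     # June 5, 2008
print(month_first == day_first)                  # False: one string, two readings
```

Only extrinsic evidence about the data production system (documentation, or values like “31/12/2008” that rule one convention out) can resolve the ambiguity.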
|clerical||Input Mistake||Data entry created errors.|
Example: typos, misspellings, selecting the option above or below the intended one in a menu, transposition of data fields, skipping data fields so that input shifts into the wrong variables
|Editing Mistake||Data modification created errors.|
Example: incompatible file format conversion during data import or export (you receive some data in Microsoft Excel 2003 format, which discards rows after the 65,536th; comma-delimited data include commas that are not properly escaped, leading to incorrect assignment of values to variables; DBF data treats some columns as fixed-width and truncates them); editing the wrong field; an incorrect query used to extract your dataset, omitting records or joining the wrong database tables together
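The unescaped-comma problem can be reproduced with Python’s csv module; the clinical text is invented for illustration:

```python
import csv
import io

# An unescaped comma splits one value across two columns...
bad = io.StringIO("id,complaint\n1,chest pain, radiating\n")
print(list(csv.reader(bad))[1])   # ['1', 'chest pain', ' radiating']

# ...while proper quoting keeps the value intact.
good = io.StringIO('id,complaint\n1,"chest pain, radiating"\n')
print(list(csv.reader(good))[1])  # ['1', 'chest pain, radiating']
```

A row with more fields than the header is a cheap intrinsic signal that quoting went wrong somewhere upstream.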
|semantic||Manipulation||Data were intentionally distorted by a participant in the data production process.|
Example: a data provider did not fully fulfill the data sharing agreement, withholding data; data entry personnel use a field to record information it was not well designed to hold, supplementing the existing data schema with unanticipated information, as when “Unresponsive” is entered into the name field; occasionally, patients concerned about their privacy and wary of the law enforcement personnel often present in the emergency room give false names and addresses
|Misperception||A participant in the data production process misread, misheard, or otherwise misperceived the information to be recorded.|
Example: misread handwriting in a paper chart; a misheard accent; an unconscious patient who would self-identify as non-Hispanic was assumed to be Hispanic
|Misunderstanding||Participants in the data production process did not know (or know how) to correctly fulfill the intent of their data production role.|
Example: misdiagnosis; poorly trained intake or coding staff; inconsistent definitions of variables used by different personnel; misinterpretation of data recording protocols in ambiguous situations; inconsistencies in data protocol documentation leading to incorrect interpretation of procedures
With a sense of the variety, origin, and inevitability of data errors, you can begin to clean up errors without an unreasonable expectation of data perfection. In its ideal form, the process of data cleaning follows the same steps for each error.
Data cleaning cycles through these steps…
|Data Cleaning Workflow|
|Identify||What is the data error you see? Is it really an error at all?|
|Quantify||How common and extensive is the error? Is it pervasive enough (in the entire dataset or among a critical subset of records) to warrant remediation? Whether or not the error pervades the dataset, is it easy enough or even possible to fix with a reasonable amount of time and effort?|
|Document Error||Characterize the error, its suspected causes, and its prevalence.|
|Decide||Choose an appropriate response to the error. Sometimes the response will be to simply document the error.|
|Document Solution||Describe your chosen method of addressing the error. If this includes writing or modifying a script or code (e.g. a database query, regular expression pattern, R script, mathematical formula, a logical decision protocol, etc.), include that code. Include not just what you will do but why. If you later need to modify your solution, inferring the intent of prior changes made to the data may only be possible if you have written it down.|
|Resolve||Apply your chosen response to the error.|
|Confirm||Check the data to confirm that your solution performed as you expected. If not, roll back the change and repeat the necessary prior steps. Did you:
- Mis-specify the problem? Return to Identify.
- Misunderstand the data? Return to Quantify.
- Choose the wrong solution? Return to Decide.
- Mis-implement the solution? Return to Document Solution.|
|Document Results||When you are satisfied with the resolution of a given error, document the results of the change if there is any relevant information not contained in your description of the problem and your chosen solution method.|
|Repeat||Repeat this process for each data error you identify.|
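In code, one pass through the cycle might look like the following sketch, with invented records and an impossible-age rule standing in for a real error definition:

```python
# Identify and Quantify: find records violating a possibility rule (age > 130).
records = [{"id": 1, "age": 42}, {"id": 2, "age": 200}, {"id": 3, "age": 67}]
bad = [r for r in records if r["age"] > 130]

# Document Error: record what was found and how prevalent it is.
log = {"error": "impossible age (>130)", "n": len(bad),
       "prevalence": len(bad) / len(records)}

# Decide and Resolve: flag rather than delete, so the decision is reversible.
for r in records:
    r["age_flag"] = "impossible" if r["age"] > 130 else None

# Confirm: the flags match the documented count.
assert sum(r["age_flag"] == "impossible" for r in records) == log["n"]
print(log["n"], round(log["prevalence"], 2))  # 1 0.33
```

Keeping the log dictionary alongside the flags captures the Document Error and Document Solution steps in a form that survives the next iteration.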
We stress repeated moments of documentation in the process outlined above for three reasons.
- First, documentation is the step most often skipped, since it doesn’t, by itself, get you any closer to using the data. As we stated at the beginning of this module, data cleaning is a practical task and documentation can seem like an academic exercise of describing rather than doing. However, in any iterative process of exploration like data cleaning, an investment in documentation early on will quickly pay off.
- Second, over-cleaning your data in a way that removes meaningful variation is a real risk. The only way to recognize and recover from over-cleaning, especially if you don’t discover it immediately, is to rely on thorough documentation of what you did and why.
- Third, at some point you will have to repeat the cleaning process on a new dataset, or a different person will contribute to your work and need to understand what you did, or you will have to pause the cleaning process and then pick up where you left off. In all of those scenarios, you will be glad to have the documentation. You will be able to proceed more quickly, with greater fidelity, and have some chance of improving your process with each iteration.