3.2 Data Cleaning Techniques

You can identify data errors by asking…

How and where should you begin to identify errors? It can be tempting to simply begin correcting errors as you find them, but such an ad hoc approach will be impossible to apply broadly and consistently. Further, correcting some errors makes others easier to find or fix. To support that beneficial cascade of data improvement, we have arranged the following steps roughly in an “order of operations” that has worked well for us. Keep in mind, however, that each dataset will have its own unique challenges and the exact order of steps taken should vary with the requirements of the data. The important point is to understand how each cleaning step may enable or limit those that follow and to proceed through them deliberately.

Accountability, Reliability, and Completeness (of entities and attributes) represent the basic conditions for proceeding with any analysis of the data, essentially asking, “Do you trust the data?” “Did you get the data you expected?” and “Is there enough material there to work with?” If you have enough data, it makes sense to proceed by clarifying your understanding of the variables and codes in the database and noting any unresolvable ambiguities (Format Descriptiveness), improving Format Consistency, and then parsing variables into their appropriate constituent parts (Precision). In practice, you may approach consistency and precision through rounds of successive alteration, since some reformatting is more easily done after complex fields are split into pieces and some splitting is more easily done after formats have been standardized. At this point, address standardization, a component of geocoding, can be used to both parse and standardize address data.

Once formats are standardized you can more comprehensively remove duplicate data and geocode addresses to spatial coordinates (Entity Uniqueness). Before addressing the uniqueness of patient entities through probabilistic record linkage, however, it is best to remove, flag, or correct intrinsically impossible values (Intrinsic Possibility). You can also make a first pass at identifying and establishing rules to resolve contradictory values (Logical Consistency). Just as with precision and format consistency, uniqueness and logical consistency have an iterative, reciprocal relationship in which adjustment to one calls for reevaluation of the other. Probabilistic linkage reveals newly knowable information about individuals, which, in turn, creates new risks of inconsistency.

Lastly, you may try to address the most elusive qualities of the data, Veracity and Extrinsic Completeness: “Is it true?” and “Is anything missing?”

If we break the above description into discrete questions, each designed to address particular desirable qualities of the data, it looks something like this:

Specific questions for specific data qualities:

1. Accountability

In the broadest sense, accountability must be assumed at the outset of data cleaning. The assumption of propriety and good faith on the part of all the various data producers should only really be revisited if process or bias issues arise as the result of analysis.

1.1    Documentation? 

Do you have or can you get any documentation about who or what processes are responsible for the data? Who are the appropriate contacts if you have questions about the data creation process that the contact who delivered the data cannot answer?

1.1.1     Do you have or can you get a data dictionary defining the variables in the dataset?

1.2      Do you know the vintage of the dataset and its components?

1.3     If, when, and how are the data cleaned or verified by the providing institutions?

1.3.1    Are there any known errors or limitations of the dataset?

1.3.2      When, how, and by whom are the data ever edited or updated? Is the fact or time of editing evident in the data, or is it invisible? Are the data provided in their final form, or could they continue to change?

2. Reliability

Reliability becomes much more relevant on the second and subsequent installments of data extracts from the same source, when there is a previous delivery to compare against.

2.1    Are the data consistent with what you received in past installments?

2.1.1    Have file formats or table structures changed?

2.1.2    Are variables missing or are new variables included?

2.1.3    Have variable names changed or is the order of the variables different?

2.1.4    Is the number of records in line with previous data and expected trends?

2.1.5    Have variable formats, data types, or data formats changed?

2.2       If the data are unreliable, can you explain why?

If the data prove to be unreliable, correct the problem, work around it, or request a corrected data extract from the provider.
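
A minimal sketch of the checks in 2.1.1 through 2.1.5, assuming the extract arrives as a delimited text file and that you work in Python with pandas; the file name, expected column list, and expected record-count range are illustrative assumptions, not part of any real dataset:

```python
import pandas as pd

# Columns and an approximate record-count range based on past installments
# (illustrative values; substitute what you have actually received before).
EXPECTED_COLUMNS = ["patient_id", "visit_date", "zip_code", "chief_complaint"]
EXPECTED_ROWS = (9_000, 11_000)

new_extract = pd.read_csv("ed_visits_2024Q2.csv")

# 2.1.2 / 2.1.3: have variables disappeared, appeared, or been renamed?
missing_cols = set(EXPECTED_COLUMNS) - set(new_extract.columns)
extra_cols = set(new_extract.columns) - set(EXPECTED_COLUMNS)
print("Missing columns:", missing_cols)
print("New columns:", extra_cols)

# 2.1.4: is the number of records in line with previous data?
n = len(new_extract)
if not (EXPECTED_ROWS[0] <= n <= EXPECTED_ROWS[1]):
    print(f"Record count {n} falls outside the expected range {EXPECTED_ROWS}")

# 2.1.5: have data types drifted (e.g., dates now arriving as plain text)?
print(new_extract.dtypes)
```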

3. Basic Entity Completeness (“Coverage”)

3.1      Are all of the types of entities you need included?

3.2      Is the count of each entity type close to what you would expect?

3.2.1    If the dataset seems incomplete or unreliable, or this is your first time receiving it, can you spot-check a known sub-type of patients, or specific people, events, or places, to see that they are at least included?

4. Intrinsic Attribute Completeness

4.1      Are all of the variables included that should be?

4.2      Are the variables sufficiently populated to permit analysis?

4.3      Are missing values patterned in a way that suggests an omission in the data extract rather than a byproduct of the data creation process?
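
One way to approach 4.2 and 4.3 is to profile how well each variable is populated and whether missingness clusters in suspicious ways, for example by month of visit. A minimal sketch with pandas; the file and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("ed_visits.csv")

# 4.2: what share of each variable is populated?
populated = 1 - df.isna().mean()
print(populated.sort_values())

# 4.3: are missing values patterned? For example, does missingness in one
# field cluster in particular months, suggesting a problem with the extract
# rather than ordinary gaps in data entry?
visit_month = pd.to_datetime(df["visit_date"], errors="coerce").dt.to_period("M")
missing_by_month = df["zip_code"].isna().groupby(visit_month).mean()
print(missing_by_month)
```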

5. Formatting Descriptiveness

5.1      For any variable represented by codes, do you have definitions for the codes? Do the definitions make sense? Are all codes in the dataset defined, and are all relevant definitions utilized in the data? (One way to check this in code is sketched after this list.)

5.2      Are units of measurement, particularly for times and durations, explicit? If not, can they be inferred from context?

5.3      Do variable and table names clearly describe what they contain?
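
For 5.1, if you have a machine-readable data dictionary you can compare the codes in the data against it directly. A sketch with pandas; the file names and column names are hypothetical:

```python
import pandas as pd

visits = pd.read_csv("ed_visits.csv", dtype=str)
codebook = pd.read_csv("disposition_codes.csv", dtype=str)  # assumed data dictionary

# 5.1: codes that appear in the data but are never defined in the dictionary,
# and definitions that are never used in the data.
codes_in_data = set(visits["disposition_code"].dropna())
codes_defined = set(codebook["code"])

print("Undefined codes:", codes_in_data - codes_defined)
print("Unused definitions:", codes_defined - codes_in_data)
```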

6. Formatting Consistency

6.1      Are values for a given variable consistently formatted?

6.2      Are missing values consistently formatted? Can inconsistency in their formatting be interpreted as meaningful?

Missing values are particularly important for several reasons, especially because their meaning is not always obvious and different software will format and treat them in different, often incompatible ways. In some cases missing values are skipped; in others they prevent a query or formula from working correctly. Improperly formatted missing values may be incorrectly treated as zeroes in other circumstances. Different types of missing data can also have different meanings: “Pt states no address” in the address field means the patient is homeless, while “Pt refuses” or a blank field should both be treated as uninterpretable missing data.

6.2.1 Types of NULL values

| “Null” values | Comments | Compatibility |
| --- | --- | --- |
| 0 | Indistinguishable from a true zero value | none |
| NULL (a truly empty field) | Hard to distinguish values that are missing from those overlooked on entry. Hard to visually distinguish blanks from spaces, which behave differently. In R and SQL it is supposed to be reserved for undefinable or undetermined/unknown values. | R, Python, SQL, PHP |
| -999, 999, etc. | Commonly used in survey coding, and sometimes in other manual data entry. Not recognized as null by many programs without explicit user intervention to define it as such. Can be inadvertently used as a number in calculations. | SPSS |
| “NA”, “na” | Can also be an abbreviation (e.g., North America); can cause problems with data type (turn a numerical column into a text column). “NA” is more commonly recognized than “na”. In R it explicitly codes for a missing value. | R, SAS |
| “N/A” | An alternate form of “NA”, but often not compatible with software since “/” can have special meaning | |
| “None” | Uncommon. Can cause problems with data type | Python |
| “No data” | Uncommon. Can cause problems with data type; contains a space | |
| “Missing”, “UNK”, “Unknown”, etc. | Uncommon. Can cause problems with data type | |
| “,”, “+” | Uncommon. Can cause problems with data type | |
| “.”, “._”, “.A” to “.Z” | Can cause problems with data type | SAS |
| “undef” | Uncommon | Perl |
| “nil” | Uncommon | Ruby |
| “null” | Uncommon | JavaScript |
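
Most analysis software lets you declare which of these stand-ins should be read as missing when the data are loaded. A sketch in Python with pandas; the file name and the particular list of null codes are assumptions, so match the list to what actually appears in your data, and recode any meaningful missing values (such as “Pt states no address”) to an explicit category before treating them as null:

```python
import pandas as pd

# Strings this (hypothetical) extract uses to mean "missing".
NULL_CODES = ["", "NA", "na", "N/A", "None", "No data", "Missing",
              "UNK", "Unknown", "-999", "999", "."]

df = pd.read_csv(
    "ed_visits.csv",
    na_values=NULL_CODES,
    keep_default_na=False,  # treat only the codes we list as missing
)

# Check what survived: any leftover sentinel values suggest a code we missed.
print(df["zip_code"].value_counts(dropna=False).head(20))
```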

6.3     Are values formatted so that your chosen software will correctly interpret them? This can be a particular problem with dates and times. By historical accident and design fiat, different operating systems and software packages often have different ways of storing time data. Translation between date encodings is not always automatic and seamless.
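
For example, a date written as 03/04/2021 may be read as March 4 or April 3 depending on locale, and Excel stores dates as serial day counts. Parsing with an explicit format, rather than relying on automatic detection, keeps misreads from slipping through silently. A sketch with pandas; the column names and formats are assumptions:

```python
import pandas as pd

# Read everything as text first so nothing is silently coerced.
df = pd.read_csv("ed_visits.csv", dtype=str)

# Parse with an explicit format so ambiguous day/month orderings fail loudly
# (becoming NaT) instead of being silently guessed.
df["visit_date_parsed"] = pd.to_datetime(
    df["visit_date"], format="%m/%d/%Y", errors="coerce"
)

# If a column arrived as Excel serial day counts, convert it explicitly
# (Excel's day zero is 1899-12-30).
df["followup_date_parsed"] = pd.to_datetime(
    pd.to_numeric(df["followup_date"], errors="coerce"),
    unit="D",
    origin="1899-12-30",
)

# Review the raw strings that failed to parse.
print(df.loc[df["visit_date_parsed"].isna(), "visit_date"].drop_duplicates().head(20))
```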

7. Precision

7.1      Do the formats and their definitions tell you enough to distinguish important differences?

7.1.1    Are the values granular enough for your needs?

7.1.1.1  If the granularity is wrong, is this a limitation of the original data or of this particular dataset?

7.2      Do any variables combine attributes that should be parsed into separate variables? (See the parsing example after this list.)

7.3      Are any attributes truncated by their variables? If a variable is too narrow or of the wrong format, attribute values can be incorrectly trimmed.
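
A minimal sketch of splitting a combined field into its constituent parts (question 7.2), assuming a single name field that holds “LAST, FIRST MIDDLE”; the column names and layout are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"patient_name": ["GARCIA, MARIA L", "CHEN, WEI", "O'BRIEN, PATRICK J"]})

# Split "LAST, FIRST MIDDLE" into separate variables.
parts = df["patient_name"].str.split(",", n=1, expand=True)
df["name_last"] = parts[0].str.strip()

first_middle = parts[1].str.strip().str.split(" ", n=1, expand=True)
df["name_first"] = first_middle[0]
df["name_middle"] = first_middle[1]  # NaN when there is no middle name

print(df)
```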

8. Entity Uniqueness

8.1      Does the dataset contain exact duplicate records that can be removed?

8.2      Does the dataset contain approximate duplicates that need to be flagged for review and/or merged? (See the deduplication sketch after this list.)

8.3      Probabilistic linkage can be applied at this point to address approximate duplicate records that refer to the same underlying person entity.
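
A sketch of the first two steps with pandas: exact duplicates removed (8.1) and approximate duplicates flagged on a loose key for human review (8.2). The key columns are assumptions, and probabilistic linkage itself is the subject of section 3.3:

```python
import pandas as pd

df = pd.read_csv("ed_visits.csv", dtype=str)

# 8.1: drop rows that are identical in every field.
df = df.drop_duplicates()

# 8.2: flag rows that agree on a loose key (e.g., date of birth plus last
# name) but are not exact duplicates, so a person can review them.
key = ["dob", "name_last"]
df["possible_duplicate"] = df.duplicated(subset=key, keep=False)
print(df["possible_duplicate"].sum(), "records flagged for review")
```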

9. Intrinsic Possibility

By “intrinsic” possibility, we mean to indicate values that can be seen to be impossible without further knowledge of the data creation process or the subject area. Visits occurring in the future are intrinsic impossibilities. Visits occurring on a day that the emergency room was shut down by a water main break are extrinsic impossibilities that can only be detected with additional knowledge.

9.1      Does the data contain impossible values that can be eliminated or need to be excluded from analysis?
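
A sketch of checks for intrinsically impossible values, assuming hypothetical visit-date and age columns; rather than deleting rows outright, flag them so the decision remains visible downstream:

```python
import pandas as pd

df = pd.read_csv("ed_visits.csv", parse_dates=["visit_date"])
age = pd.to_numeric(df["age"], errors="coerce")

today = pd.Timestamp.today().normalize()

# Flag, rather than silently delete, values that cannot be true on their face:
# visits recorded in the future and ages outside a plausible human range.
df["impossible_visit_date"] = df["visit_date"] > today
df["impossible_age"] = (age < 0) | (age > 120)

print(df[["impossible_visit_date", "impossible_age"]].sum())
```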

10. Logical Consistency

10.1      Does the data contain values in different variables or different records related to the same entity that contradict each other?

10.2      Can you determine a consistent, defensible rule for deciding in favor of one particular contradictory value and correcting the other(s)? If so, you may choose to make the decision, but be sure to flag the results so that anyone using the data can see that a decision was made.
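
For example, a discharge timestamp earlier than the corresponding admission timestamp is a contradiction within a single record. A sketch of detecting it and applying a rule-based correction while keeping a flag; the column names and the particular rule are assumptions:

```python
import pandas as pd

df = pd.read_csv("ed_visits.csv", parse_dates=["admit_time", "discharge_time"])

# 10.1: flag records whose discharge precedes admission.
contradiction = df["discharge_time"] < df["admit_time"]

# 10.2: if the defensible rule is "trust the admission time and blank the
# discharge time", apply it, but keep a flag so the decision stays visible.
df["discharge_time_corrected"] = contradiction
df.loc[contradiction, "discharge_time"] = pd.NaT

print(contradiction.sum(), "records corrected and flagged")
```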

11. Veracity

Often, though this is ultimately the crux of good data, veracity is undecidable. If the number of cases is small, perhaps veracity can be confirmed with the patients or staff themselves, but this situation is rare. Occasionally we can determine that some unknown subset of the data must be false, though we cannot determine what the true values should be. If too many patients have a particular value for some attribute (an improbable number of admissions on January 1st), then some unknown set of these must be false, but we don’t, and possibly can’t, know which records are false.

11.1      Is the data “true”? Does it capture what actually happened? Is what actually happened and got recorded a fair representation of the patient?

12. Extrinsic Reality and Completeness

Knowing if the data covers all relevant real information and excludes fake information generally requires the most external knowledge of any data cleaning step and is impossible to empirically determine. If evidence of omissions and over-inclusions piles up, this may lead you to investigate the accountability and reliability of the data system.

12.1  For Attributes:

12.1.1 Are the values in the data the result of real events and attributes rather than byproducts of, or attempts to work around, the limitations of the data collection system? Are they artifacts of a data export or conversion process? Are they leftover dummy data from a system test or staff training, etc.?

12.1.2 Are all of the relevant data in the data set or is some subset missing?

12.2  For Entities:

12.2.1 Are the records in the data the result of real events, patients, and places, rather than byproducts of, or attempts to work around, the limitations of the data collection system? Are they artifacts of a data export or conversion process? Are they leftover dummy data from a system test or staff training, etc.?

12.2.2 Are all of the relevant patients, events, and places in the dataset, or is some subset missing?

You can answer data cleaning questions by…

The general approaches to detecting data errors can be broadly grouped into two categories: descriptive techniques that we’ll call “profiling” and more focused techniques, searching for errors of a known type, that we’ll call “testing.” Profiling techniques assume that all data sets contain regular patterns at some scale, so they look for irregularities, which may or may not be errors. Testing techniques search for errors identified from external data or past experience. These could also be called “exploratory” and “experimental” techniques, since in the former we don’t know what we’ll find, and in the latter we approach the data with a specific hypothesis. We avoid those labels only because they grant the cleaning process a bit more rigour than it warrants.
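
In code, the distinction might look like this (a sketch with pandas; the file and column names are placeholders): profiling summarizes the data and leaves it to you to judge whether an irregularity is an error, while testing asserts a specific expectation.

```python
import pandas as pd

df = pd.read_csv("ed_visits.csv", parse_dates=["visit_date"])

# Profiling: summarize and look for irregularities, which may or may not be errors.
print(df["visit_date"].dt.year.value_counts().sort_index())  # unexpected gaps or spikes?
print(df["chief_complaint"].value_counts().head(10))         # implausibly common values?

# Testing: check for a specific, known failure mode.
future_visits = df["visit_date"].dropna() > pd.Timestamp.today()
assert not future_visits.any(), "visits recorded in the future"
```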

Any automated process of data cleaning (like probabilistic linkage) is an experimental, testing approach that already has its hypothesis baked in. To know if the tool and its unstated hypothesis will be helpful you need to understand what that hypothesis is and if it fits with the types of error to which your data are most susceptible.

Next: 3.3 Record Linkage
