What is linkage?
Record linkage is the process of identifying data records that refer to the same person, object, or event across data sources.
Record linkage has many practical applications; it is used most frequently to create master files for populations or to extend record-level information in one data set with information from another. In hotspotting, record linkage is used to link patient information across different hospitals so that ED and inpatient use can be analyzed.
Modern computer-based record linkage methods fall into two categories, deterministic linkage and probabilistic linkage.
Deterministic linkage identifies a match between records by comparing unique identifiers or using an exact comparison between fields. In some software programs, deterministic linkage may also employ almost-exact comparisons based on predetermined rules defined by users. Deterministic linkage is not the best option if your data has quality issues, as all but the most trivial datasets do.
Probabilistic linkage calculates a total score for two records to determine how likely it is that both refer to the same individual. The total score is the sum of scores generated by the comparison of individually weighted fields. Based on the total score, record pairs will fall above or below predetermined thresholds and will be identified as matches, non-matches, or uncertain matches which should get closer scrutiny before determining whether they should be considered a match or not. (...read more about probabilistic linkage)
For each pair compared, two conditional probabilities are computed. The ratio of these two conditional probabilities, R, is the odds ratio demonstrated in the figure below.
m is the probability of agreement for a given comparison. (Because all matching variables are subject to some type of error the m probability is less than 1).
u is the probability that two records not belonging to the same individual agree purely by chance.
Linkage software packages (like those listed at the end of this section) can implement and embellish the basic Felligi-Sunter model in various ways. For example they may allow you to adjust the relative importance (or “weight”) of each field in the calculation of the final match score for each record pair so that fields often known to be less reliable, such as middle names, have a minor contribution to the match. This type of tailored weighting scheme can be as complex as the software will allow.
One useful feature of some software is the ability to have different weights in the case of agreement and disagreement so that home addresses that agree enhance the match more strongly than addresses that disagree are allowed to weigh against it. Even more flexibility is afforded if fields can be conditionally weighted based on the values in other fields. For example, since women can change their surname when they marry, a female record pair with different surnames would receive less of a match penalty than a male record pair with different surnames. Birth dates can be compared as a sequence of digits to catch typographical errors, and they can be compared as dates for which a difference of 154 days or less means that two very similar individuals are more likely to be the same individual or in rare cases, twins, than they are to be siblings.
Further, exactly how the similarity of two fields is calculated offers many possible variations. Where data is entered by an individual listening to a patient report his or her name, algorithms that translate names into their phonetic equivalents may be useful, though these have often been designed with English sounds in mind. Mathematical string comparisons based on the number of omissions, additions, and transpositions required to transform one string into another are more broadly applicable but less well adapted to specific situations.
For all of these reasons it is important for best results and to avoid unanticipated bias in your results to understand the characteristics of your data, your patient population, and the methods underlying the software you chose.
What is the probabilistic linkage process?
- Select variables that you will use in the matching process. These are the variables you will want to prioritize for cleaning. Common variables include Patient Name, Date of Birth, Sex, Social Security Number, Medical Record Number, Ethnicity, Race, Address. It is important to explore your data before you begin the linkage process. Some variables will be less reliable than others. You can remove unreliable variables from the linkage process or change the weighting system accordingly so record pairs are not penalized for not matching on that variable.
- Reformat your data
- Create new fields for each variable you will be cleaning.Copy original fields into these new fields and start cleaning and standardizing the copied data while leaving the original fields untouched. Retaining the unedited original fields allows you to link generated match ID’s (the end result of the linkage process, indicating which records should be grouped together as belonging to the same person) back to the original datasets.
- Parse variables into component parts. This includes separating First Name, Middle Name, Last Name, and Suffix into individual columns.
- Standardize matching variable formats. This includes stripping social security numbers of dashes, checking that all dates are in the date format, and ensuring all ID numbers were imported as text (to retain leading zeros). Add the hospital’s first letter (or unique first letters) to the beginning of MRNs to avoid the possibility of duplicates across hospitals.
- Clean you data.
- Clean out obviously invalid data. For example, dates that are in the future or “x” for sex when no coding is used. Data cleaning can be a time-consuming process but can have a significant impact on the quality of your linkage results. For more information on cleaning, refer to section 3.1.
- Group on fields that will be imported into the linkage software. This will remove exact duplicates and speed the matching process.
- Prepare your data for import
- Minimize variable widths. Query to find the maximum length of each field and resize accordingly.
- Create synthetic / analytic variables [e.g. new sub-string variables if partial values are common]
- Partition the tables to create one with only matching variables and unique record ID
If you are using LinkageWiz, your data is ready to import. Click here for to view step-by-step instructions on how to use LinkageWiz.
- Create a new project
- Import Data
- You have the option of importing data in one table or multiple. You will need to specify the data type for each variable. If you are importing multiple tables, it is important to keep the data type of the same variables consistent.
- Navigate to Define Match Variables -> Assign those variables that correspond to preset variables in LinkageWiz. You may need to add categories for unique identifiers used by your hospital system.
- Define weights for match variables without existing roles
- Assess the risks and cases for false positives and false negatives
- Set blocking variables
- Run the match
- Validate at the top and around the threshold
- Manually check subsets
- Tweak, iterate
- Assess the non-overlap in the matched population between iterations
- Tweak, iterate
- Manually review high priority groups / sample and determine an acceptable estimate error rate
- Manually split/match high priority groups / sample to achieve desired error rate and determine minimally burdensome rules for manual split/match
Aside from LinkageWiz, there are several software programs you can choose from to complete the linkage process. The table below offers a comparison of features of some of the more popular options.
|Price||Last Listed Update||Input Data Format||Has GUI for reviewing matches?||String Encoding & Comparison||Data cleaning tools avaiable?|
|SQL, XML, or flat-file database||Unknown||Phonetic encoding (unknown algorithm)||Unknown|
|Febrl||Free||2013 (v 0.4.2)|
|Text files||No||Soundex, Double Metaphone, JaroWinkler, q-gram ...26 methods in all ||Yes|
|FRIL||Free||2011 (v 2.1.5)|
|Text files, MS Excel files, MS Access files, MSSQL Server, Oracle, MySQL and PostgreSQL||Yes||Soundex, edit-distance (algorithm unknown), q-gram||Yes|
|Link King||Free |
(SAS license required)
|2012 (v 7.1.21)|
|Text files, SAS datasets, SPSS portable files, MS Excel files||Yes||Approximate string matching algorithm, SAS's spedis function, Soundex, NYSIIS||Yes|
|LinkageWiz||$2,999 for unlimited records|
$1,200 for 500,000 records per table
free for 2,000
|2013 (v 5.5.42)|
|Text files, MS Access files, Dbase III, IV, V, Paradox 3.x-5.x, MS Excel files||Yes||Soundex, NYSIIS, Levenshtein ||Yes|
|LinkPlus||Free||2007 (v 2.0)|
|Text files and North American Association Central Caner Registries files||Yes||Soundex, NYSIIS||No/unknown|
|RecordLinkage||Free||2015 (v 0.4-7)|
R RecordLinkage Package
|Text files||No||Soundex, German language algorithm pho_h, JaroWinkler, Levenshtein||No|
|RELAIS||Free||2011 (v 2.2.1)|
|Text files||No||Soundex, 3-grams, Dice, Jaro, JaroWinkler, Levenshtein||Yes|