3.3 Probabilistic Linkage

What is linkage?

Record linkage is the process of identifying data records that refer to the same person, object, or event across data sources.  

Record linkage has many practical applications; it is used most frequently to create master files for populations or to extend record-level information in one data set with information from another.  In hotspotting, record linkage is used to link patient information across different hospitals so that ED and inpatient use can be analyzed.


Modern computer-based record linkage methods fall into two categories, deterministic linkage and probabilistic linkage.

Deterministic Linkage

Deterministic linkage identifies a match between records by comparing unique identifiers or using an exact comparison between fields.  In some software programs, deterministic linkage may also employ almost-exact comparisons based on predetermined rules defined by users.  Deterministic linkage is not the best option if your data has quality issues, as all but the most trivial datasets do.

 Probabilistic Linkage

Probabilistic linkage calculates a total score for two records to determine how likely it is that both refer to the same individual.  The total score is the sum of scores generated by the comparison of individually weighted fields.  Based on the total score, record pairs will fall above or below predetermined thresholds and will be identified as matches, non-matches, or uncertain matches which should get closer scrutiny before determining whether they should be considered a match or not. (...read more about probabilistic linkage)

What is the probabilistic linkage process?

    1. Select variables that you will use in the matching process. These are the variables you will want to prioritize for cleaning.  Common variables include Patient Name, Date of Birth, Sex, Social Security Number, Medical Record Number, Ethnicity, Race, Address.  It is important to explore your data before you begin the linkage process.  Some variables will be less reliable than others.  You can remove unreliable variables from the linkage process or change the weighting system accordingly so record pairs are not penalized for not matching on that variable.
    2. Reformat your data
      1. Create new fields for each variable you will be cleaning.Copy original fields into these new fields and start cleaning and standardizing the copied data while leaving the original fields untouched.  Retaining the unedited original fields allows you to link generated match ID’s (the end result of the linkage process, indicating which records should be grouped together as belonging to the same person) back to the original datasets.
      2. Parse variables into component parts.  This includes separating First Name, Middle Name, Last Name, and Suffix into individual columns.
      3. Standardize matching variable formats.  This includes stripping social security numbers of dashes, checking that all dates are in the date format, and ensuring all ID numbers were imported as text (to retain leading zeros).  Add the hospital’s first letter (or unique first letters) to the beginning of MRNs to avoid the possibility of duplicates across hospitals.
    3. Clean you data.
      1. Clean out obviously invalid data.  For example, dates that are in the future or “x” for sex  when no coding is used. Data cleaning can be a time-consuming process but can have a significant impact on the quality of your linkage results.  For more information on cleaning, refer to section 3.1.
    4. Group on fields that will be imported into the linkage software.  This will remove exact duplicates and speed the matching process.
    5. Prepare your data for import
      1. Minimize variable widths.  Query to find the maximum length of each field and resize accordingly.
      2. Create synthetic / analytic variables [e.g. new sub-string variables if partial values are common]
      3. Partition the tables to create one with only matching variables and unique record ID

LinkageWiz Steps

If you are using LinkageWiz, your data is ready to import.  Click here for to view step-by-step instructions on how to use LinkageWiz.

  1. Create a new project
  2. Import Data
    1. You have the option of importing data in one table or multiple. You will need to specify the data type for each variable.  If you are importing multiple tables, it is important to keep the data type of the same variables consistent.
  3. Navigate to Define Match Variables -> Assign those variables that correspond to preset variables in LinkageWiz.  You may need to add categories for unique identifiers used by your hospital system.
  4. Define weights for match variables without existing roles
  5. Assess the risks and cases for false positives and false negatives
  6. Set blocking variables
  7. Run the match
  8. Validate at the top and around the threshold
  9. Manually check subsets
  10. Tweak, iterate
  11. Assess the non-overlap in the matched population between iterations
  12. Tweak, iterate
  13. Manually review high priority groups / sample and determine an acceptable estimate error rate
  14. Manually split/match high priority groups / sample to achieve desired error rate and determine minimally burdensome rules for manual split/match


Software Options

Aside from LinkageWiz, there are several software programs you can choose from to complete the linkage process.  The table below offers a comparison of features of some of the more popular options.

 PriceLast Listed UpdateInput Data FormatHas GUI for reviewing matches?String Encoding & ComparisonData cleaning tools avaiable?
SQL, XML, or flat-file databaseUnknownPhonetic encoding (unknown algorithm)Unknown
FebrlFree2013 (v 0.4.2)
Text filesNoSoundex, Double Metaphone, JaroWinkler, q-gram ...26 methods in all
FRILFree2011 (v 2.1.5)
Text files, MS Excel files, MS Access files, MSSQL Server, Oracle, MySQL and PostgreSQL YesSoundex, edit-distance (algorithm unknown), q-gramYes
Link KingFree
(SAS license required)
2012 (v 7.1.21)
Link King
Text files, SAS datasets, SPSS portable files, MS Excel files
YesApproximate string matching algorithm, SAS's spedis function, Soundex, NYSIISYes
LinkageWiz$2,999 for unlimited records

$1,200 for 500,000 records per table

free for 2,000
2013 (v 5.5.42)
Text files, MS Access files, Dbase III, IV, V, Paradox 3.x-5.x, MS Excel filesYesSoundex, NYSIIS, Levenshtein
LinkPlusFree2007 (v 2.0)
Link Plus
Text files and North American Association Central Caner Registries files
YesSoundex, NYSIIS
RecordLinkageFree2015 (v 0.4-7)
R RecordLinkage Package
Text filesNoSoundex, German language algorithm pho_h, JaroWinkler, Levenshtein
RELAISFree2011 (v 2.2.1)
Text filesNoSoundex, 3-grams, Dice, Jaro, JaroWinkler, Levenshtein