Probabilistic Linkage

What is probabilistic linkage?

Probabilistic linkage is a method of record linkage, the process of identifying data records that refer to the same person, object, or event across different data sources.  Record linkage has many practical applications; it is used most frequently to create master files for populations or to extend record-level information in one data set with information from another.  In hotspotting, record linkage is used to connect patient information across different hospitals so that ED and inpatient use can be analyzed.

Modern computer-based record linkage methods fall into two categories: deterministic linkage and probabilistic linkage (sometimes called fuzzy matching).

Deterministic Linkage

Deterministic linkage identifies a match between records by comparing unique identifiers or using an exact comparison between fields.  Deterministic linkage is not the best option if your data has quality issues, because a single typo, missing value, or formatting difference will prevent a match.
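A minimal sketch of deterministic linkage, assuming two hypothetical hospital record sets that share an exactly formatted SSN field (the field names and sample records are invented for illustration):

```python
# Deterministic linkage: join two record sets on an exact comparison of a
# shared identifier (here a hypothetical "ssn" field).
def deterministic_link(records_a, records_b, key="ssn"):
    # Index one side by the identifier, then probe it with the other side.
    index = {r[key]: r for r in records_b if r.get(key)}
    return [(a, index[a[key]]) for a in records_a if a.get(key) in index]

hospital_a = [{"id": 1, "ssn": "123456789", "name": "Ana Diaz"}]
hospital_b = [{"id": 9, "ssn": "123456789", "name": "Ana M. Diaz"}]
pairs = deterministic_link(hospital_a, hospital_b)
# One dirty digit in either SSN and this pair would be missed entirely,
# which is why deterministic linkage struggles with data quality issues.
```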

Probabilistic Linkage

Probabilistic linkage calculates a total score for two records to determine how likely it is that both refer to the same individual.  The total score is the sum of scores generated by the comparison of individually weighted fields.  Based on the total score, record pairs will fall above or below predetermined thresholds and will be identified as matches, non-matches, or uncertain matches.
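A toy Python sketch of this scoring idea.  The field weights and thresholds are invented for illustration, and the sketch is deliberately simplified: production scoring (e.g., Fellegi-Sunter) also subtracts disagreement weights, while this version only adds agreement weights.

```python
# Each agreeing field contributes its weight to the total score; the total
# is then classified against two predetermined thresholds.
WEIGHTS = {"last_name": 5.0, "first_name": 3.0, "dob": 8.0, "zip": 2.0}
UPPER, LOWER = 12.0, 6.0  # illustrative thresholds, not calibrated values

def score_pair(rec_a, rec_b):
    # Sum the weights of fields that are present and exactly equal.
    return sum(w for field, w in WEIGHTS.items()
               if rec_a.get(field) and rec_a.get(field) == rec_b.get(field))

def classify(total):
    if total >= UPPER:
        return "match"
    if total <= LOWER:
        return "non-match"
    return "uncertain"

a = {"last_name": "diaz", "first_name": "ana", "dob": "1980-02-01", "zip": "08103"}
b = {"last_name": "diaz", "first_name": "anna", "dob": "1980-02-01", "zip": "08103"}
total = score_pair(a, b)  # 5 + 8 + 2 = 15: a match despite the misspelled first name
```

Note how the heavily weighted date of birth carries the pair over the upper threshold even though one field disagrees; that tolerance for dirty data is the advantage over deterministic linkage.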

What probabilistic linkage software is right for me?

Six tools were compared: LinkageWiz, SureMatch, DeDupe4Excel, DataSet V Suite, LinkPlus, and Febrl.

LinkageWiz
Price: $3,495 for Enterprise Edition, unlimited records; $1,995 for up to 500,000 records.
Importing data: Straightforward.
Compatibility: Access; Dbase; Paradox; Excel; text.
Strengths: Small graphical user interface (GUI) footprint; reasonable online help; good control over the matching process.
Weaknesses: Slightly convoluted process for dealing with output; implied process for configuring the tool not obvious via the user interface.

SureMatch
Price: $3,684 for Professional Edition; $7,377 for Enterprise Edition (2 sources allowed); $11,071 for Corporate Edition (VB scripting allowed); all pricing is for a single user, with discounts up to 30% for simultaneous purchase of additional users; unlimited records.
Importing data: Import wizard; can be confusing because some options are for advanced users and are rarely relevant.
Strengths: Very clear reporting of results; understandable options for sufficiently configuring the tool; balances the tool's ease of use and learning curve well against the user's degree of control over the process.
Weaknesses: Unnecessarily mixes different levels of abstraction in the user interface by intermingling advanced options that require knowledge of SQL with mainstream functionality for the non-technical user; requires a primary key to be present in the data where it could easily be generated; few export options.

DeDupe4Excel
Price: $349 per user, and each user must have a copy of MS Excel installed.
Strengths: Extremely easy to use (2 mouse clicks); very clear reporting of results; very inexpensive; takes advantage of widespread familiarity with Excel to lessen the learning curve.
Weaknesses: Limited control over the matching process; tied exclusively to Excel; rigid in file format.

DataSet V Suite
Price: $750 per installation, unlimited users, unlimited records; an excellent value, given the features of the tool.
Strengths: Very inexpensive for a stand-alone tool; the user has a lot of control over the manner in which data are interpreted and matches are reviewed and edited.
Weaknesses: Somewhat cluttered and unintuitive user interface; no online help; the user manual is comprehensive but does not explain functionality clearly in all cases.

LinkPlus
Price: Free.

Febrl
Price: Free.


What is the probabilistic linkage process?

Clean your data

  1.      Select the variables you will use in the matching process.  These are the variables to prioritize for cleaning.
  2.      Create new fields for each variable you will be cleaning.  Retaining the unedited original fields allows you to link the newly generated match IDs back to the original datasets.
  3.      Parse variables into component parts: break the name variable down into first name, last name 1, last name 2, middle initial, and suffix.
  4.      Standardize matching variable formats.  Dates of birth and Social Security numbers are often formatted differently across hospitals: strip SSNs of dashes and make sure dates are in a consistent date format.
  5.      Format ID numbers as text, not numbers, to retain leading zeros.
  6.      Identify hospitals by prefixing MRNs with the hospital's first letter.
  7.      Clean out obviously invalid data.  Refer to this module's section on data cleaning to create a framework that works with your data.
  8.      Group on all fields that will be imported into the linkage software.  This will remove exact duplicates and speed the matching process.
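Several of the cleaning steps above (parsing a name into components, stripping dashes from SSNs, keeping IDs as zero-padded text, and prefixing MRNs with the hospital's first letter) can be sketched in Python.  The field names, the simplified two-part name parse, and the hospital name are invented for illustration:

```python
import re

def clean_record(raw, hospital="Cooper"):
    rec = dict(raw)  # copy, so the unedited original fields are retained
    # Simplified name parse (first and last only; a real parser would also
    # handle last name 2, middle initial, and suffix).
    parts = raw["name"].replace(",", " ").split()
    rec["first_name"] = parts[0].upper()
    rec["last_name_1"] = parts[-1].upper()
    # Strip everything but digits from the SSN.
    rec["ssn_clean"] = re.sub(r"\D", "", raw.get("ssn", ""))
    # MRN as text with leading zeros, prefixed by the hospital's first letter.
    rec["mrn_clean"] = hospital[0] + str(raw["mrn"]).zfill(8)
    return rec

row = {"name": "Ana Diaz", "ssn": "123-45-6789", "mrn": 4521}
cleaned = clean_record(row)  # originals kept; cleaned fields added alongside
```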

Prepare your data for import

  1.      Minimize variable widths.  Query to find the maximum length of each field and resize it accordingly.
  2.      Create synthetic/analytic variables (e.g., new sub-string variables if partial values are common).
  3.      Partition the tables to create one containing only the matching variables and a unique record ID.
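The preparation steps above can be sketched in plain Python; the field names and sample records are invented for illustration:

```python
records = [
    {"rid": 1, "last_name": "DIAZ", "ssn": "123456789", "notes": "long free text ..."},
    {"rid": 2, "last_name": "DIAZGRANADOS", "ssn": "987654321", "notes": ""},
]

# 1. Find the maximum length of each field, to minimize variable widths.
widths = {f: max(len(str(r[f])) for r in records) for f in records[0]}

# 2. Synthetic sub-string variable (here, the first 4 letters of the last name).
for r in records:
    r["last4"] = r["last_name"][:4]

# 3. Partition: a table with only the matching variables and the unique record ID.
match_vars = ("rid", "last_name", "last4", "ssn")
match_table = [{k: r[k] for k in match_vars} for r in records]
```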

If you are using LinkageWiz, your data is ready to import

LinkageWiz Steps

1)      Create a new project

2)      Import Data

–   You have the option of importing data in one table or multiple.

–   You will need to specify the data type for each variable.  If you are importing multiple tables, it is important to keep the data type of the same variables consistent.

3)      Navigate to Define Match Variables and assign the variables that correspond to preset variables in LinkageWiz.  You may need to add categories for unique identifiers used by your hospital system.

4) Define weights for match variables without existing roles

5) Assess the risks and consequences of false positives and false negatives

6)  Set blocking variables

7) Run the match

8) Validate matches at the top of the score distribution and around the threshold

9) Manually check subsets

10) Tweak, iterate

11) Assess the non-overlap in the matched population between iterations

12) Tweak, iterate

13) Manually review high-priority groups / sample and determine an acceptable estimated error rate

14) Manually split/match high priority groups / sample to achieve desired error rate and determine minimally burdensome rules for manual split/match
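Blocking (step 6 above) can be sketched as follows.  The blocking key used here, the first three letters of the last name plus the birth year, is an invented example; field names and sample records are also illustrative:

```python
from collections import defaultdict
from itertools import combinations

# Blocking: only records that share a blocking key are compared, shrinking
# the candidate pairs from O(n^2) toward something tractable.
def candidate_pairs(records):
    blocks = defaultdict(list)
    for rec in records:
        key = (rec["last_name"][:3], rec["dob"][:4])
        blocks[key].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"rid": 1, "last_name": "DIAZ", "dob": "1980-02-01"},
    {"rid": 2, "last_name": "DIAS", "dob": "1980-07-15"},   # same block ("DIA", "1980")
    {"rid": 3, "last_name": "SMITH", "dob": "1980-02-01"},  # different block, never compared
]
pairs = list(candidate_pairs(records))  # only records 1 and 2 are compared
```

A blocking key that is too strict will hide true matches (a typo in the first three letters puts a record in the wrong block), which is why iterating over different blocking variables, as the steps above describe, matters.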