Record linkage in Pandas
Record linkage is the process of linking records from different data sources (e.g. pandas dataframes) using any fields in common between them.
In this blog post, I’ll walk you through linking records in pandas dataframes using pandas-dedupe, a Python library built on top of dedupe.io. It links records from two different pandas dataframes by combining active learning, logistic regression and hierarchical clustering.
This is my second tutorial about pandas-dedupe. If you wish to learn more about the inner workings of this library as well as its parent dedupe.io, please refer to part I — Records deduplication in Pandas (see ‘[the intuition]’ section) or the excellent dedupe.io documentation.
[the code] How record linkage _is done_
First, let’s install pandas-dedupe.
pip install pandas-dedupe
Note: the package name contains a dash (‘-’), while the module itself is imported as pandas_dedupe, with an underscore :)
Second, let’s get some data and import it into pandas.
For this example, I downloaded a subset of some noisy string data representing people’s names. The data is publicly available at the following link.
It is a very simple dataset consisting of a single variable (i.e. ‘name’) with a total of 150 full names, which may or may not be misspelled.
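As a rough sketch, loading such a file comes down to a single pd.read_csv call. The three names below are made-up stand-ins held in memory so that the snippet runs on its own; they are not taken from the real dataset, and with the real download you would pass the file path instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: three made-up names, NOT the real dataset.
# With the real file, pass its path to pd.read_csv instead of this buffer.
raw_csv = io.StringIO("name\nJohn Smith\nJon Smith\nMaria Garcia\n")

# The dataset has a single column, 'name'
df = pd.read_csv(raw_csv)
print(df.head())
```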
The data looks like this:
A. Shuffle the data
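One common way to shuffle a dataframe is to sample all of its rows without replacement. The sketch below assumes that approach; the toy names and the random_state value are placeholders of mine, not part of the original data.

```python
import pandas as pd

# Toy stand-in for the names dataframe (made-up values)
df = pd.DataFrame(
    {"name": ["John Smith", "Jon Smith", "Maria Garcia", "Mara Garcia"]}
)

# frac=1 draws every row exactly once, which amounts to a full shuffle.
# random_state only makes the shuffle reproducible; drop it for a fresh order.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```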
B. Split the shuffled data into left and right dataframes.
These dataframes will be our two data sources, whose records we will link together.
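A minimal sketch of the split, assuming the data is simply cut in half by position (the exact proportions are my assumption). The names df_left and df_right, like the toy values, are illustrative.

```python
import pandas as pd

# Toy stand-in for the shuffled dataframe (made-up values)
shuffled = pd.DataFrame(
    {"name": ["Jon Smith", "Maria Garcia", "John Smith", "Mara Garcia"]}
)

# Cut the rows in half by position; each half gets a fresh 0-based index
half = len(shuffled) // 2
df_left = shuffled.iloc[:half].reset_index(drop=True)
df_right = shuffled.iloc[half:].reset_index(drop=True)

print(len(df_left), len(df_right))
```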
C. Link records
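The linkage itself is a single call to pandas-dedupe's link_dataframes, which takes the two dataframes and the list of fields to compare. In the sketch below, the wrapper function link_names and the argument names are my own; the call is wrapped this way because running it opens an interactive labelling session in the console.

```python
def link_names(df_left, df_right):
    """Link two dataframes on their shared 'name' column.

    Hypothetical helper: calling it starts an interactive labelling
    session in the console, so nothing runs until you invoke it.
    """
    # Imported here so the sketch loads even before pandas-dedupe is installed
    import pandas_dedupe

    # Third argument: the list of fields used to compare records
    return pandas_dedupe.link_dataframes(df_left, df_right, ["name"])
```

With the two halves from the previous step, usage would be `linked = link_names(df_left, df_right)`.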
At this point, users are asked to manually label a sample of records, which will guide pandas-dedupe in identifying distinct and similar records. When labelling starts, the following UI will pop up:
Users now have the chance to compare examples and decide whether [i] the records are the same (‘y’), [ii] the records are distinct (‘n’), [iii] they are unsure (‘u’) or [iv] they are done labelling (‘f’). If they make a mistake, they can always go back to the previous example by pressing ‘p’.
Once labelling is over, dedupe will take care of the rest: it assigns records to clusters (‘cluster id’) and reports its confidence in the linkage (‘confidence’). Note that if a record could not be assigned to a cluster, pandas-dedupe will return ‘N/A’ for the ‘cluster id’ variable.
Here is what the output looks like:
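Once you have the result, a natural next step is to separate confident links from unassigned records. The sketch below works on a hand-made stand-in for the output: the values, the 0.9 threshold, and the ‘N/A’ in the ‘confidence’ column are assumptions for illustration, not real results.

```python
import pandas as pd

# Hand-made stand-in for the linked output; values are illustrative only
linked = pd.DataFrame(
    {
        "name": ["John Smith", "Jon Smith", "Maria Garcia"],
        "cluster id": [0, 0, "N/A"],
        "confidence": [0.97, 0.97, "N/A"],
    }
)

# Coerce to numbers so unassigned rows become NaN however they are encoded
cluster = pd.to_numeric(linked["cluster id"], errors="coerce")
confidence = pd.to_numeric(linked["confidence"], errors="coerce")

# 0.9 is an arbitrary threshold chosen for this example
matched = linked[cluster.notna() & (confidence >= 0.9)]
unmatched = linked[cluster.isna()]

print(matched)
print(unmatched)
```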
And that is it!
Record linkage can be very time-consuming. I hope this helps speed things up a little (: