Records deduplication in Pandas
How many times have you found yourself in a situation where you had to deal messy data, especially reconciliate mispellings, short forms of popular names, leading/trailing blanks and so on? I’m talking about something like this:
Does this look familiar? It is A-N-N-O-Y-I-N-G, R-I-G-H-T?
I have a good suggestion for you: pandas-dedupe! This is a python library which makes deduplication and canonicalization of medium/large dataframes easy in Pandas.
In the rest of the blog post I’ll first go through the intuition around dedupe and then I’ll show you all the steps to deduplicate records in python pandas. This is Part I of a series of blog posts discussing the pandas-dedupe library. Part II is about records linkage whereas Part III covers gazetteer deduplication.
[the intuition] How deduplication _works_
In broad terms, dedupe constists of three main blocks:
- Define similarity metric.
Dedupe uses affine gap distance as the similarity metric to estimate the distance between two records. - Identify similar records in the dataset.
This is done by supplying dedupe with pairs of records that users have labelled as being duplicates or distinct. From this, dedupe will learn a set of weigths using logistic regression which help to translate the similarity metric into the probability of two records being duplicates. - Group all similar records together.
Pairs of duplicates have to be transformed into groups (or clusters) of duplicate records. This is done by using hierarchical clustering with centroid whereby candidate records that lie within a certain distance from the centroid are duplicates.
If you want to learn more about dedupe inner workings, have a look at this detailed blog post taken from dedupe.io.
[the code] How deduplication _is done_
First, let’s install pandas-dedupepip install pandas-dedupe
Note: there is a dash (‘-’) :)
Second, let’s import the relevant libraries and get some data:
It is a very simple dataset which consists of two variables: ‘city’ and ‘country’. However, many of the city names have been misspelled. Some of the mistakes include simple typos (e.g. ‘Saettle’ and ‘Seatle’), leading and trailing blanks (e.g. ‘New York’ and ‘ New York’) but also city names in different languages (e.g. ‘Milan’ and ‘Milano’).
Third, let’s start the deduplication process
At this point, users are asked to manually label a sample of records — not too much … just some of them :) The UI looks simple and intuitive, as follows:
Users have now the chance to compare examples and decide whether:
- records are the same (i.e. press ‘y’);
- records are different (i.e. press ’n’);
- they are unsure. In this case, they can press ‘u’ and dedupe will skip this record;
- they are done labelling (i.e. press ‘f’) [YayyyyyY!!!];
- Ohh! And if they realise they made a mistake, they can press ‘p’ to go back to the previous example.
Once users are done labelling, dedupe will take care of the rest: assign record to clusters (‘cluster id’), including reporting confidence in the grouping (‘confidence’) and, where relevant, canonicalize records (‘canonical_city’). Here is how the output looks like:
All examples have been assigned to a sensible cluster and canonicalized correctly.
Note: You migh be wondering how dedupe picks the ‘canonical_city’ name. This is the city name that appears with the highest frequency in each cluster. To illustrate, the record ‘Milano’ has been canonicalized to ‘Milan’ since this is the most frequent city name in cluster 3.
And that is it! Personally, I find pandas-dedupe incredibly helpful and cool. Hope you find it useful too!! ! Let me know if you have any comments.
(:
[Geek note: the post is based on pandas-dedupe v1.4.0]