Gazetteer deduplication in Pandas
Gazetteer deduplication matches a messy dataset against a ‘canonical’ dataset (i.e. a gazette). The former may contain misspellings, typos and leading or trailing blanks, whereas the latter must be clean and well formatted. The goal is to match records between the two sources so that each misspelt entry can be replaced (i.e. canonicalized) with its correct form in the gazette.
An example of gazetteer deduplication is cleaning up a messy list of city names reported by online users by matching it against a clean list of names taken from a trusted source like OpenStreetMap.
In this tutorial, I’ll go through an example showing how to perform gazetteer deduplication using pandas-dedupe, a Python library built on top of dedupe. This is my third tutorial about pandas-dedupe. If you wish to learn more about the inner workings of this library and of its parent, dedupe.io, please refer to part I, Records deduplication in Pandas (see the ‘[the intuition]’ section), or the excellent dedupe.io documentation.
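To make the goal concrete, here is a minimal sketch of canonicalizing messy city names against a gazette using only Python's standard library. It is not how dedupe works internally (dedupe learns a matching model from labelled examples); it simply shows what "replace each messy entry with its gazette form" means. The city lists, the `canonicalize` helper and the 0.8 cut-off are all made up for illustration.

```python
import difflib

# Hypothetical gazette: a clean, trusted list of city names.
gazette = ["London", "Manchester", "Birmingham", "Liverpool"]

# Hypothetical messy entries as users might type them.
messy = ["  Lodnon", "manchestr ", "Birmingam", "Atlantis"]


def canonicalize(entry, gazette, cutoff=0.8):
    """Return the closest gazette entry, or None if nothing is similar enough."""
    # Compare in lower case, but return the gazette's own spelling.
    lookup = {name.lower(): name for name in gazette}
    cleaned = entry.strip().lower()
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


canonical = [canonicalize(entry, gazette) for entry in messy]
print(canonical)
```

Entries with no sufficiently similar gazette name come back as `None`, so they can be reviewed by hand rather than silently mapped to the wrong city.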
[the code] How gazetteer deduplication _is done_
First, install pandas-dedupe from GitHub:
pip install git+https://github.com/Lyonk71/pandas-dedupe.git
Second, let’s import the relevant libraries and get some data:
We have two main dataframes:
- clean_data: this is our gazette. It contains a clean and well-formatted list of street names.
Note: the gazette dataframe should consist of only a single variable.
- messy_data: this is the dirty dataframe we want to deduplicate and canonicalize. It is a simple dataframe containing two variables: ‘street_name’ and ‘city’. The variable ‘street_name’ contains many typos that we need to clean up.
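As a rough sketch of the setup described above, the two dataframes could be built like this. The rows are hypothetical and chosen only to mirror the description (one clean column in the gazette, ‘street_name’ and ‘city’ in the messy data); they are not the data used in this tutorial.

```python
import pandas as pd

# Hypothetical gazette: a single clean, well-formatted column.
clean_data = pd.DataFrame(
    {"street_name": ["Oxford Street", "Baker Street", "Abbey Road"]}
)

# Hypothetical messy data: typos and stray blanks in 'street_name'.
messy_data = pd.DataFrame(
    {
        "street_name": ["Oxfrod Street", " baker stret", "Abbey Raod "],
        "city": ["London", "London", "London"],
    }
)

print(clean_data)
print(messy_data)
```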
Third, let’s start the deduplication process:
At this point, users are asked to manually label a sample of records by comparing entries in clean_data against records in messy_data and establishing whether each pair is the same (‘y’) or distinct (‘n’). While labelling, users can skip a pair they are unsure about by pressing ‘u’, or go back to the previous one (‘p’) in case they made a mistake. Finally, they can press ‘f’ to terminate the task. The UI looks like the following:
Once users are done labelling, dedupe takes care of the rest: it assigns records to clusters (‘cluster id’), reports its confidence in each grouping (‘confidence’) and canonicalizes records (‘canonical_street_name’).
Here is what the output looks like:
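A minimal sketch of this step, assuming the call shape shown in the pandas-dedupe README as I recall it: `gazetteer_dataframe` takes the clean dataframe, the messy dataframe and the field to match on, with `canonicalize=True` to request the canonical column. The `run_gazetteer` wrapper is a hypothetical helper of my own, and the signature is worth checking against your installed version.

```python
def run_gazetteer(clean_data, messy_data, field="street_name"):
    """Match messy_data against the gazette clean_data on a single field."""
    # Imported lazily so this definition loads even where
    # pandas-dedupe is not installed.
    import pandas_dedupe

    # Starts the interactive labelling session in the console, then
    # returns the matched dataframe.
    return pandas_dedupe.gazetteer_dataframe(
        clean_data, messy_data, field, canonicalize=True
    )


# Usage (interactive, so left commented out):
# result = run_gazetteer(clean_data, messy_data)
```

The call blocks until labelling is finished, so it is best run from a terminal or notebook where you can answer the prompts.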
All examples have been assigned to a sensible cluster with reasonable confidence and canonicalized correctly.
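On larger or noisier data, not every match will be this clean, so it can help to accept the canonical form only above some confidence level. The sketch below assumes the output columns named above (‘cluster id’, ‘confidence’, ‘canonical_street_name’); the rows, the 0.8 cut-off and the two new column names are hypothetical.

```python
import pandas as pd

# Hypothetical output rows; real values come from the deduplication step.
result = pd.DataFrame(
    {
        "street_name": ["Oxfrod Street", " baker stret", "Abbey Raod "],
        "cluster id": [0, 1, 2],
        "confidence": [0.95, 0.40, 0.88],
        "canonical_street_name": ["Oxford Street", "Baker Street", "Abbey Road"],
    }
)

MIN_CONFIDENCE = 0.8  # assumed cut-off, tune for your data

# Accept the canonical form only where the model is confident enough;
# otherwise keep the original value (whitespace-stripped) for manual review.
confident = result["confidence"] >= MIN_CONFIDENCE
result["street_name_clean"] = result["canonical_street_name"].where(
    confident, result["street_name"].str.strip()
)
result["needs_review"] = ~confident

print(result[["street_name_clean", "needs_review"]])
```

Keeping a `needs_review` flag rather than dropping low-confidence rows makes it easy to label those records in a later training round.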