Gazetteer deduplication in Pandas | Gazetteer deduplication is for matching a messy data set against a ‘canonical’ dataset (i.e. gazette). The former contains misspellings… | Dec 19, 2020
Record linkage in Pandas | Record linkage is the process of linking records from different data sources (e.g. pandas dataframes) using any fields in common between… | Dec 15, 2020
Records deduplication in Pandas | How many times have you found yourself in a situation where you had to deal with messy data, especially reconciling misspellings, short forms… | Nov 20, 2020
Flatten nested dictionaries in pandas using glom | Pandas is great! You can do pretty much everything with it: from data cleaning to quick data viz. How about working with nested dictionary… | Jun 23, 2020
The customized spaCy training loop | Customization and implementation of tips and advice for NER training | May 10, 2020
How to create training data for spaCy NER models using ipywidgets | In this post, I present the spacy-annotator: a library to create training data for the spaCy Named Entity Recognition (NER) model using… | May 3, 2020
Fake reviews detection and transfer learning | We apply the Universal Language Model Fine-Tuning (ULMFiT) by Howard and Ruder (2018) to fake reviews detection and demonstrate that deep… | Jan 4, 2020