Gazetteer deduplication in Pandas: Gazetteer deduplication is for matching a messy dataset against a ‘canonical’ dataset (i.e. gazette). The former contains misspellings… (Dec 19, 2020)
Record linkage in Pandas: Record linkage is the process of linking records from different data sources (e.g. pandas dataframes) using any fields in common between… (Dec 15, 2020, 1 response)
Records deduplication in Pandas: How many times have you found yourself in a situation where you had to deal with messy data, especially to reconcile misspellings, short forms… (Nov 20, 2020)
Flatten nested dictionaries in pandas using glom: Pandas is great! You can do pretty much everything with it: from data cleaning to quick data viz. How about working with nested dictionary… (Jun 23, 2020)
The customized spaCy training loop: Customization and implementation of tips and advice for NER training (May 10, 2020, 1 response)
How to create training data for spaCy NER models using ipywidgets: In this post, I present the spacy-annotator: a library to create training data for the spaCy Named Entity Recognition (NER) model using… (May 3, 2020, 2 responses)
Fake reviews detection and transfer learning: We apply the Universal Language Model Fine-Tuning (ULMFiT) method by Howard and Ruder (2018) to fake reviews detection and demonstrate that deep… (Jan 4, 2020)