Enrico Alemani

20 Followers

Dec 19, 2020

Gazetteer deduplication in Pandas

Gazetteer deduplication matches a messy dataset against a ‘canonical’ dataset (i.e. a gazetteer). The former contains misspellings, typos and leading/trailing blanks, whereas the latter must be clean and well formatted. The goal is to match records between the two sources so that each misspelt entry can be replaced (i.e…
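
To make the idea concrete, here is a minimal sketch of matching messy values against a canonical list. It uses `difflib` from the Python standard library as a stand-in, not the approach taken in the post itself; the sample city names, the `match_to_gazette` helper and the `cutoff` value are all made up for illustration.

```python
from difflib import get_close_matches

# Hypothetical canonical list (the gazetteer) and a messy column to clean.
gazette = ["London", "Manchester", "Birmingham"]
messy = ["  london", "Manchestr ", "Birmingam", "Paris"]


def match_to_gazette(value, canonical, cutoff=0.8):
    """Return the closest canonical entry for value, or None if nothing is close enough."""
    # Normalise whitespace and case before comparing.
    cleaned = value.strip().lower()
    # Map lower-cased canonical names back to their original spelling.
    lookup = {name.lower(): name for name in canonical}
    hits = get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Map every messy entry to its canonical replacement (None means no match).
matches = {value: match_to_gazette(value, gazette) for value in messy}
```

Entries with no sufficiently similar canonical name map to `None`, so they can be reviewed by hand rather than silently replaced.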

Pandas

3 min read

Dec 15, 2020

Record linkage in Pandas

Record linkage is the process of linking records from different data sources (e.g. pandas dataframes) using any fields they have in common. In this blog post, I’ll talk you through linking records in pandas dataframes using pandas-dedupe, a Python library built on top of dedupe.io. …

3 min read

Nov 20, 2020

Records deduplication in Pandas

How many times have you found yourself having to deal with messy data, especially reconciling misspellings, short forms of popular names, leading/trailing blanks and so on? I’m talking about something like this: Does this look familiar? It is A-N-N-O-Y-I-N-G, R-I-G-H-T? I have a good suggestion for you…

3 min read

Jun 23, 2020

Flatten nested dictionaries in pandas using glom

Pandas is great! You can do pretty much everything with it, from data cleaning to quick data viz. How about working with a nested dictionary from a JSON file? pandas.json_normalize can do most of the work for you (most of the time). However, json_normalize gets slow when you want to flatten…
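
As a quick illustration of the `pandas.json_normalize` baseline mentioned above, here is a small sketch that flattens nested dictionaries into dotted column names. The `records` data is invented for the example, and the glom-based approach from the post is not shown.

```python
import pandas as pd

# Hypothetical nested records, e.g. as loaded from a JSON file.
records = [
    {"id": 1, "user": {"name": "Ann", "address": {"city": "Rome"}}},
    {"id": 2, "user": {"name": "Bob", "address": {"city": "Milan"}}},
]

# Each nested key becomes a column whose name joins the path with `sep`.
flat = pd.json_normalize(records, sep=".")
```

The resulting dataframe has one row per record and the columns `id`, `user.name` and `user.address.city`.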

Pandas

2 min read

May 10, 2020

The customized spaCy training loop

Customization and implementation of tips and advice for NER training. In this post, I explain how to customize the spaCy Named Entity Recognition (NER) training loop from the comfort of your Jupyter notebook, including the implementation of spaCy tips and advice on performance optimization. NB: the code snippets use spaCy…

NLP

2 min read

May 3, 2020

How to create training data for spaCy NER models using ipywidgets

spacy-annotator is a library for creating training data for spaCy Named Entity Recognition (NER) models using ipywidgets. The annotator is fast and, most importantly, can leverage existing spaCy models, even Transformers, to label your data and pre-fill the annotator for you. GitHub: https://github.com/ieriii/spacy-annotator …

Spacy

3 min read

Jan 4, 2020

Fake reviews detection and transfer learning

We apply Universal Language Model Fine-Tuning (ULMFiT) by Howard and Ruder (2018) to fake review detection and demonstrate that deep transfer learning outperforms the statistical techniques previously studied by Ott, Cardie and Hancock (2013) as well as a standard neural architecture. Additionally, we make several theoretical contributions, including showing that our…

NLP

7 min read

Data Scientist | Economist. Curious.
