How to create training data for spaCy NER models using ipywidgets

Enrico Alemani
3 min read · May 3, 2020


spacy-annotator is a library for creating training data for spaCy Named Entity Recognition (NER) models using ipywidgets.

The annotator is fast and, most importantly, can leverage existing spaCy models, even Transformers, to label your data and pre-fill the annotator for you!

GitHub: https://github.com/ieriii/spacy-annotator
And when you are ready to give it a go:

pip install spacy-annotator

Enjoy! (:


I love spaCy. It is awesome.

spaCy gives you a pre-trained model to solve NLP tasks as quick as a flash.

What about training your own model with custom labels?
Yes, you can do that too. However, it is not always a straightforward process. The main reason is that spaCy requires training data to be in a specific format. In particular, the Named Entity Recognition (NER) model requires annotated data, as follows:

("Free Text", {"entities": [(start, end, "LABEL1"), (start, end, "LABEL2"), (start, end, "LABEL3")]})

where “Free Text” is the text containing the entities you want to label, “start” and “end” are the character offsets of each entity, and “LABEL#” is the label assigned to it.
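As a concrete illustration, here is a minimal sketch of one training example in the (text, {"entities": [...]}) layout. The sentence and the COMPANY label are made up for this example. The offsets are computed with plain string methods, so slicing the text with them returns the entity.

```python
# One training example in the (text, {"entities": [...]}) layout.
text = "I love apple products"
entity = "apple"

start = text.find(entity)   # index of the first character of the entity
end = start + len(entity)   # the end offset is exclusive, as in Python slicing

example = (text, {"entities": [(start, end, "COMPANY")]})

# Sanity check: the offsets must slice the entity back out of the text.
assert text[start:end] == entity
print(example)  # ('I love apple products', {'entities': [(7, 12, 'COMPANY')]})
```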

To create your own training data, spaCy suggests using the PhraseMatcher, which matches tokens from a large terminology list against tokens in your free text. Although this is a good starting point, it gives users no control over which tokens end up labelled in the text. Example:

terminology_list = ['apple']
label = 'fruit'
free_text1 = 'I ate an apple'
free_text2 = 'I love apple products'

In this example, the token ‘apple’ is labelled as ‘fruit’ in both texts, although in free_text2 ‘apple’ refers to a ‘company’, not a ‘fruit’.
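The problem can be reproduced without spaCy. The sketch below is a plain-Python stand-in for terminology matching: it uses the standard `re` module rather than spaCy's PhraseMatcher, and `label_terms` is a hypothetical helper written for this post. Every occurrence of a listed term gets the same label, whatever the context.

```python
import re

def label_terms(text, terminology_list, label):
    """Label every whole-word occurrence of a listed term, ignoring context."""
    entities = []
    for term in terminology_list:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((match.start(), match.end(), label))
    return (text, {"entities": sorted(entities)})

terminology_list = ["apple"]
label = "fruit"
free_text1 = "I ate an apple"
free_text2 = "I love apple products"

result1 = label_terms(free_text1, terminology_list, label)
result2 = label_terms(free_text2, terminology_list, label)
print(result1)  # ('I ate an apple', {'entities': [(9, 14, 'fruit')]})
print(result2)  # ('I love apple products', {'entities': [(7, 12, 'fruit')]})
```

Both texts come back with a ‘fruit’ entity, which is exactly the lack of control described above.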

So, how do we fix this?

I developed the spacy-annotator, a simple interface to quickly label entities for NER using ipywidgets. The annotator gives users (almost) full control over which tokens are assigned a custom label in each piece of text. Here is a demo:

In the spacy-annotator, the annotate method requires the user to specify (at least) the following three arguments:

  • df: pandas dataframe;
  • col_text: column in the pandas dataframe containing text to be labelled;
  • labels: list of NER custom labels.

The annotator then shows a UI with instructions and a pre-filled template to be completed with one token, or with a list of tokens separated by a user-specified delimiter. Just copy and paste tokens into the template. The annotator takes care of the rest, including removing any leading or trailing blanks you might have accidentally inserted. Labelling jobs can also be personalised with optional keyword arguments, as follows:

  • model: if a spaCy model is passed into the annotator, the model is used to identify existing entities in the text and their probabilities. Pre-labelling each example with the best model currently available speeds up labelling; this trick is also known as noisy pre-labelling. Wanna go fancy? Feel free to use Transformers!
  • delimiter: user-specified delimiter used when listing entities in the UI;
  • fraction: option to define the size of a sample drawn from the full dataframe to be annotated;
  • strata: option to define strata in the sampling design.
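The pre-labelling idea can be sketched roughly as follows. This is not the annotator's actual code: `prefill_template` is a hypothetical helper, and it assumes only that calling the model on a string returns an object with spaCy's `doc.ents`, where each entity exposes `text` and `label_`. A tiny stand-in model is used so the snippet runs without downloading anything; a loaded spaCy pipeline could be passed in its place. Entity probabilities are left out of this sketch.

```python
from types import SimpleNamespace

def prefill_template(nlp, text, labels, delimiter=","):
    """Group the entities a model finds by label, one delimited string per label."""
    found = {label: [] for label in labels}
    for ent in nlp(text).ents:
        # Keep only the labels we want to annotate, and skip duplicates.
        if ent.label_ in found and ent.text not in found[ent.label_]:
            found[ent.label_].append(ent.text)
    return {label: delimiter.join(tokens) for label, tokens in found.items()}

def toy_model(text):
    """Stand-in for a spaCy pipeline: tags two hard-coded names if present."""
    known = {"London": "GPE", "Ada": "PERSON"}
    ents = [SimpleNamespace(text=name, label_=tag)
            for name, tag in known.items() if name in text]
    return SimpleNamespace(ents=ents)

template = prefill_template(toy_model, "Ada moved to London.", ["GPE", "PERSON"])
print(template)  # {'GPE': 'London', 'PERSON': 'Ada'}
```

The annotator's job is then reduced to correcting the pre-filled entries instead of typing them from scratch.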

The output is recorded in a separate ‘annotation’ column of the original pandas dataframe (df), ready to serve as input to a spaCy NER model. And that is it, really!
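To make that last step concrete, here is a rough sketch of how delimiter-separated entries might be turned into an annotation. It is an illustration under assumptions, not the library's implementation: `build_annotation` is a hypothetical helper, and case-sensitive whole-word matching with `re` is a choice made for this example.

```python
import re

def build_annotation(text, entries, delimiter=","):
    """Turn {label: "tok1, tok2"} entries into (text, {"entities": [...]})."""
    entities = []
    for label, raw in entries.items():
        # Split on the delimiter; drop leading/trailing blanks and empty items.
        tokens = [tok.strip() for tok in raw.split(delimiter) if tok.strip()]
        for tok in tokens:
            for match in re.finditer(r"\b" + re.escape(tok) + r"\b", text):
                entities.append((match.start(), match.end(), label))
    return (text, {"entities": sorted(entities)})

text = "I ate an apple while reading about Apple products"
annotation = build_annotation(text, {"FRUIT": " apple ", "COMPANY": "Apple"})
print(annotation)
```

Because the user chooses which tokens go under which label, the two uses of ‘apple’ end up with different labels here.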

You can find the library on GitHub: https://github.com/ieriii/spacy-annotator. It also contains sample code so you can test it yourself.

Contributions are welcome. Please read the README.md file on GitHub.
Happy labelling!!

Changelog

  1. post edited on 18 November 2020 to reflect changes to the spacy-annotator library
  2. post edited on 22 May 2021 to reflect code refactoring to v2
