Creating a Next-Generation Financial Dataset from Scratch with NLP and Active Learning

I was delighted to be invited to give a talk at the inaugural spaCy IRL conference in Berlin, Germany. spaCy IRL was a gathering of researchers and practitioners pushing the boundaries of industrial-strength natural language processing with the spaCy software library and ecosystem.

My talk highlighted one of my group’s projects: using natural language processing and active learning to create a new kind of financial intelligence data. The goal was to assemble a dataset about companies’ environmental, social, and governance (“ESG”) practices. We used a human-in-the-loop workflow to iteratively train and validate machine learning models that detect mentions of companies’ ESG practices in publicly available documents. Check out the video to learn more about our methodology.
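For readers new to the idea, here is a minimal sketch of what one human-in-the-loop active learning loop can look like. It is not the models or tooling from our project: the tiny word-count classifier, the uncertainty-sampling rule (ask about the text the model is least sure of), and every name in it (`train`, `log_odds`, `most_uncertain`, `active_learning_loop`, `annotate`) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def train(labeled):
    """Count words per class from (text, label) pairs; label is True for an ESG mention."""
    counts = {True: Counter(), False: Counter()}
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts


def log_odds(model, text):
    """Naive Bayes-style log-odds that the text is an ESG mention.

    Simplification for the sketch: class priors are ignored and add-one
    smoothing is used so that unseen words do not produce a zero probability.
    """
    vocab_size = len(set(model[True]) | set(model[False]))
    total_pos = sum(model[True].values())
    total_neg = sum(model[False].values())
    score = 0.0
    for word in text.lower().split():
        p_pos = (model[True][word] + 1) / (total_pos + vocab_size + 1)
        p_neg = (model[False][word] + 1) / (total_neg + vocab_size + 1)
        score += math.log(p_pos / p_neg)
    return score


def most_uncertain(model, pool):
    """Uncertainty sampling: pick the text whose log-odds is closest to zero."""
    return min(pool, key=lambda text: abs(log_odds(model, text)))


def active_learning_loop(seed, pool, annotate, rounds):
    """Retrain, query the least certain text, and ask the annotator for its label."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(rounds, len(pool))):
        model = train(labeled)
        query = most_uncertain(model, pool)
        pool.remove(query)
        labeled.append((query, annotate(query)))  # the human in the loop
    return train(labeled), labeled
```

In practice `annotate` would be a person reviewing each queried passage; in a quick experiment it can be any function that takes a text and returns `True` or `False`.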

The slides from the talk are available below.

This post was originally published on