Talk

Conquering PDFs: document understanding beyond plain text

Saturday, May 31

11:05 - 11:35
Room: Tortellini
Language: English
Audience level: Intermediate
Elevator pitch

NLP and data science could be so easy if all our data came as clean and plain text. But in practice, a lot of it is hidden away in PDFs and other formats. I’m presenting a new approach for building robust document understanding systems, using state-of-the-art models and the awesome Python ecosystem.

Abstract

NLP and data science could be so easy if all of our data came as clean and plain text. But in practice, a lot of it is hidden away in PDFs, Word documents, scans and other formats that have been a nightmare to work with. In this talk, I’ll present a new and modular approach for building robust document understanding systems, using state-of-the-art models and the awesome Python ecosystem. I’ll show you how you can go from PDFs to structured data and even build fully custom information extraction pipelines for your specific use case.

For the practical examples, I’ll be using spaCy, along with the new Docling library and layout analysis models. I’ll also cover Optical Character Recognition (OCR) for image-based text, how to convert tabular data to pandas DataFrames, and strategies for creating training and evaluation data for information extraction tasks like text classification and entity recognition using PDFs and other documents as inputs.

Tags: Natural Language Processing, Data Engineering, Computer Vision
Participant

Ines Montani

Ines Montani is a developer specializing in tools for AI and NLP technology. She’s the co-founder and CEO of Explosion and a core developer of spaCy, a popular open-source library for Natural Language Processing in Python, and Prodigy, a modern annotation tool for creating training data for machine learning models.