Introduction

The Lettria parsing feature is a normalization tool that processes file bytes and generates a JSON object corresponding to the document's content. The API returns the full content as plain text, a list of the document's parts, called chunks, and other metadata that depends on the document's type.

Whether you aim to organize unstructured data, extract valuable information, or enhance interoperability through RDF exports, Lettria's parser endpoint provides a text standardization solution that is essential to all kinds of NLP projects.
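
As a rough illustration of the response described in the first paragraph (full text, chunks, metadata), the sketch below reads a hand-written sample. The key names "content", "chunks" and "metadata", the "text" and "format" fields, and the summarize helper are all assumptions made for this example; the actual field names returned by the API may differ.

```python
import json

# Hand-written sample, NOT real API output. The key names used here
# ("content", "chunks", "metadata", "text", "format") are assumptions.
SAMPLE_RESPONSE = """
{
  "content": "Title First paragraph.",
  "chunks": [
    {"text": "Title"},
    {"text": "First paragraph."}
  ],
  "metadata": {"format": "txt"}
}
"""


def summarize(raw: str) -> dict:
    """Summarize a parser-style JSON document (hypothetical helper)."""
    document = json.loads(raw)
    return {
        # Length of the full plain-text content.
        "characters": len(document.get("content", "")),
        # Number of document parts (chunks).
        "chunk_count": len(document.get("chunks", [])),
        # Names of the metadata fields, which depend on the document type.
        "metadata_keys": sorted(document.get("metadata", {})),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_RESPONSE))
```

Using .get with a default keeps the helper tolerant of missing fields, which seems prudent since the metadata varies with the document's type.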

Supported input formats:

txt, docx, odt, html, pdf, jpg, jpeg, png, webp, mp3, json, xlsx, xls, csv
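
A client may want to check a file's extension against the list above before sending it. The following is a minimal sketch of such a check; the is_supported helper and the SUPPORTED_EXTENSIONS name are hypothetical and not part of the Lettria API, and the check looks only at the file name, not at the file's actual content.

```python
from pathlib import Path

# Extensions copied from the list of supported input formats above.
SUPPORTED_EXTENSIONS = frozenset({
    "txt", "docx", "odt", "html", "pdf", "jpg", "jpeg", "png",
    "webp", "mp3", "json", "xlsx", "xls", "csv",
})


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported list.

    Hypothetical client-side helper: compares the extension
    case-insensitively and ignores the leading dot.
    """
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("report.PDF", "archive.zip", "README"):
        print(name, is_supported(name))
```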

Cleaning features

The Lettria Parser API also provides extra cleaning features:

Removing HTML tags to retrieve the text inside.
Removing page numbers.
Removing headers and footers.
Removing vertical words.
Removing text references.
Removing whitespace.
Replacing special characters.
Repairing speech-to-text (STT) errors.
Repairing superscript and subscript characters.
Rebuilding lists.
Rebuilding columns.
Separating tables and text.
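
To give an idea of what two of these cleaning steps involve, here is a minimal sketch that removes HTML tags and collapses whitespace. It is not Lettria's implementation: it uses only the Python standard library (html.parser and re) so that it stays self-contained, and the function names are chosen for this example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_tags(markup: str) -> str:
    """Return the text inside the markup with all tags removed."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any text still buffered by the parser
    # Joining with a space keeps words from adjacent elements apart,
    # at the cost of splitting inline runs such as H<sub>2</sub>O.
    return " ".join(extractor.parts)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean(markup: str) -> str:
    """Apply both steps: tag removal, then whitespace cleanup."""
    return collapse_whitespace(strip_html_tags(markup))


if __name__ == "__main__":
    print(clean("<p>Hello   <b>world</b></p>\n<p>again</p>"))
```

The other features in the list (headers and footers, columns, tables, STT repair) depend on layout or audio information and are not covered by this sketch.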

Python modules used (json, csv and re belong to the standard library):

beautifulsoup4
docx2python
pdfplumber
tabula-py
openpyxl
nnsplit
doctr
odfpy
json
csv
re

External APIs used for STT:

Allomedia
Gladia

Next step

Authentication