Document Chunk

Introduction

The Lettria Parser API splits a document's content into chunks by analyzing text indentation, fonts, text size, text style, page composition (columns, lists, etc.), tables, and specific characters. The whole document is returned as a list of chunks, each chunk corresponding to a paragraph.

Format

A document chunk has the following format:

| Key | Type | Description |
| --- | --- | --- |
| id | int | The chunk identifier. |
| type | string | The type of chunk: "text" or "table". |
| content | depends on the type key | A string for text chunks, and a list or a dictionary for table chunks (depending on the detected type of table). |
| metadata | dictionary | A dictionary containing additional information about the chunk, depending on the engine that extracted it (OCR, STT, etc.). |
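
As an illustration of the format above, here is a minimal Python sketch that walks a list of chunks and branches on the type key. Only the four key names (id, type, content, metadata) come from the format described here. The sample values, the helper name `chunk_to_text`, and the exact shape of table content (a list of rows, or a flat dictionary) are assumptions made for the example, not the documented output of the API.

```python
from typing import Any, Dict, List

# Hypothetical sample data: the keys follow the chunk format above,
# but every value here is invented for illustration.
sample_chunks: List[Dict[str, Any]] = [
    {"id": 0, "type": "text", "content": "Quarterly report", "metadata": {}},
    {
        "id": 1,
        "type": "table",
        # Assumed shape: a list of rows, each row a list of cells.
        "content": [["Region", "Sales"], ["North", "120"]],
        "metadata": {},
    },
    {
        "id": 2,
        "type": "table",
        # Assumed shape: a flat dictionary of label -> value.
        "content": {"Region": "South", "Sales": "95"},
        "metadata": {},
    },
]


def chunk_to_text(chunk: Dict[str, Any]) -> str:
    """Render one chunk as plain text, whatever its content type is."""
    content = chunk["content"]
    if chunk["type"] == "text":
        # Text chunks carry their content as a plain string.
        return content
    if chunk["type"] == "table":
        if isinstance(content, dict):
            return "\n".join(f"{key}: {value}" for key, value in content.items())
        if isinstance(content, list):
            lines = []
            for row in content:
                if isinstance(row, (list, tuple)):
                    lines.append(" | ".join(str(cell) for cell in row))
                else:
                    lines.append(str(row))
            return "\n".join(lines)
    raise ValueError(f"Unsupported chunk: type={chunk['type']!r}")


if __name__ == "__main__":
    for chunk in sample_chunks:
        print(f"--- chunk {chunk['id']} ({chunk['type']}) ---")
        print(chunk_to_text(chunk))
```

Checking the content with `isinstance` instead of assuming one table layout keeps the sketch usable for either shape, since the format only states that table content is a list or a dictionary.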

Next steps