Document Chunk

Introduction

The Lettria Parser API splits a document's content into several chunks by analyzing text indentation, fonts, text size, text style, page composition (columns, lists, etc.), tables, and specific characters. The whole document is returned as a list of chunks, each chunk corresponding to a paragraph.

Format

A document chunk has the following format:

Key	Type	Description
`id`	`int`	The chunk identifier.
`type`	`string`	The type of chunk: "text" or "table".
`content`	Depends on the `type` key	A `string` for text chunks, and a `list` or a `dictionary` for table chunks (depending on the detected type of table).
`metadata`	`dictionary`	A dictionary containing additional information about the chunk, depending on the engine that extracted the chunk (OCR, STT, etc.).
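
As an illustration of the format above, here is a minimal Python sketch of how a client might dispatch on the `type` key and on the Python type of `content`. The sample chunks, their values, and the `describe_chunk` helper are hypothetical and not part of the API. The inner layout of table content (shown here as a list of rows and as a flat dictionary) is an assumption, since the format only states that it is a `list` or a `dictionary`.

```python
from typing import Any

# Hypothetical chunks shaped like the format table above.
# All values are invented; the inner layout of table content is an assumption.
chunks: list[dict[str, Any]] = [
    {"id": 0, "type": "text", "content": "First paragraph.", "metadata": {}},
    {"id": 1, "type": "table", "content": [["a", "b"], ["1", "2"]], "metadata": {}},
    {"id": 2, "type": "table", "content": {"a": "1", "b": "2"}, "metadata": {}},
]


def describe_chunk(chunk: dict[str, Any]) -> str:
    """Return a one-line summary of a chunk, checking content against its type."""
    kind = chunk["type"]
    content = chunk["content"]

    if kind == "text":
        # Text chunks carry their content as a plain string.
        if not isinstance(content, str):
            raise TypeError("text chunk content should be a string")
        return f"chunk {chunk['id']}: text ({len(content)} chars)"

    if kind == "table":
        # Table chunks carry either a list or a dictionary.
        if isinstance(content, list):
            return f"chunk {chunk['id']}: table (list, {len(content)} items)"
        if isinstance(content, dict):
            return f"chunk {chunk['id']}: table (dict, {len(content)} keys)"
        raise TypeError("table chunk content should be a list or a dictionary")

    raise ValueError(f"unknown chunk type: {kind!r}")


summaries = [describe_chunk(c) for c in chunks]
for line in summaries:
    print(line)
```

Checking the Python type of `content` rather than assuming a single table layout keeps the client tolerant of both table representations.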

Next steps

Metadata