
Document Metadata

The Lettria Parser API returns metadata at the document level.

  • name : the original file name.

  • document_type : the input type sent in the request.

Document structure

document_structure : a dictionary representing the parent/child hierarchy between chunks. The id key of each chunk is used as the reference. The document's structure follows this format:

{
	"1" : {				# id of the parent chunk
		"children": [	# list of children chunks ids
			2,
			3
		]
	},
}
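As an illustration of how this structure can be walked, here is a minimal sketch that collects every descendant of a chunk. The function name descendants and the sample data are hypothetical. It assumes, as in the example above, that parent ids are string keys and child ids are integers.

```python
def descendants(document_structure, chunk_id):
    """Return the ids of all chunks below chunk_id, depth-first."""
    found = []
    seen = set()
    stack = [chunk_id]
    while stack:
        current = stack.pop()
        # Parent ids are string keys in the example format, children are integers.
        node = document_structure.get(str(current), {})
        for child in node.get("children", []):
            if child in seen:
                continue  # guard against malformed, cyclic input
            seen.add(child)
            found.append(child)
            stack.append(child)
    return found


# Hypothetical structure: chunk 1 has children 2 and 3, chunk 2 has child 4.
structure = {"1": {"children": [2, 3]}, "2": {"children": [4]}}
result = descendants(structure, 1)
print(result)
```

Chunks that never appear as a key are treated as leaves.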

Documents processed by speech-to-text (STT)

speakers : list of speaker information. For each element in the speakers list:

  • id : speaker identifier.

  • chunks_ids : list of chunk ids corresponding to this speaker.

  • sentence_index : list of sentence indexes (used for the connection with the Lettria Comprehension API).

  • infos : dictionary of extra information coming from the STT API used.
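The speakers list can be inverted to find out who spoke a given chunk. The sketch below is an assumption about how a client might do this, not part of the API: the function name chunk_to_speaker and the sample values are hypothetical, and only the id and chunks_ids keys described above are used.

```python
def chunk_to_speaker(speakers):
    """Map each chunk id to the id of the speaker it belongs to."""
    mapping = {}
    for speaker in speakers:
        for chunk_id in speaker.get("chunks_ids", []):
            mapping[chunk_id] = speaker["id"]
    return mapping


# Hypothetical speakers list with two speakers.
speakers = [
    {"id": 0, "chunks_ids": [1, 3]},
    {"id": 1, "chunks_ids": [2]},
]
lookup = chunk_to_speaker(speakers)
print(lookup)
```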

Speech-to-text API information:

  • allomedia_metadata : metadata from the Allomedia API.

  • gladia_metadata : metadata from the Gladia API.

Chunk Metadata

The Lettria Parser API returns metadata at the chunk level.

Format of each chunk's metadata key:

Text Chunk

  • page : index of the page in which the chunk is located.

  • words : list of dictionaries corresponding to the words within the chunk.

  • lines : indexes of the lines that the chunk covers.

Text metadata

start_index : the index in the document text where the chunk starts.

Words metadata

For each element inside the words list:

  • content : the word's text.

  • font : the word's font.

  • indexes : starting and ending indexes for this word.

Audio metadata

Information coming from STT:

  • speaker : speaker identifier for this chunk.

  • start_time : start timestamp in the original audio file.

  • end_time : end timestamp in the original audio file.
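These three fields are enough to compute how long each speaker talks. The following is a rough sketch under stated assumptions: start_time and end_time are taken to be numbers in the same unit (the unit is not specified here), each chunk's metadata is passed as a flat dictionary, and the function name talk_time_per_speaker is hypothetical.

```python
from collections import defaultdict


def talk_time_per_speaker(chunks_metadata):
    """Sum end_time - start_time per speaker over a list of chunk metadata."""
    totals = defaultdict(float)
    for meta in chunks_metadata:
        # Skip chunks that carry no audio metadata.
        if "speaker" not in meta:
            continue
        totals[meta["speaker"]] += meta["end_time"] - meta["start_time"]
    return dict(totals)


# Hypothetical metadata for three audio chunks and one non-audio chunk.
sample = [
    {"speaker": 0, "start_time": 0.0, "end_time": 2.5},
    {"speaker": 1, "start_time": 2.5, "end_time": 4.0},
    {"speaker": 0, "start_time": 4.0, "end_time": 5.0},
    {"page": 1},
]
totals = talk_time_per_speaker(sample)
print(totals)
```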

Geometry metadata

Information coming from OCR or PDF extraction:

For each element inside the words list:

  • left : left position of the word.

  • right : right position of the word.

  • top : top position of the word.

  • bottom : bottom position of the word.

  • width : difference between left and right.

  • height : difference between top and bottom.
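The per-word positions can be combined into a bounding box for a whole chunk. This is a minimal sketch, not something the API provides: the function name bounding_box is hypothetical, and it assumes coordinates where left is less than right and top is less than bottom (origin at the top-left of the page), which is not stated here.

```python
def bounding_box(words):
    """Return the smallest box that contains every word in the list."""
    left = min(word["left"] for word in words)
    right = max(word["right"] for word in words)
    # Assumes the vertical axis grows downward, so top < bottom.
    top = min(word["top"] for word in words)
    bottom = max(word["bottom"] for word in words)
    return {
        "left": left,
        "right": right,
        "top": top,
        "bottom": bottom,
        "width": right - left,
        "height": bottom - top,
    }


# Hypothetical geometry for two words on the same line.
words = [
    {"left": 10, "right": 40, "top": 100, "bottom": 112},
    {"left": 45, "right": 90, "top": 98, "bottom": 112},
]
box = bounding_box(words)
print(box)
```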

List metadata

For a chunk that is part of a list:

is_list : True if the chunk is a list element.

HTML metadata

For a chunk extracted from HTML:

html_tags : list of the HTML tags removed from this chunk.

Table Chunk

  • page : index of the page in which the chunk is located.

  • sub_type : table sub type.

  • name : table name (if found).

  • left : table left position.

  • right : table right position.

  • top : table top position.

  • bottom : table bottom position.

  • width : difference between left and right.

  • height : difference between top and bottom.

  • module : Python module used for table extraction (tabula or pdfplumber).

Table Sub Types

The Lettria Parser API analyses the cells of each table to determine what kind of table it is. The content of the table is processed and returned differently depending on the detected table sub type.

  • Key value Table : a two-column table in which the first column holds the keys and the second column holds the values. The table's content is returned as a list of dictionaries {key: value}, one per line.

  • Title Line Table : a table in which the first line contains the column names. The table's content is returned as a list of dictionaries, one per line (except the first one). The dictionary keys are taken from the cells of the first line.

  • Title Column Table : a table in which the first column contains the line names. The table's content is returned as a list of dictionaries, one per column (except the first one). The dictionary keys are taken from the cells of the first column.

  • Double Entry Table : a table in which the first line and the first column contain the cell names. The table's content is returned as a list of dictionaries, one per cell (except the cells in the first column and the first line). The dictionary keys are taken from the cells of the first column and the first line.

  • Double Entry Table Special : same as above, but the cells of the first line and the cells of the first column are the same. The cells on the diagonal are therefore skipped.

  • Other Table (Default) : the default table sub type, used when no other sub type has been found. The table's content is simply returned as a list of lines, each line being a list of dictionaries, one per cell in the line.
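To make the first three output shapes concrete, the sketch below rebuilds them from a raw grid of cells. It only illustrates the shapes described above and is not the API's implementation: the function names, the list-of-lists input, and the sample values are all hypothetical.

```python
def key_value_records(rows):
    """Key value Table: one {key: value} dictionary per line."""
    return [{key: value} for key, value in rows]


def title_line_records(rows):
    """Title Line Table: one dictionary per line, keyed by the first line."""
    header, *body = rows
    return [dict(zip(header, line)) for line in body]


def title_column_records(rows):
    """Title Column Table: one dictionary per column, keyed by the first column."""
    # Transpose so that columns become lines, then reuse the title-line logic.
    columns = [list(column) for column in zip(*rows)]
    return title_line_records(columns)


# Hypothetical grids, as lists of lines of cell texts.
by_line = [["name", "age"], ["Ada", "36"], ["Alan", "41"]]
by_column = [["name", "Ada", "Alan"], ["age", "36", "41"]]

print(key_value_records([["name", "Ada"], ["age", "36"]]))
print(title_line_records(by_line))
print(title_column_records(by_column))
```

The two sample grids hold the same data in different orientations, so the last two calls produce the same list of dictionaries.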