🧱 Document Structures

Document structure is another important consideration when building a document understanding system. A document's structure is the way its information is organized and presented.

Depending on a document's structure and complexity, different approaches and techniques may be required to extract information from it.

In our context, there are three main types of document structures:

Document Structures

Structured

Key characteristics
  • Fixed page format

  • The layout identifies where and what to enter

  • Areas for data entry are clearly defined and labeled (e.g. text boxes, checkboxes)

  • Fields have a one-to-one mapping with values (e.g. Account Number)

Examples
  • Tax forms

  • Identification cards

  • Application forms
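
Because a structured document has a fixed page format, one common approach is to read each field from a known region of the page. The sketch below illustrates that idea only. The `Word` type, the `TEMPLATE` coordinates, and the sample words are all hypothetical stand-ins for whatever an OCR step would produce, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and its bounding box (hypothetical OCR output)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


# Hypothetical template: field name -> (left, top, right, bottom) zone on the page.
TEMPLATE = {
    "account_number": (100, 50, 300, 80),
    "full_name": (100, 100, 400, 130),
}


def extract_fields(words, template):
    """Assign each word to the zone that contains its center point."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [
            word for word in words
            if left <= word.x + word.w / 2 <= right
            and top <= word.y + word.h / 2 <= bottom
        ]
        # Restore left-to-right reading order within the zone.
        inside.sort(key=lambda word: word.x)
        fields[name] = " ".join(word.text for word in inside)
    return fields


# Sample words in arbitrary order, as an OCR engine might return them.
words = [
    Word("Doe", 170, 105, 40, 15),
    Word("12345678", 110, 55, 90, 15),
    Word("Jane", 110, 105, 50, 15),
    Word("Page", 500, 700, 40, 15),  # falls outside every zone
]

result = extract_fields(words, TEMPLATE)
print(result)  # {'account_number': '12345678', 'full_name': 'Jane Doe'}
```

This works only as long as every page follows the same layout, which is why the approach tends not to carry over to the other two document types.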

Semi-structured

Key characteristics
  • No fixed page format

  • Information is usually grouped in a logical manner

Examples
  • Invoices

  • Receipts

  • Purchase orders
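
Since a semi-structured document has no fixed page format, positions cannot be relied on, but labels that sit next to their values often can. The following is a minimal sketch of label-anchored extraction using regular expressions. The label patterns in `FIELD_LABELS` and both sample invoices are made up for illustration, and a real system would likely need many more label variants or a learned model.

```python
import re

# Hypothetical label variants for each field of interest.
FIELD_LABELS = {
    "invoice_number": r"invoice\s*(?:no\.?|number|#)",
    "total": r"total\s+due|amount\s+due|total",
}


def extract_by_label(text, field_labels):
    """Return the text following the first matching label for each field."""
    fields = {}
    for name, label in field_labels.items():
        # Label, optional ':' or '-' separator, then the rest of the line as the value.
        pattern = re.compile(rf"\b(?:{label})\s*[:\-]?\s*(\S.*)", re.IGNORECASE)
        for line in text.splitlines():
            match = pattern.search(line)
            if match:
                fields[name] = match.group(1).strip()
                break
    return fields


# Two invoices with the same information in different layouts.
invoice_a = "ACME Corp\nInvoice No: INV-001\nWidgets 2 x 5.00\nTotal: 10.00"
invoice_b = "Total due 42.50\nThanks for your business\nInvoice # B-77"

print(extract_by_label(invoice_a, FIELD_LABELS))
# {'invoice_number': 'INV-001', 'total': '10.00'}
print(extract_by_label(invoice_b, FIELD_LABELS))
# {'invoice_number': 'B-77', 'total': '42.50'}
```

The same function handles both layouts because it anchors on the label rather than on a page position.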

Unstructured

Key characteristics
  • Little to no organization

  • Continuous, verbose, text-heavy content

  • Information is communicated in sentences or paragraphs

  • Difficult for non-subject-matter experts to read and understand

Examples
  • Contracts

  • Legal documents

  • Medical records
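
With unstructured documents there are no zones or labels to anchor on, so the unit of extraction becomes the sentence or paragraph. As a deliberately naive baseline, the sketch below splits text into sentences and keeps those that mention a keyword. The function name `find_sentences` and the sample contract text are hypothetical, and in practice this kind of content usually calls for language models rather than keyword matching.

```python
import re


def find_sentences(text, keywords):
    """Return the sentences that mention any keyword (case-insensitive)."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lowered = [keyword.lower() for keyword in keywords]
    return [
        sentence for sentence in sentences
        if any(keyword in sentence.lower() for keyword in lowered)
    ]


# Made-up contract excerpt for illustration.
contract = (
    "This Agreement begins on the Effective Date. "
    "Either party may terminate this Agreement with thirty days written notice. "
    "Fees are payable within fifteen days of invoice."
)

hits = find_sentences(contract, ["terminate"])
print(hits)
# ['Either party may terminate this Agreement with thirty days written notice.']
```

Note that the relevant sentence still has to be interpreted (who may terminate, and on what notice), which is the part that makes unstructured documents hard.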