📂 Document File Formats

Documents can come in a multitude of electronic file formats. The following is a list of the most common file types for document understanding systems:

  • Portable Document Format (.pdf) - file format developed by Adobe that ensures documents can be viewed and printed consistently across different devices and platforms. PDFs preserve the formatting, fonts, and graphics of the original document, making them widely used for sharing and distributing files electronically.

  • Joint Photographic Experts Group (.jpeg/.jpg) - a widely used image compression standard developed by the committee of the same name. JPEG uses lossy compression, trading some image detail for much smaller files while keeping quality acceptable for most viewing purposes. It is commonly used for sharing and displaying photographs on the web and in various digital media.

  • Portable Network Graphics (.png) - image file format known for its lossless compression and support for transparent backgrounds. It is widely used on the web for graphics and illustrations due to its ability to maintain high image quality while minimizing file size. PNG files are commonly used for logos, icons, and images that require crisp edges and transparency.

  • Tagged Image File Format (.tiff/.tif) - file format for storing and exchanging raster graphics images. It supports lossless compression and can store high-quality images at multiple color depths, with several pages in a single file. TIFF files are commonly used in professional applications such as photography, printing, and graphic design.

  • Text (.txt) - commonly known as plain text, a simple and universal format for storing textual data. It contains unformatted text and can be opened and edited by a wide range of software applications, making it highly versatile for storing and sharing information. .txt files are commonly used for note-taking, scripting, and storing textual data that does not require complex formatting.

  • HyperText Markup Language (.html) - the standard markup language used for creating and structuring web pages. It defines the structure and layout of content on a webpage using tags and elements, such as headings, paragraphs, links, and images. HTML is the backbone of the World Wide Web and is interpreted by web browsers to render and display web pages to users.

  • Office Open XML Document (.docx) - file format used for storing and exchanging documents created by Microsoft Word and other word processing applications. A .docx file is a ZIP archive of XML (eXtensible Markup Language) parts that define the document's content, formatting, and layout. Compared with the older binary .doc format, it offers improved compatibility, smaller file sizes, and support for advanced features like multimedia elements and custom document properties.

  • Excel Open XML Spreadsheet (.xlsx) - file format used by Microsoft Excel and other spreadsheet software for storing and manipulating spreadsheet data. It is based on the Open XML standard and contains structured data that defines worksheets, formulas, formatting, and other spreadsheet elements. Compared with the older binary .xls format, XLSX files offer improved compatibility, smaller file sizes, and support for advanced features like conditional formatting and data validation. Macros are not stored in .xlsx files; workbooks that contain macros use the macro-enabled .xlsm format instead.
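
A document understanding system usually has to work out which of these formats it has received before it can choose a parser. The following is a minimal sketch of one way to do that in Python, assuming the caller can supply the file name and, optionally, its first few bytes. The names `detect_format`, `MAGIC_SIGNATURES`, and `EXTENSION_FORMATS` are hypothetical and chosen only for illustration. The sketch checks well-known leading byte signatures first and falls back to the file extension for text-based formats, which have no fixed signature.

```python
from pathlib import Path

# Leading "magic" bytes of the binary formats listed above.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tiff"),    # little-endian TIFF
    (b"MM\x00*", "tiff"),    # big-endian TIFF
    (b"PK\x03\x04", "zip"),  # .docx and .xlsx are ZIP containers
]

# Fallback mapping from file extension to format name.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".png": "png",
    ".tif": "tiff",
    ".tiff": "tiff",
    ".txt": "text",
    ".html": "html",
    ".htm": "html",
    ".docx": "docx",
    ".xlsx": "xlsx",
}


def detect_format(filename: str, header: bytes = b"") -> str:
    """Guess a document's format from its leading bytes and file name.

    `header` holds the first bytes of the file; pass b"" to rely on the
    extension alone. Returns "unknown" when nothing matches.
    """
    ext_format = EXTENSION_FORMATS.get(Path(filename).suffix.lower(), "unknown")
    for signature, fmt in MAGIC_SIGNATURES:
        if header.startswith(signature):
            if fmt == "zip":
                # The ZIP signature alone cannot tell .docx from .xlsx,
                # so use the extension to disambiguate.
                return ext_format if ext_format in ("docx", "xlsx") else "zip"
            return fmt
    # Text-based formats (.txt, .html) have no fixed signature.
    return ext_format
```

In practice the header would come from something like `open(path, "rb").read(8)`. Preferring the byte signature over the extension means a JPEG that was saved with a `.png` name is still routed to the right decoder.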