Extracting document metainfo

This demonstration shows and explains how to use the get_document_info() method, which extracts meta info from a document.

Introduction

In some situations you need to read the meta info of a document before actually editing it. For example, a user wants to edit the last tab of a multi-tabbed spreadsheet but doesn’t know how many tabs the document contains, or it is unclear whether the document is password-protected. For such situations GroupDocs.Editor provides the get_document_info() method, which returns detailed meta info (metadata) about the specified document.

Using the method

To read the meta info of a document, first load the document into the Editor class, then call get_document_info(). This method accepts one optional parameter: the password as a string. If the document is password-protected and the user knows the password, it can be specified here; otherwise the password can be omitted. The code example below demonstrates the usage:

from groupdocs.editor import Editor

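with Editor("document.docx") as editor:
    # No password: sufficient when the document is not password-protected
    info_without_password = editor.get_document_info()
    # With a password: it is ignored if the document is not protected
    info_with_password = editor.get_document_info(password="password")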

Several scenarios are possible, depending on whether the document is password-protected and whether the user specified a password:

  1. If a password is specified, but the document is not password-protected, or the document format doesn’t support encryption at all, the password is ignored.
  2. If the document is password-protected, but a password is not specified, a PasswordRequiredException is raised while calling get_document_info().
  3. If the document is password-protected, and a password is specified, but it is incorrect, an IncorrectPasswordException is raised while calling get_document_info().
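
The scenarios above can be handled in one place. The following is only a minimal sketch, not part of the GroupDocs.Editor API: try_get_document_info is a hypothetical helper, and because this page does not show where the two exception classes are imported from, the sketch matches them by class name instead of assuming an import path.

```python
def try_get_document_info(editor, password=None):
    """Return (info, status) for an already opened Editor instance.

    status is "ok", "password_required" or "incorrect_password".
    Hypothetical helper for illustration only.
    """
    try:
        if password is None:
            info = editor.get_document_info()
        else:
            info = editor.get_document_info(password=password)
        return info, "ok"
    except Exception as exc:
        # Assumption: the exceptions are recognized by class name, because
        # their import path is not shown in this article.
        name = type(exc).__name__
        if name == "PasswordRequiredException":
            return None, "password_required"
        if name == "IncorrectPasswordException":
            return None, "incorrect_password"
        # Anything else is unrelated to passwords, so let it propagate
        raise
```

Inside a with Editor(...) block, a call such as try_get_document_info(editor, password="password") returns the metadata together with "ok" on success, or None together with a status string that the caller can use to ask the user for a (different) password.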

Explaining the resulting type

The get_document_info() method returns a lightweight view of the document metadata. It supports snake_case attribute access as well as dict-style access for the underlying PascalCase keys. It contains the following properties:

  1. page_count. A positive number: the page count for WordProcessing, PDF and XPS documents, the tab (worksheet) count for Spreadsheets, the slide count for Presentations, and 1 for pageless documents like XML or TXT.
  2. size. The document size in bytes.
  3. is_encrypted. A boolean flag that indicates whether the document is encrypted with a password or not. If the document is of a type that doesn’t support encryption at all, like CSV or XML, this property always returns False.
  4. format. Returns info about the format itself.

Internally, GroupDocs.Editor provides a dedicated metadata type for every format family, all of which expose the four properties above:

  1. WordProcessingDocumentInfo — common for all WordProcessing family formats.
  2. SpreadsheetDocumentInfo — common for all Spreadsheet family formats.
  3. PresentationDocumentInfo — common for all Presentation family formats.
  4. TextualDocumentInfo — common for all textual types, including all DSV (like CSV and TSV), XML, HTML, and plain text.
  5. FixedLayoutDocumentInfo — common for all documents with a fixed-layout format, this includes only PDF and XPS.
  6. EmailDocumentInfo — common for all Email family formats, like EML, MSG, VCF, PST, MBOX and others.
  7. EbookDocumentInfo — common for all eBook family formats like MOBI and ePub.
  8. MarkdownDocumentInfo — a special type dedicated especially to the Markdown (MD) textual format.

One important thing to note: if get_document_info() returns a None value, this means that the specified document is not supported by GroupDocs.Editor and thus cannot be opened for editing or saved.
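
Because the result can be None, it is worth checking it before reading any property. The sketch below is illustrative only: summarize_info is a hypothetical helper that relies solely on the four properties listed above plus format.name, and the SimpleNamespace object is a stand-in for a real result so that the snippet runs without a document.

```python
from types import SimpleNamespace

def summarize_info(info):
    """Return a plain dict of metadata, or None for an unsupported document."""
    if info is None:
        # get_document_info() returned None: the document is not supported
        return None
    return {
        "format": info.format.name,
        "page_count": info.page_count,
        "size_kib": round(info.size / 1024, 1),
        "is_encrypted": info.is_encrypted,
    }

# Stand-in for a real result (hypothetical object, not a library type)
sample = SimpleNamespace(
    format=SimpleNamespace(name="DOCX"),
    page_count=3,
    size=49455,
    is_encrypted=False,
)
print(summarize_info(sample))
print(summarize_info(None))
```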

Explaining the document format

The metadata view contains a format property. The format descriptor identifies one particular document format and stores the format name, extension, and MIME type. It exposes the following properties:

  1. name. The name of the format.
  2. extension. The file extension of the format.
  3. mime. The MIME type of the format.
  4. format_family. The format family that the format belongs to.
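
As a small illustration of how these properties can be combined, the sketch below builds a one-line label from a format descriptor. describe_format is a hypothetical helper, and the SimpleNamespace object only imitates a descriptor so that the snippet runs on its own; the sketch also assumes the extension may or may not carry a leading dot and normalizes it.

```python
from types import SimpleNamespace

def describe_format(fmt):
    """Build a one-line label from a format descriptor (illustrative helper)."""
    # Normalize the extension so the label always shows exactly one leading dot
    extension = "." + fmt.extension.lstrip(".")
    return f"{fmt.name} [{extension}, {fmt.mime}]"

# Stand-in descriptor (hypothetical object, not a library type)
docx = SimpleNamespace(
    name="DOCX",
    extension="docx",
    mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
)
print(describe_format(docx))
```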

The format descriptors are grouped by family:

  1. WordProcessingFormats — holds all formats from the WordProcessing family.
  2. SpreadsheetFormats — holds all formats from the Spreadsheet family.
  3. PresentationFormats — holds all formats from the Presentation family.
  4. TextualFormats — holds all formats with a text-based nature.
  5. FixedLayoutFormats — holds all formats from the fixed-layout family. This includes only PDF and XPS.
  6. EBookFormats — holds all eBook (Electronic book) formats like Mobi and ePub.
  7. EmailFormats — holds all email (electronic mail) formats like EML and MSG.

Complete code example

The example below loads a document, reads its metadata without performing a full edit pass, and prints the most useful fields.

import os
from groupdocs.editor import Editor, License

def extracting_document_metainfo():
    # Optionally set a license
    license_path = os.path.abspath("./GroupDocs.Editor.lic")
    if os.path.exists(license_path):
        License().set_license(license_path)

    # Load the document and read its metadata
    with Editor("./sample-document.docx") as editor:
        info = editor.get_document_info()

        print("Format:", info.format.name)
        print("Extension:", info.format.extension)
        print("MIME:", info.format.mime)
        print("Pages:", info.page_count)
        print("Size, bytes:", info.size)
        print("Encrypted:", info.is_encrypted)

if __name__ == "__main__":
    extracting_document_metainfo()

sample-document.docx is the sample file used in this example. Running the example prints the following output:

Format: Office Open XML WordProcessingML Macro-Free Document (DOCX)
Extension: docx
MIME: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Pages: 3
Size, bytes: 49455
Encrypted: False
