Extracting document metainfo

Introduction

In some situations, you need to read metadata from a document before actually editing it. For example, a user wants to edit the last tab of a multi-tabbed spreadsheet but does not know how many tabs the document contains, or it is unclear whether the document is password-protected. For such situations, GroupDocs.Editor provides the get_document_info() method, which returns detailed metadata about the specified document.

Using the method

To read metadata from a document, first load it into the Editor class, then call get_document_info(). This method accepts one optional parameter: the password as a string. If the document is password-protected and the user knows the password, it can be specified here; otherwise, the password can be omitted. The code example below demonstrates the usage:

from groupdocs.editor import Editor

with Editor("document.docx") as editor:
    info_without_password = editor.get_document_info()
    info_with_password = editor.get_document_info(password="password")

Several scenarios are possible, depending on whether the document is password-protected and whether a password was specified:

If a password is specified, but the document is not password-protected, or the document format doesn’t support encryption at all, the password is ignored.
If the document is password-protected, but no password is specified, a PasswordRequiredException is raised when get_document_info() is called.
If the document is password-protected, and a password is specified, but it is incorrect, an IncorrectPasswordException is raised when get_document_info() is called.
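
The three scenarios above can be handled in one place. The sketch below is illustrative only: read_info is a hypothetical helper, and because this page does not state where the two exception classes are imported from, it matches them by class name instead of importing them. The editor argument is assumed to be any object that exposes get_document_info() as described above.

```python
def read_info(editor, password=None):
    """Return (info, status) for an already opened editor object.

    Hypothetical helper: exceptions are matched by class name because the
    import path of the exception classes is an unknown here.
    """
    try:
        if password is None:
            info = editor.get_document_info()
        else:
            info = editor.get_document_info(password=password)
    except Exception as exc:
        name = type(exc).__name__
        if name == "PasswordRequiredException":
            return None, "password required"
        if name == "IncorrectPasswordException":
            return None, "incorrect password"
        raise  # anything else is unexpected, so let it propagate
    return info, "ok"
```

A caller could use the returned status to decide whether to prompt the user for a password and retry.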

Explaining the resulting type

The get_document_info() method returns a lightweight view of the document metadata. It supports snake_case attribute access as well as dict-style access for the underlying PascalCase keys. It contains the following properties:

page_count. A positive number: the page count for WordProcessing, PDF and XPS documents, the tab (worksheet) count for Spreadsheets, the slide count for Presentations, and 1 for pageless documents like XML or TXT.
size. The document size in bytes.
is_encrypted. A boolean flag that indicates whether the document is encrypted with a password or not. If the document is of a type that doesn’t support encryption at all, like CSV or XML, this property always returns False.
format. Returns info about the format itself.
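
As an illustration of the use case from the Introduction, page_count can be turned into the position of the last tab of a spreadsheet. This is a minimal sketch under stated assumptions: last_page_index is a hypothetical helper, the zero-based indexing is an assumption, and a SimpleNamespace stands in for the object returned by get_document_info().

```python
from types import SimpleNamespace

def last_page_index(info):
    """Zero-based index of the last page, tab, or slide (indexing is assumed)."""
    if info.page_count < 1:
        raise ValueError("page_count is expected to be a positive number")
    return info.page_count - 1

# Stand-in for the real metadata of a three-tab spreadsheet
fake_info = SimpleNamespace(page_count=3, size=49455, is_encrypted=False)
print(last_page_index(fake_info))  # 2
```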

Internally, GroupDocs.Editor provides a dedicated metadata type for every format family; all of them expose the four properties above:

WordProcessingDocumentInfo — common for all WordProcessing family formats.
SpreadsheetDocumentInfo — common for all Spreadsheet family formats.
PresentationDocumentInfo — common for all Presentation family formats.
TextualDocumentInfo — common for all textual types, including all DSV (like CSV and TSV), XML, HTML, and plain text.
FixedLayoutDocumentInfo — common for all fixed-layout formats, which include only PDF and XPS.
EmailDocumentInfo — common for all Email family formats, like EML, MSG, VCF, PST, MBOX and others.
EbookDocumentInfo — common for all eBook family formats like MOBI and ePub.
MarkdownDocumentInfo — a special type dedicated especially to the Markdown (MD) textual format.

Note that if get_document_info() returns None, the specified document is not supported by GroupDocs.Editor and thus cannot be opened for editing or saved.
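
The None case can be checked once, before any editing starts. The guard below is only a sketch: require_supported is a hypothetical name, and raising ValueError is one possible choice rather than anything prescribed by the library.

```python
def require_supported(info, path):
    """Return info unchanged, or raise if the document is unsupported."""
    if info is None:
        # Per the note above, None means the document cannot be edited or saved
        raise ValueError(f"{path} is not supported and cannot be edited")
    return info
```

It could wrap the call directly, for example require_supported(editor.get_document_info(), "document.docx").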

Explaining the document format

The metadata view contains a format property. The format descriptor identifies one particular document format and stores the format name, extension, and MIME type. It exposes the following properties:

name — provides the name of the format.
extension — provides the format extension.
mime — provides the MIME type for the particular format.
format_family — provides the format family the format belongs to.
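
The descriptor properties can be combined into a single readable label. This is a small sketch: describe_format is a hypothetical helper, and a SimpleNamespace filled with the DOCX values from the sample output on this page stands in for a real format descriptor.

```python
from types import SimpleNamespace

def describe_format(fmt):
    """Build a one-line label from a format descriptor."""
    # lstrip tolerates an extension given with or without a leading dot
    return f"{fmt.name} (.{fmt.extension.lstrip('.')}, {fmt.mime})"

# Stand-in descriptor using the DOCX values from the sample output
docx = SimpleNamespace(
    name="Office Open XML WordProcessingML Macro-Free Document (DOCX)",
    extension="docx",
    mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
)
print(describe_format(docx))
```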

The format descriptors are grouped by family:

WordProcessingFormats — holds all formats from the WordProcessing family.
SpreadsheetFormats — holds all formats from the Spreadsheet family.
PresentationFormats — holds all formats from the Presentation family.
TextualFormats — holds all formats with a text-based nature.
FixedLayoutFormats — holds all formats from the fixed-layout family. This includes only PDF and XPS.
EBookFormats — holds all eBook (Electronic book) formats like Mobi and ePub.
EmailFormats — holds all email (electronic mail) formats like EML and MSG.

Complete code example

The example below loads a document, reads its metadata without performing a full edit pass, and prints the most useful fields.

extracting_document_metainfo.py

import os
from groupdocs.editor import Editor, License

def extracting_document_metainfo():
    # Optionally set a license
    license_path = os.path.abspath("./GroupDocs.Editor.lic")
    if os.path.exists(license_path):
        License().set_license(license_path)

    # Load the document and read its metadata
    with Editor("./sample-document.docx") as editor:
        info = editor.get_document_info()

        print("Format:", info.format.name)
        print("Extension:", info.format.extension)
        print("MIME:", info.format.mime)
        print("Pages:", info.page_count)
        print("Size, bytes:", info.size)
        print("Encrypted:", info.is_encrypted)

if __name__ == "__main__":
    extracting_document_metainfo()

sample-document.docx

sample-document.docx is the sample file used in this example.

extracting-document-metainfo.txt

Format: Office Open XML WordProcessingML Macro-Free Document (DOCX)
Extension: docx
MIME: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Pages: 3
Size, bytes: 49455
Encrypted: False
