This demonstration shows and explains how to use the get_document_info() method, which extracts metadata from a document.
Introduction
In some situations it is necessary to read a document's metadata before actually editing it. For example, a user may want to edit the last tab of a multi-tab spreadsheet without knowing how many tabs the document contains, or may not know whether the document is password-protected. For such situations, GroupDocs.Editor provides the get_document_info() method, which returns detailed metadata about the specified document.
Using the method
To read the metadata of a document, first load it into the Editor class, then call get_document_info(). This method accepts one optional parameter: the password as a string. If the document is encrypted and the user knows the password, it can be specified here. In all other cases the password can be omitted. The complete code example later on this page demonstrates the usage.
Several scenarios are possible, depending on whether the document is encrypted and whether the user specified a password:
If a password is specified but the document is not password-protected, or the document format does not support encryption at all, the password is ignored.
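A minimal sketch of the call with and without a password is shown below. The helper name read_document_info and the FakeEditor class are hypothetical; the stand-in exists only so that the snippet runs without the library installed. With the real library, editor would be the loaded Editor instance. The sketch assumes the password is passed as the single positional argument, as the description above suggests.

```python
from types import SimpleNamespace


def read_document_info(editor, password=None):
    """Call get_document_info(), passing the password only when one is given."""
    if password is None:
        return editor.get_document_info()
    return editor.get_document_info(password)


class FakeEditor:
    """Hypothetical stand-in for Editor, used only to make this sketch runnable."""

    def __init__(self):
        self.received_password = None

    def get_document_info(self, password=None):
        # Remember what was passed so the behaviour can be inspected
        self.received_password = password
        return SimpleNamespace(page_count=3, size=49455, is_encrypted=False)


editor = FakeEditor()
info = read_document_info(editor)                  # no password supplied
info_again = read_document_info(editor, "s3cret")  # password supplied
print(info.page_count, editor.received_password)
```

Omitting the argument entirely when no password is known avoids relying on how the real method treats a None value, which this page does not specify.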
The get_document_info() method returns a lightweight view of the document metadata. It supports snake_case attribute access as well as dict-style access for the underlying PascalCase keys. It exposes the following properties:
page_count. A positive number: the page count for WordProcessing, PDF, and XPS documents, the tab (worksheet) count for Spreadsheets, the slide count for Presentations, and 1 for pageless documents such as XML or TXT.
size. The document size in bytes.
is_encrypted. A boolean flag that indicates whether the document is encrypted with a password or not. If the document is of a type that doesn’t support encryption at all, like CSV or XML, this property always returns False.
format. Returns info about the format itself.
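The four properties above can be combined into a small pre-editing check. The sketch below is illustrative only: the summarize helper is hypothetical, the SimpleNamespace object merely imitates the shape of the returned metadata, and the zero-based last_index is an assumption about how pages, worksheets, or slides are addressed later.

```python
from types import SimpleNamespace


def summarize(info):
    """Collect the four documented properties into a plain dict."""
    return {
        "format": info.format.name,
        "page_count": info.page_count,
        "size_kib": round(info.size / 1024, 1),
        "needs_password": bool(info.is_encrypted),
        # Assumption: pages/worksheets/slides are addressed from zero
        "last_index": info.page_count - 1,
    }


# Hypothetical stand-in for the object returned by get_document_info()
sample = SimpleNamespace(
    format=SimpleNamespace(name="DOCX"),
    page_count=3,
    size=49455,
    is_encrypted=False,
)
print(summarize(sample))
```

For a spreadsheet, last_index would point at the last worksheet, which covers the "edit the last tab" case from the introduction.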
Internally, GroupDocs.Editor provides a dedicated metadata type for every format family, all of which expose the four properties above:
TextualDocumentInfo — common for all textual types, including all DSV (like CSV and TSV), XML, HTML, and plain text.
FixedLayoutDocumentInfo — common for all documents with a fixed-layout format; this includes only PDF and XPS.
EmailDocumentInfo — common for all Email family formats, like EML, MSG, VCF, PST, MBOX and others.
EbookDocumentInfo — common for all eBook family formats like MOBI and ePub.
MarkdownDocumentInfo — a dedicated type for the Markdown (MD) textual format.
One important thing to note: if get_document_info() returns None, the specified document is not supported by GroupDocs.Editor and thus cannot be opened for editing or saved.
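A guard for the None case might look like the sketch below. The require_supported helper and the file name are hypothetical; raising ValueError is a design choice of this sketch, not behaviour of the library.

```python
def require_supported(info, path):
    """Return info unchanged, or raise if get_document_info() gave None."""
    if info is None:
        raise ValueError(f"{path!r} is not supported by GroupDocs.Editor")
    return info


# Usage with a placeholder value: None models an unsupported document
try:
    require_supported(None, "archive.bin")
except ValueError as error:
    print("Skipping:", error)
```

Failing early this way keeps the later attribute accesses, such as info.page_count, from raising a less descriptive AttributeError.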
Explaining the document format
The metadata view contains a format property. The format descriptor identifies one particular document format and stores the format name, extension, and MIME type. It exposes the following properties:
name — provides the name of the format.
extension — provides the format extension.
mime — provides the MIME type for the particular format.
format_family — provides the format family that the format belongs to. The families include:
TextualFormats — holds all formats with a text-based nature.
FixedLayoutFormats — holds all formats from the fixed-layout family. This includes only PDF and XPS.
EBookFormats — holds all eBook (Electronic book) formats like Mobi and ePub.
EmailFormats — holds all email (electronic mail) formats like EML and MSG.
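As a rough illustration of how the descriptor properties can be used, the sketch below builds a one-line description and checks whether a file name agrees with the detected format. Both helper names are hypothetical, the descriptor is imitated with SimpleNamespace, and the code assumes that name, extension, and mime are plain strings. A leading dot on the extension is tolerated either way.

```python
from pathlib import Path
from types import SimpleNamespace


def describe_format(fmt):
    """Build a short human-readable label from a format descriptor."""
    ext = fmt.extension.lstrip(".")
    return f"{fmt.name} (.{ext}, {fmt.mime})"


def matches_extension(path, fmt):
    """Check whether the file name suffix agrees with the detected format."""
    suffix = Path(path).suffix.lstrip(".").lower()
    return suffix == fmt.extension.lstrip(".").lower()


# Hypothetical stand-in for info.format
docx = SimpleNamespace(
    name="DOCX",
    extension="docx",
    mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
)
print(describe_format(docx))
print(matches_extension("./sample-document.docx", docx))
print(matches_extension("./report.pdf", docx))
```

A mismatch between the file name and the detected format can be a hint that a file was renamed, which is worth surfacing before editing.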
Complete code example
The example below loads a document, reads its metadata without performing a full edit pass, and prints the most useful fields.
import os

from groupdocs.editor import Editor, License


def extracting_document_metainfo():
    # Optionally set a license
    license_path = os.path.abspath("./GroupDocs.Editor.lic")
    if os.path.exists(license_path):
        License().set_license(license_path)

    # Load the document and read its metadata
    with Editor("./sample-document.docx") as editor:
        info = editor.get_document_info()

        print("Format:", info.format.name)
        print("Extension:", info.format.extension)
        print("MIME:", info.format.mime)
        print("Pages:", info.page_count)
        print("Size, bytes:", info.size)
        print("Encrypted:", info.is_encrypted)


if __name__ == "__main__":
    extracting_document_metainfo()
sample-document.docx is the sample file used in this example.
Format: Office Open XML WordProcessingML Macro-Free Document (DOCX)
Extension: docx
MIME: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Pages: 3
Size, bytes: 49455
Encrypted: False