Convert Files Within Document Containers

This topic covers how to convert files embedded within document containers, such as compressed or packaged files, into individual output files. The following diagram illustrates the process of extracting and converting files within a document container:

flowchart LR
    %% Nodes
    A["Document Container"]
    B["Extraction"]
    C["Conversion"]
    D["Converted File 1"]
    E["Converted File 2"]
    F["Converted File N"]

    %% Edge connections between nodes
    A --> B --> C --> D
    C --> E
    C --> F

The Extraction and Conversion processes are performed within a single call to the convert_multiple(folder_path, convert_options) method of the Converter class.
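To make the two stages in the diagram concrete, here is a conceptual sketch that uses only the Python standard library. It is not how the Converter class is implemented. The names extract_and_convert, build_sample_container and fake_convert are hypothetical, and the "conversion" is a placeholder that upper-cases the bytes. It only illustrates the fan-out from one container to one output per contained file.

```python
import io
import zipfile
from typing import Callable, Dict


def extract_and_convert(
    container: bytes, convert: Callable[[str, bytes], bytes]
) -> Dict[str, bytes]:
    """Conceptual stand-in: extract each ZIP entry, then convert it."""
    results: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no file content
            data = archive.read(info)  # extraction step
            results[info.filename] = convert(info.filename, data)  # conversion step
    return results


def build_sample_container() -> bytes:
    """Build a small in-memory ZIP archive to act as the document container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "alpha")
        archive.writestr("b.txt", "beta")
    return buffer.getvalue()


def fake_convert(name: str, data: bytes) -> bytes:
    """Hypothetical placeholder for a real conversion: upper-cases the bytes."""
    return data.upper()


if __name__ == "__main__":
    converted = extract_and_convert(build_sample_container(), fake_convert)
    for name in converted:
        print(name, len(converted[name]))
```

With the library itself, both stages happen inside the single convert_multiple call, so no manual extraction loop is needed.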

Document Container File Types

The following file types are considered document containers:

Email and Outlook

  • EML - Email Message File.
  • EMLX - Apple Mail Email File.
  • MSG - Microsoft Outlook Message File.
  • OST - Outlook Offline Data File.
  • PST - Outlook Personal Information Store File.

PDF

  • PDF - PDF files that contain embedded resources.

Word Processing

  • DOC - The older Microsoft Word binary format.
  • DOCX - The modern Word format.
  • DOT and DOTX - Word template files.
  • RTF - Rich Text Format.

Compression

  • 7Z - 7-Zip Compressed File.
  • BZ2 - Bzip2 Compressed File.
  • CAB - Windows Cabinet File.
  • CPIO - CPIO Compressed File.
  • GZ - GNU Zipped Archive.
  • GZIP - Gzip Compressed File.
  • LZ - Lzip Compressed File.
  • LZMA - LZMA Compressed File.
  • RAR - RAR Compressed Archive.
  • TAR - Consolidated Unix File Archive.
  • XZ - Xz Compressed File.
  • Z - Unix Compressed File.
  • ZIP - ZIP Compressed File.
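As a rough pre-check before choosing between a regular conversion and convert_multiple, the extensions listed above can be collected into a lookup. This is a minimal sketch: CONTAINER_EXTENSIONS and is_document_container are hypothetical names and are not part of the library. The check looks only at the file extension, so it cannot tell whether a PDF actually contains embedded resources.

```python
from pathlib import Path

# Extensions copied from the lists above (lowercase, without the dot).
CONTAINER_EXTENSIONS = frozenset({
    # Email and Outlook
    "eml", "emlx", "msg", "ost", "pst",
    # PDF
    "pdf",
    # Word Processing
    "doc", "docx", "dot", "dotx", "rtf",
    # Compression
    "7z", "bz2", "cab", "cpio", "gz", "gzip", "lz", "lzma",
    "rar", "tar", "xz", "z", "zip",
})


def is_document_container(path: str) -> bool:
    """Return True if the file extension is one of the listed container types."""
    extension = Path(path).suffix.lstrip(".").lower()
    return extension in CONTAINER_EXTENSIONS


if __name__ == "__main__":
    for name in ("compressed.zip", "mail.PST", "notes.txt"):
        print(name, is_document_container(name))
```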

Example: Convert Files Within a Document Container

The following example demonstrates how to convert each file in a ZIP archive to PDF.

The file name template for the output files is {file name}_{source file extension}.{output file extension}. In this example, the archived file business-plan.docx is converted and saved as business-plan_docx.pdf.

from groupdocs.conversion import Converter
from groupdocs.conversion.options.convert import PdfConvertOptions

def convert_files_within_document_container():
    # Instantiate Converter with the input document
    with Converter("./compressed.zip") as converter:
        # Instantiate convert options
        pdf_convert_options = PdfConvertOptions()

        # Extract, convert, and save output files in PDF format
        converter.convert_multiple("./converted-files", pdf_convert_options)

if __name__ == "__main__":
    convert_files_within_document_container()

compressed.zip is the sample file used in this example.

converted-files is the output folder path for the converted files.
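The output file name template described in this example can be reproduced with a small helper, which is handy for predicting which files to expect in the output folder. This is a minimal sketch: output_file_name is a hypothetical function, not a library call, and it assumes the source file has an extension. How the library names files without an extension is not covered here.

```python
from pathlib import PurePath


def output_file_name(source_name: str, output_extension: str) -> str:
    """Apply {file name}_{source file extension}.{output file extension}."""
    source = PurePath(source_name)
    stem = source.stem  # "business-plan"
    source_extension = source.suffix.lstrip(".")  # "docx"
    return f"{stem}_{source_extension}.{output_extension}"


if __name__ == "__main__":
    print(output_file_name("business-plan.docx", "pdf"))
```

Keeping the source extension in the output name avoids collisions when the container holds files that share a base name, such as report.docx and report.xlsx.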