GroupDocs.Parser extracts metadata (document properties and file information) from many document formats, including PDF, email, e-book, Microsoft Office (Word, PowerPoint, Excel), and LibreOffice files.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents for testing
Understanding of document metadata concepts
Extract metadata from documents
To extract metadata from documents, use the get_metadata() method:
    from groupdocs.parser import Parser

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Extract metadata from the document
        metadata = parser.get_metadata()

        # Check if metadata extraction is supported
        if metadata is None:
            print("Metadata extraction isn't supported")
        else:
            # Iterate over metadata items
            for item in metadata:
                # Print metadata name and value
                print(f"{item.name}: {item.value}")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns a collection of MetadataItem objects containing document properties such as author, title, creation date, modified date, etc., or None if metadata extraction is not supported.
Common metadata properties
Typical metadata properties include:
Author: Document creator
Title: Document title
Subject: Document subject or description
Keywords: Document keywords
CreatedTime: Creation date and time
ModifiedTime: Last modification date and time
Application: Application used to create the document
Pages: Number of pages (for some formats)
Extract specific metadata fields
To extract and display specific metadata fields:
    from groupdocs.parser import Parser

    def get_metadata_value(metadata, field_name):
        """
        Get a specific metadata value by field name.
        """
        for item in metadata:
            if item.name.lower() == field_name.lower():
                return item.value
        return None

    # Create an instance of Parser class
    with Parser("document.docx") as parser:
        # Extract metadata
        metadata = parser.get_metadata()

        if metadata:
            # Convert to list for reuse
            metadata_list = list(metadata)

            # Extract specific fields
            author = get_metadata_value(metadata_list, "Author")
            title = get_metadata_value(metadata_list, "Title")
            created = get_metadata_value(metadata_list, "CreatedTime")
            modified = get_metadata_value(metadata_list, "ModifiedTime")

            # Print specific metadata
            print("Document Information:")
            print(f" Title: {title}")
            print(f" Author: {author}")
            print(f" Created: {created}")
            print(f" Modified: {modified}")
The following sample file is used in this example: document.docx
Expected behavior: Retrieves specific metadata fields by name, making it easy to access commonly used properties.
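When many fields are looked up, scanning the whole collection once per field can be replaced by a single pass that builds a case-insensitive index. The following is a minimal sketch of that idea, not part of the library: `SampleItem`, `build_metadata_index`, and `lookup` are hypothetical names, and the only assumption about real metadata items is that they expose `name` and `value`.

```python
from collections import namedtuple

# Hypothetical stand-in for a metadata item; only .name and .value are assumed.
SampleItem = namedtuple("SampleItem", ["name", "value"])


def build_metadata_index(metadata):
    """Map lower-cased field names to values in a single pass."""
    index = {}
    for item in metadata:
        # Keep the first occurrence, matching a first-match linear scan.
        index.setdefault(item.name.lower(), item.value)
    return index


def lookup(index, field_name, default=None):
    """Case-insensitive lookup against an index built above."""
    return index.get(field_name.lower(), default)


# Sample data standing in for the items returned by parser.get_metadata().
sample_items = [
    SampleItem("Author", "Jane Doe"),
    SampleItem("Title", "Quarterly Report"),
]
index = build_metadata_index(sample_items)
print(lookup(index, "author"))
print(lookup(index, "Subject", default="(not set)"))
```

With real documents, the list passed to `build_metadata_index` would be the result of `parser.get_metadata()` after the `None` check.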
Export metadata to dictionary
To convert metadata to a Python dictionary:
    from groupdocs.parser import Parser

    def extract_metadata_as_dict(file_path):
        """
        Extract document metadata as a dictionary.
        """
        with Parser(file_path) as parser:
            metadata = parser.get_metadata()

            if metadata is None:
                return None

            # Convert to dictionary
            metadata_dict = {}
            for item in metadata:
                metadata_dict[item.name] = item.value

            return metadata_dict

    # Usage
    metadata = extract_metadata_as_dict("report.pdf")

    if metadata:
        print("Document Metadata:")
        for key, value in metadata.items():
            print(f" {key}: {value}")
    else:
        print("Metadata extraction not supported")
The following sample file is used in this example: report.pdf
Expected behavior: Returns metadata as a dictionary with field names as keys and field values as values.
Export metadata to JSON
To export metadata to JSON format:
    from groupdocs.parser import Parser
    import json

    def export_metadata_to_json(file_path, output_file):
        """
        Extract metadata and save to JSON file.
        """
        with Parser(file_path) as parser:
            metadata = parser.get_metadata()

            if metadata is None:
                print("Metadata extraction not supported")
                return False

            # Build metadata dictionary
            metadata_dict = {
                'file_path': file_path,
                'properties': {}
            }

            for item in metadata:
                # Convert value to string for JSON serialization
                value = str(item.value) if item.value is not None else None
                metadata_dict['properties'][item.name] = value

            # Save to JSON
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(metadata_dict, f, indent=2, ensure_ascii=False)

            print(f"Metadata exported to {output_file}")
            return True

    # Usage
    export_metadata_to_json("sample.docx", "metadata.json")
The following sample file is used in this example: sample.docx
The following output file is produced by this example: metadata.json
Expected behavior: Creates a JSON file containing all document metadata in a structured format.
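Converting every value with `str()` is simple, but it also turns numbers into strings and leaves date formatting to chance. A possible alternative, sketched here under the assumption that date values arrive as Python `datetime` objects, keeps JSON-native types as they are and writes dates in ISO 8601 form. `to_json_value` and `metadata_to_json` are hypothetical helper names, and the sample dictionary stands in for values read from a document.

```python
import json
from datetime import date, datetime


def to_json_value(value):
    """Convert a metadata value into something json.dumps can serialize."""
    # Values JSON already understands pass through unchanged.
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Dates and times become ISO 8601 strings.
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # Anything else falls back to its string representation.
    return str(value)


def metadata_to_json(properties):
    """Serialize a name -> value mapping to a JSON string."""
    safe = {name: to_json_value(value) for name, value in properties.items()}
    return json.dumps(safe, indent=2, ensure_ascii=False)


# Sample values standing in for metadata read from a document.
sample = {
    "Title": "Report",
    "Pages": 12,
    "CreatedTime": datetime(2024, 1, 15, 9, 30, 0),
    "Subject": None,
}
print(metadata_to_json(sample))
```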
Batch metadata extraction
To extract metadata from multiple documents:
    from groupdocs.parser import Parser
    from pathlib import Path
    import csv

    def batch_extract_metadata(input_dir, output_csv):
        """
        Extract metadata from all documents in a directory and save to CSV.
        """
        extensions = ['.pdf', '.docx', '.doc', '.xlsx', '.xls', '.pptx', '.ppt']
        results = []

        for file_path in Path(input_dir).rglob('*'):
            if file_path.suffix.lower() in extensions:
                print(f"Processing: {file_path.name}")

                try:
                    with Parser(str(file_path)) as parser:
                        metadata = parser.get_metadata()

                        if metadata:
                            # Extract common fields
                            meta_dict = {'file_name': file_path.name}

                            for item in metadata:
                                # Keep falsy values such as 0; only None becomes empty
                                meta_dict[item.name] = str(item.value) if item.value is not None else ""

                            results.append(meta_dict)

                except Exception as e:
                    print(f" Error: {e}")

        # Save to CSV
        if results:
            # Get all unique field names
            all_fields = set()
            for result in results:
                all_fields.update(result.keys())

            # Write CSV
            with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=sorted(all_fields))
                writer.writeheader()
                writer.writerows(results)

            print(f"Extracted metadata from {len(results)} documents")
            print(f"Saved to {output_csv}")

    # Usage
    batch_extract_metadata("documents", "metadata_report.csv")
Expected behavior: Processes all documents in a directory, extracts metadata, and generates a CSV report.
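Because each document can expose a different set of fields, the CSV header has to be the union of all field names, and rows that lack a field need an empty cell. The sketch below isolates that step with hypothetical sample rows and an in-memory buffer so that it runs without any documents. `rows_to_csv` is an illustrative name, not a library function.

```python
import csv
import io


def rows_to_csv(rows):
    """Write dictionaries with differing keys to CSV text."""
    # Union of all keys, sorted for a stable column order.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO(newline="")
    # restval fills the cells of fields that a row does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample rows standing in for per-document metadata dictionaries.
rows = [
    {"file_name": "a.pdf", "Author": "Jane Doe"},
    {"file_name": "b.docx", "Title": "Notes"},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Sorting places capitalized field names before `file_name`. If the file name should come first, the field list can be assembled by hand instead of sorted.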
Metadata comparison
To compare metadata between documents:
    from groupdocs.parser import Parser

    def compare_metadata(file1, file2):
        """
        Compare metadata between two documents.
        """
        def get_metadata_dict(file_path):
            with Parser(file_path) as parser:
                metadata = parser.get_metadata()
                if metadata:
                    return {item.name: item.value for item in metadata}
            return {}

        metadata1 = get_metadata_dict(file1)
        metadata2 = get_metadata_dict(file2)

        # Find common and unique fields
        all_fields = set(metadata1.keys()) | set(metadata2.keys())
        common_fields = set(metadata1.keys()) & set(metadata2.keys())

        print(f"Comparing: {file1} vs {file2}")
        # Columns are separated by a space so truncated values do not run together
        print(f"{'Field':<20} {'File 1':<30} {'File 2':<30} {'Match':<10}")
        print("-" * 95)

        for field in sorted(all_fields):
            val1 = str(metadata1.get(field, "N/A"))[:30]
            val2 = str(metadata2.get(field, "N/A"))[:30]
            match = "Yes" if field in common_fields and metadata1.get(field) == metadata2.get(field) else "No"

            print(f"{field:<20} {val1:<30} {val2:<30} {match:<10}")

    # Usage
    compare_metadata("version1.docx", "version2.docx")
The following sample file is used in this example: version1.docx
The following sample file is used in this example: version2.docx
Expected behavior: Provides a side-by-side comparison of metadata fields from two documents.
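When the comparison feeds another program rather than a console, a structured result may be easier to work with than printed rows. The following sketch classifies fields as added, removed, changed, or unchanged. It operates on plain dictionaries such as those produced by a name-to-value conversion, and `diff_metadata` and the sample values are hypothetical.

```python
def diff_metadata(first, second):
    """Classify fields as added, removed, changed, or unchanged."""
    first_keys, second_keys = set(first), set(second)
    shared = first_keys & second_keys
    return {
        # Present only in the first document.
        "removed": sorted(first_keys - second_keys),
        # Present only in the second document.
        "added": sorted(second_keys - first_keys),
        # Present in both, with different values.
        "changed": sorted(k for k in shared if first[k] != second[k]),
        # Present in both, with equal values.
        "unchanged": sorted(k for k in shared if first[k] == second[k]),
    }


# Sample dictionaries standing in for metadata from two document versions.
v1 = {"Author": "Jane Doe", "Title": "Draft", "Keywords": "q1"}
v2 = {"Author": "Jane Doe", "Title": "Final", "Subject": "Results"}
report = diff_metadata(v1, v2)
print(report)
```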
Notes
The get_metadata() method returns None if metadata extraction is not supported
Use parser.features.metadata to check if metadata extraction is available
Metadata properties vary by document format
Some formats provide more metadata than others
Date/time values are typically returned as strings or datetime objects
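The first two notes can be combined into one guard that checks the feature flag before extracting and still tolerates a `None` result. This is a minimal sketch that assumes `parser.features.metadata` behaves as a boolean flag. `read_metadata` and `make_fake_parser` are hypothetical helpers, and the fake parser objects exist only so the example runs without the library installed.

```python
from types import SimpleNamespace


def read_metadata(parser):
    """Return metadata items as a list, or an empty list when unsupported."""
    # Assumption: parser.features.metadata is truthy when extraction is available.
    if not parser.features.metadata:
        return []
    metadata = parser.get_metadata()
    # get_metadata() may still return None, so guard against that too.
    return list(metadata) if metadata is not None else []


def make_fake_parser(supported, items):
    """Build a hypothetical stand-in exposing the two members used above."""
    return SimpleNamespace(
        features=SimpleNamespace(metadata=supported),
        get_metadata=lambda: items,
    )


item = SimpleNamespace(name="Author", value="Jane Doe")
supported = make_fake_parser(True, [item])
unsupported = make_fake_parser(False, None)

print(len(read_metadata(supported)))
print(read_metadata(unsupported))
```

Returning an empty list in both unsupported cases lets calling code iterate without a separate `None` check, at the cost of no longer distinguishing "unsupported" from "no metadata present".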