Retrieval-augmented generation (RAG) systems need documents in a clean, structured text format for chunking and embedding. Markdown fits this role well: it preserves document structure (headings, lists, tables) and is easy to parse.
flowchart LR
A["PDF / DOCX / XLSX"]
B["GroupDocs.Markdown"]
C["Markdown"]
D["Text Chunking"]
E["Vector Embeddings"]
F["LLM Query"]
A --> B --> C --> D --> E --> F
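The stages that follow the Markdown output (chunking, embedding, querying) can be sketched in plain Python. This is only an illustration of the data flow: `embed` is a toy bag-of-words stand-in for a real embedding model, and `embed`, `cosine`, and `retrieve` are hypothetical helper names, not part of GroupDocs.Markdown.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:top_k]


# Two Markdown chunks, as they might look after conversion and splitting
chunks = [
    "# Revenue\nProjected revenue grows through subscription sales.",
    "# Team\nThe founding team has five engineers.",
]
best = retrieve("How will revenue grow?", chunks)
print(best[0].splitlines()[0])  # prints "# Revenue"
```

In a real pipeline the retrieved chunks would be passed to the LLM as context, and `embed` would call an actual embedding model.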
Basic conversion for RAG
import re
from groupdocs.markdown import MarkdownConverter, ConvertOptions, SkipImagesStrategy, MarkdownFlavor

def convert_for_rag():
    """Convert a PDF to Markdown for RAG pipelines, then split into chunks by heading."""
    # Step 1: Configure conversion for text-only RAG (skip images)
    options = ConvertOptions()
    options.image_export_strategy = SkipImagesStrategy()
    options.flavor = MarkdownFlavor.COMMON_MARK

    # Step 2: Convert the document using keyword argument for options
    markdown = MarkdownConverter.to_markdown("business-plan.pdf", convert_options=options)

    # Step 3: Split the Markdown into chunks by heading markers
    chunks = re.split(r"\n#{1,2} ", markdown)

    # Step 4: Process each chunk (e.g., send to an embedding model)
    for chunk in chunks:
        if chunk.strip():
            print(f"Chunk ({len(chunk)} chars): {chunk[:80]}...")

if __name__ == "__main__":
    convert_for_rag()
business-plan.pdf is the sample file used in this example.
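Splitting with `re.split(r"\n#{1,2} ", markdown)` consumes the heading markers and does not split at a heading on the very first line. If each chunk should keep its own heading line, one possible variant is to split on a zero-width lookahead instead. The sketch below assumes Python 3.7 or later (where `re.split` accepts empty matches); `split_by_heading` is a hypothetical helper name, and it does not special-case `#` lines inside fenced code blocks.

```python
import re


def split_by_heading(markdown, max_level=2):
    """Split Markdown before each heading of level 1..max_level, keeping the heading line."""
    # Zero-width match at the start of any line that begins with 1..max_level '#' and a space
    pattern = re.compile(rf"^(?=#{{1,{max_level}}} )", flags=re.MULTILINE)
    parts = pattern.split(markdown)
    return [part.strip() for part in parts if part.strip()]


sample = (
    "# Plan\nIntro text.\n\n"
    "## Market\nMarket size.\n\n"
    "### Segments\nDetails.\n\n"
    "## Finance\nBudget."
)
chunks = split_by_heading(sample)
for chunk in chunks:
    print(repr(chunk.splitlines()[0]))
```

With `max_level=2`, the level-3 heading `### Segments` stays inside the `## Market` chunk, so each chunk carries the heading that gives it context.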
Batch conversion for RAG
import os
import glob
from groupdocs.markdown import MarkdownConverter, ConvertOptions, SkipImagesStrategy, GroupDocsMarkdownException

def batch_convert_for_rag():
    """Batch-convert all PDFs in a folder to Markdown for RAG ingestion."""
    # Step 1: Configure conversion to skip images (text-only RAG)
    options = ConvertOptions()
    options.image_export_strategy = SkipImagesStrategy()

    # Step 2: Find all PDF files in the documents folder
    files = glob.glob("documents/*.pdf")

    # Step 3: Convert each file, handling errors gracefully
    for file in files:
        try:
            markdown = MarkdownConverter.to_markdown(file, convert_options=options)
            output_path = os.path.splitext(file)[0] + ".md"
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(markdown)
            print(f"Converted: {file}")
        except GroupDocsMarkdownException as ex:
            print(f"Skipped {file}: {ex}")

if __name__ == "__main__":
    batch_convert_for_rag()
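Once a folder of `.md` files exists, an ingestion step typically reads them back and records where each chunk came from. The following is a rough sketch of that step using only the standard library; `pack_paragraphs`, `load_chunks`, and the `max_chars` limit are illustrative assumptions rather than anything provided by GroupDocs.Markdown.

```python
import glob
import os


def pack_paragraphs(text, max_chars):
    """Group blank-line-separated paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        # +2 accounts for the blank line that joins two paragraphs
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


def load_chunks(folder, max_chars=1000):
    """Read every .md file in folder and return chunk records with source metadata."""
    records = []
    for path in sorted(glob.glob(os.path.join(folder, "*.md"))):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for index, chunk in enumerate(pack_paragraphs(text, max_chars)):
            records.append({"source": os.path.basename(path), "index": index, "text": chunk})
    return records


# Small in-memory demonstration of the packing step
print(pack_paragraphs("aaaa\n\nbbbb\n\ncccc", max_chars=10))  # ['aaaa\n\nbbbb', 'cccc']
```

Keeping `source` and `index` with each chunk lets a retrieved passage be traced back to the document it was converted from.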