PDF to Markdown Python: 5 Methods Compared (with Code)

A practical comparison of 5 Python libraries for converting PDF to Markdown — Marker, MarkItDown, PyMuPDF4LLM, docling, and pdfplumber. With code examples, pros, and cons.

Jul 5, 2026 · PDF to MD Team

Python has no shortage of PDF to Markdown libraries. But which one actually produces clean, structured Markdown from a real PDF?

We tested 5 popular Python libraries with a 20-page research paper containing tables, citations, and multi-level headings. Here are the results.

Quick comparison

| Library | Tables | Headings | AI Cleanup | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Marker | ✅ Excellent | ✅ | ✅ | Slow (GPU) | Best overall quality |
| MarkItDown | ❌ Poor | ⚠️ | ❌ | Fast | Quick text extraction |
| PyMuPDF4LLM | ✅ Good | ✅ | ❌ | Very fast | Speed + structure |
| docling | ✅ Good | ✅ | ✅ | Medium | Document pipelines |
| pdfplumber | ⚠️ Manual | ❌ | ❌ | Medium | Fine-grained control |
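To run a similar comparison on your own documents, a small harness along these lines may help. It is a minimal sketch, not the procedure used for the results above: `profile_converter` is a hypothetical helper, and the heading and table-row counts are only rough signals of structure, not a quality score.

```python
import re
import time
from typing import Callable


def profile_converter(convert: Callable[[str], str], pdf_path: str) -> dict:
    """Time one conversion and count rough structure signals in the Markdown."""
    start = time.perf_counter()
    markdown_text = convert(pdf_path)
    elapsed = time.perf_counter() - start
    lines = markdown_text.splitlines()
    return {
        "seconds": round(elapsed, 3),
        "chars": len(markdown_text),
        # Lines that look like ATX headings (#, ##, ...)
        "headings": sum(1 for line in lines if re.match(r"#{1,6}\s", line)),
        # Lines that look like pipe-table rows (includes the separator row)
        "table_rows": sum(1 for line in lines if line.lstrip().startswith("|")),
    }
```

Any of the libraries below can be wrapped in a function that takes a path and returns a Markdown string, then passed in as `convert`.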

Method 1: Marker — Best overall quality

Marker uses deep learning models to detect layout, tables, and reading order. In our test it produced the highest-quality Markdown of the five libraries.

Installation

pip install marker-pdf

Code example

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("research_paper.pdf")

# Get Markdown output
markdown_text = rendered.markdown
print(markdown_text[:500])

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

Pros

Excellent table preservation
Detects heading hierarchy accurately
Repairs reading order for multi-column layouts
Active community

Cons

Requires GPU for reasonable speed
First run downloads large model files (~2GB)
Slower than other libraries
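Because loading the models is the expensive part, it usually pays to build the converter once and reuse it for every file. The sketch below assumes only that you can wrap a converter as a function from path to Markdown string; `convert_many` is a hypothetical helper, not part of Marker.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_many(convert: Callable[[str], str],
                 pdf_paths: Iterable[str],
                 out_dir: str) -> list[Path]:
    """Convert each PDF with an already-built converter; write one .md per input."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in pdf_paths:
        markdown_text = convert(pdf_path)
        target = out / (Path(pdf_path).stem + ".md")
        target.write_text(markdown_text, encoding="utf-8")
        written.append(target)
    return written


# Hypothetical usage with Marker, reusing the calls from the code example:
#   converter = PdfConverter(artifact_dict=create_model_dict())  # models load once
#   convert_many(lambda p: converter(p).markdown, ["a.pdf", "b.pdf"], "out")
```

Input files with the same stem in different folders would overwrite each other here, so a real script may want to mirror the directory layout instead.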

Best for

Projects where output quality matters more than speed.

Method 2: MarkItDown — Best for multi-format conversion

MarkItDown from Microsoft supports PDF, Word, Excel, PowerPoint, and more. It is simple to use, but its PDF handling is basic.

Installation

pip install markitdown

Code example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")

markdown_text = result.text_content
print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

Pros

Supports many file formats
Very simple API
No GPU required
Backed by Microsoft

Cons

Tables often lost or garbled
No line wrap repair
No heading hierarchy detection
No page noise removal
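Some of these gaps can be narrowed with a post-processing pass over the extracted text. The following is a minimal sketch using only the standard library; `clean_extracted_text` and its patterns are assumptions for illustration, and the heuristics are deliberately crude (a body line that is just a number is dropped, and a genuine trailing hyphen such as "well-" is joined without it).

```python
import re

# Bare page markers such as "12", "Page 3", or "3 of 20"
PAGE_NOISE = re.compile(r"^(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?$", re.IGNORECASE)
HEADING = re.compile(r"^#{1,6}\s")
LIST_OR_TABLE = re.compile(r"^(?:[-*+]\s|\d+[.)]\s|\|)")


def clean_extracted_text(text: str) -> str:
    """Drop page-number lines and merge hard-wrapped lines into paragraphs."""
    blocks, current = [], []
    in_group = False  # True while consecutive list/table lines are collected

    def flush():
        if current:
            blocks.append(" ".join(current))
            current.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if PAGE_NOISE.match(line):
            continue
        if not line:
            flush()
            in_group = False
        elif HEADING.match(line):
            flush()
            blocks.append(line)
            in_group = False
        elif LIST_OR_TABLE.match(line):
            flush()
            if in_group:
                blocks[-1] += "\n" + line  # keep list/table rows adjacent
            else:
                blocks.append(line)
            in_group = True
        else:
            if current and current[-1].endswith("-"):
                current[-1] = current[-1][:-1] + line  # rejoin hyphenated word
            else:
                current.append(line)
            in_group = False
    flush()
    return "\n\n".join(blocks)
```

Running it on `result.text_content` before saving gives paragraphs on single lines, which tends to be easier to diff and to chunk later.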

Best for

Quick text extraction when structure is not critical, or when you need to convert multiple file types.

Method 3: PyMuPDF4LLM — Best for speed

PyMuPDF4LLM was the fastest option in our test and produces decent Markdown structure.

Installation

pip install pymupdf4llm

Code example

import pymupdf4llm

# Basic conversion
markdown_text = pymupdf4llm.to_markdown("research_paper.pdf")

print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

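# With page chunks for RAG: one dict per page
chunks = pymupdf4llm.to_markdown("research_paper.pdf", page_chunks=True)
for i, chunk in enumerate(chunks, start=1):
    # Fall back to the loop index if the metadata has no "page" key
    page_no = chunk["metadata"].get("page", i)
    print(f"Page {page_no}: {len(chunk['text'])} chars")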

Pros

Very fast — processes 20 pages in seconds
Good table detection
Page chunk output for RAG
No GPU required
Well-documented

Cons

No AI cleanup for line wraps
No heading normalization
Limited handling of complex layouts
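Heading normalization is straightforward to add as a post-processing step. The sketch below is one possible approach, not something PyMuPDF4LLM provides: `normalize_headings` is a hypothetical helper that renumbers ATX headings so the outline starts at `#` and never skips a level, and it leaves fenced code blocks untouched.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def normalize_headings(markdown_text: str) -> str:
    """Renumber ATX heading levels so they start at 1 and never skip a level."""
    out, stack, in_fence = [], [], False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if not match:
            out.append(line)
            continue
        level = len(match.group(1))
        # Drop original levels that are not ancestors of this heading
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        # New level = depth in the ancestor stack
        out.append("#" * len(stack) + " " + match.group(2).strip())
    return "\n".join(out)
```

When the source skips levels inconsistently, two headings that had different original levels can end up at the same depth, so the result is worth a quick visual check.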

Best for

Speed-sensitive applications and developers who need page-level chunks.

Method 4: docling — Best for document understanding pipelines

docling from IBM focuses on document understanding with strong layout analysis.

Installation

pip install docling

Code example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("research_paper.pdf")

# Get Markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

# Get structured data (JSON)
json_data = result.document.export_to_dict()

Pros

Strong layout analysis
Good table extraction
Exports to Markdown, JSON, and DocTags
Active development from IBM

Cons

Requires GPU for best performance
More complex API than MarkItDown
Slower than PyMuPDF4LLM

Best for

Teams building document processing pipelines that need structured output.

Method 5: pdfplumber — Best for fine-grained control

pdfplumber does not produce Markdown directly, but gives you the tools to build custom extraction logic.

Installation

pip install pdfplumber

Code example

import pdfplumber
import re

def pdf_to_markdown(pdf_path):
    markdown_lines = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            text = page.extract_text()
            if text:
                # Basic cleanup: join broken lines
                text = re.sub(r'-\n', '', text)  # Join hyphenated words
                text = re.sub(r'\n(?!\n)', ' ', text)  # Join regular lines
                markdown_lines.append(text)
            
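            # Extract tables (note: extract_text() above also returns the
            # raw text inside tables, so table content appears twice)
            tables = page.extract_tables()
            for table in tables:
                if table:
                    # Convert to a Markdown table. Empty cells come back as
                    # None, and newlines inside a cell would break the row.
                    def cell(c):
                        return (c or "").replace("\n", " ")

                    header = [cell(c) for c in table[0]]
                    rows = table[1:]
                    md_table = "| " + " | ".join(header) + " |\n"
                    md_table += "| " + " | ".join(["---"] * len(header)) + " |\n"
                    for row in rows:
                        md_table += "| " + " | ".join(cell(c) for c in row) + " |\n"
                    markdown_lines.append(md_table)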
            
            markdown_lines.append("\n---\n")  # Page separator
    
    return "\n".join(markdown_lines)

markdown_text = pdf_to_markdown("research_paper.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

Pros

Full control over extraction logic
Good table extraction
No GPU required
Lightweight

Cons

No built-in Markdown output — you build it yourself
No heading detection
No AI cleanup
More code to write and maintain
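Heading detection is one of the pieces you would have to write yourself. A common heuristic is to treat lines set in a noticeably larger font than the body text as headings; pdfplumber exposes a `size` value on each character, which can be aggregated per line. The sketch below assumes you have already reduced each line to a `(text, font_size)` pair; `mark_headings` and the `min_ratio` threshold are illustrative assumptions, not pdfplumber features.

```python
from collections import Counter


def mark_headings(lines: list[tuple[str, float]],
                  min_ratio: float = 1.15) -> list[str]:
    """Prefix lines set noticeably larger than the body font with #, ##, ..."""
    if not lines:
        return []
    rounded = [(text, round(size, 1)) for text, size in lines]
    # Body size = the font size that covers the most characters
    weight = Counter()
    for text, size in rounded:
        weight[size] += len(text)
    body = weight.most_common(1)[0][0]
    # Distinct larger sizes, biggest first, map to heading levels 1..6
    bigger = sorted({s for _, s in rounded if s >= body * min_ratio}, reverse=True)
    level = {size: min(i + 1, 6) for i, size in enumerate(bigger)}
    return [
        "#" * level[size] + " " + text if size in level else text
        for text, size in rounded
    ]
```

Documents that mark headings with bold text at body size will slip through, so font name or weight may need to be considered as well.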

Best for

Custom extraction pipelines where you need fine-grained control.

When to use an online tool instead

Python libraries are powerful but have trade-offs:

| Factor | Python Libraries | pdftomd.xyz |
| --- | --- | --- |
| Setup | Install Python, dependencies | None (browser-based) |
| GPU | Often required | Not needed |
| AI cleanup | Varies by library | Always included |
| RAG output | Build yourself | Built-in JSON + chunks |
| Batch | Write your own script | Built-in batch mode |
| Obsidian frontmatter | Build yourself | Built-in |
| Cost | Free (open source) | Free preview, $9/mo for full |

If you want clean Markdown without writing code, pdftomd.xyz offers AI-powered cleanup, multiple output modes, and a free preview — no Python required.

Recommendation

Best quality: Marker (if you have a GPU)
Best speed: PyMuPDF4LLM
Best for multi-format: MarkItDown
Best for pipelines: docling
Best for custom control: pdfplumber
Best no-code option: pdftomd.xyz

For most developers, PyMuPDF4LLM is the best starting point — it is fast, produces good structure, and has a simple API. Switch to Marker if you need maximum quality and have a GPU.
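If a script has to run on machines with different libraries installed, this preference can be encoded as a simple fallback. The sketch below is an assumption-laden illustration: `pick_backend` is a hypothetical helper, the module names are the import names used in the code examples in this article, and the order after the first two entries is arbitrary.

```python
from importlib.util import find_spec
from typing import Callable, Optional, Sequence

# Import names as used in this article's code examples; order is an assumption
DEFAULT_ORDER = ("pymupdf4llm", "marker", "docling", "markitdown", "pdfplumber")


def _installed(name: str) -> bool:
    """True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


def pick_backend(order: Sequence[str] = DEFAULT_ORDER,
                 is_available: Callable[[str], bool] = _installed) -> Optional[str]:
    """Return the first module name in `order` that is importable, else None."""
    for name in order:
        if is_available(name):
            return name
    return None
```

The returned name can then select which of the conversion snippets above to run.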

For RAG pipelines specifically, consider using a Python library for extraction and pdftomd.xyz RAG-ready mode for chunk-friendly output with JSON export.

FAQ

What is the best Python library for PDF to Markdown?

For most use cases, PyMuPDF4LLM offers the best balance of speed, quality, and ease of use. For maximum quality, Marker is the best choice if you have a GPU.

Can I convert PDF to Markdown in Python for free?

Yes. All five libraries covered (Marker, MarkItDown, PyMuPDF4LLM, docling, pdfplumber) are free and open source.

Which Python library preserves tables best?

Marker has the best table preservation, followed by docling and PyMuPDF4LLM. MarkItDown and pdfplumber have limited or manual table support.

Can I use these libraries for RAG?

Yes, but you need to build the chunking logic yourself. PyMuPDF4LLM has a page_chunks parameter that helps. For built-in RAG output, use pdftomd.xyz RAG-ready mode.
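As a starting point for your own chunking logic, splitting at headings and carrying the heading path along with each chunk often works reasonably well. This is a minimal sketch under stated assumptions: `chunk_by_heading`, the `heading_path` key, and the `max_chars` default are all hypothetical, fenced code blocks are not treated specially, and a single paragraph longer than `max_chars` is kept whole.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_by_heading(markdown_text: str, max_chars: int = 2000) -> list[dict]:
    """Split Markdown at headings; split oversized sections on paragraph breaks."""
    chunks, path, buf = [], [], []

    def emit():
        body = "\n".join(buf).strip()
        buf.clear()
        if not body:
            return
        heading_path = " > ".join(title for _, title in path)
        piece = ""
        for para in re.split(r"\n\s*\n", body):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append({"heading_path": heading_path, "text": piece})
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append({"heading_path": heading_path, "text": piece})

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            emit()  # text collected so far belongs to the previous heading
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buf.append(line)
    emit()
    return chunks
```

Storing the heading path next to the text lets a retriever show where a passage came from, and it can be prepended to the chunk before embedding if that helps recall.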

Do I need a GPU to convert PDF to Markdown in Python?

Only Marker and docling benefit significantly from a GPU. PyMuPDF4LLM, MarkItDown, and pdfplumber run fine on CPU.
