PDF to Markdown Python: 5 Methods Compared (with Code)

A practical comparison of 5 Python libraries for converting PDF to Markdown — Marker, MarkItDown, PyMuPDF4LLM, docling, and pdfplumber. With code examples, pros, and cons.

Jul 5, 2026 · PDF to MD Team

Python has no shortage of PDF to Markdown libraries. But which one actually produces clean, structured Markdown from a real PDF?

We tested 5 popular Python libraries with a 20-page research paper containing tables, citations, and multi-level headings. Here are the results.

Quick comparison

| Library | Tables | Headings | AI Cleanup | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Marker | ✅ Excellent | ✅ | ✅ | Slow (GPU) | Best overall quality |
| MarkItDown | ❌ Poor | ⚠️ | ❌ | Fast | Quick text extraction |
| PyMuPDF4LLM | ✅ Good | ✅ | ❌ | Very fast | Speed + structure |
| docling | ✅ Good | ✅ | ✅ | Medium | Document pipelines |
| pdfplumber | ⚠️ Manual | ❌ | ❌ | Medium | Fine-grained control |
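To check the Speed column against your own documents, a small timing harness is enough. The sketch below is one possible way to measure this, not the procedure used for the table above; `time_converters` and its result keys are illustrative names. A converter here is any function that takes a PDF path and returns Markdown text.

```python
import time
from typing import Callable, Dict


def time_converters(
    converters: Dict[str, Callable[[str], str]], pdf_path: str
) -> Dict[str, dict]:
    """Run each converter once; record wall-clock seconds and output length."""
    results = {}
    for name, convert in converters.items():
        start = time.perf_counter()
        try:
            markdown = convert(pdf_path)
            results[name] = {
                "seconds": time.perf_counter() - start,
                "chars": len(markdown),
                "error": None,
            }
        except Exception as exc:
            # A failing library should not abort the whole comparison.
            results[name] = {
                "seconds": time.perf_counter() - start,
                "chars": 0,
                "error": repr(exc),
            }
    return results
```

For example, `time_converters({"PyMuPDF4LLM": pymupdf4llm.to_markdown}, "research_paper.pdf")` would time the PyMuPDF4LLM call covered in Method 3. A single run includes model loading for the libraries that load models, so repeat the run if you want steady-state numbers.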

Method 1: Marker — Best overall quality

Marker uses deep learning models to detect layout, tables, and reading order. In our test it produced the highest-quality Markdown of the five libraries.

Installation

pip install marker-pdf

Code example

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("research_paper.pdf")

# Get Markdown output
markdown_text = rendered.markdown
print(markdown_text[:500])

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

Pros

  • Excellent table preservation
  • Detects heading hierarchy accurately
  • Repairs reading order for multi-column layouts
  • Active community

Cons

  • Needs a GPU for reasonable speed (it runs on CPU, but slowly)
  • First run downloads large model files (~2GB)
  • Slower than other libraries

Best for

Projects where output quality matters more than speed.

Method 2: MarkItDown — Best for multi-format conversion

MarkItDown from Microsoft supports PDF, Word, Excel, PowerPoint, and more. It is simple to use, but its PDF handling is basic.

Installation

pip install markitdown

Code example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")

markdown_text = result.text_content
print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

Pros

  • Supports many file formats
  • Very simple API
  • No GPU required
  • Backed by Microsoft

Cons

  • Tables often lost or garbled
  • No line wrap repair
  • No heading hierarchy detection
  • No page noise removal
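Two of these gaps, line wrap repair and page noise removal, can be partly closed with a post-processing pass over the extracted text. The following is a minimal sketch using only the standard library. The function name and the page-number pattern are assumptions for illustration, and the function is meant for plain body text: it would also merge Markdown list items or headings into paragraphs, and it glues every line-ending hyphen, so a genuinely hyphenated word such as "well-known" loses its hyphen when it breaks across lines.

```python
import re

# Lines that are only a page number: "12", "Page 3", "3 / 20", "3 of 20".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?\s*$", re.IGNORECASE)


def clean_extracted_text(text: str) -> str:
    """Drop page-number lines and re-join hard-wrapped lines into paragraphs."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if PAGE_NUMBER.match(stripped):
            continue  # page noise, skip without ending the paragraph
        if not stripped:
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            # Word split across a line break: glue the two halves together.
            current[-1] = current[-1][:-1] + stripped
        else:
            current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

You would call it on `result.text_content` before writing the file.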

Best for

Quick text extraction when structure is not critical, or when you need to convert multiple file types.

Method 3: PyMuPDF4LLM — Best for speed

PyMuPDF4LLM is the fastest option and produces decent Markdown structure.

Installation

pip install pymupdf4llm

Code example

import pymupdf4llm

# Basic conversion
markdown_text = pymupdf4llm.to_markdown("research_paper.pdf")

print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

# With page chunks for RAG
chunks = pymupdf4llm.to_markdown("research_paper.pdf", page_chunks=True)
for i, chunk in enumerate(chunks):
    print(f"Page {chunk['metadata']['page']}: {len(chunk['text'])} chars")

Pros

  • Very fast — processes 20 pages in seconds
  • Good table detection
  • Page chunk output for RAG
  • No GPU required
  • Well-documented

Cons

  • No AI cleanup for line wraps
  • No heading normalization
  • Limited handling of complex layouts
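Heading normalization is straightforward to add as a post-processing step. The sketch below is one possible heuristic, not part of PyMuPDF4LLM: it shifts all headings so the shallowest becomes `#`, and it prevents a heading from sitting more than one level deeper than the heading before it. `normalize_headings` is an illustrative name, and lines inside fenced code blocks are left alone.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def normalize_headings(markdown: str) -> str:
    """Shift headings so the shallowest is '#', and never skip a level going deeper."""
    lines = markdown.splitlines()
    headings = {}  # line index -> (level, title)
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            headings[i] = (len(match.group(1)), match.group(2).strip())
    if not headings:
        return markdown

    shift = min(level for level, _ in headings.values()) - 1
    previous = 0
    for i in sorted(headings):
        level, title = headings[i]
        # Apply the global shift, then cap the jump at one level deeper.
        level = min(level - shift, previous + 1)
        lines[i] = "#" * level + " " + title
        previous = level
    # splitlines() drops a trailing newline, so restore it if it was there.
    return "\n".join(lines) + ("\n" if markdown.endswith("\n") else "")
```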

Best for

Speed-sensitive applications and developers who need page-level chunks.

Method 4: docling — Best for document understanding pipelines

docling from IBM focuses on document understanding with strong layout analysis.

Installation

pip install docling

Code example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("research_paper.pdf")

# Get Markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

# Get structured data (JSON)
json_data = result.document.export_to_dict()

Pros

  • Strong layout analysis
  • Good table extraction
  • Exports to Markdown, JSON, and DocTags
  • Active development from IBM

Cons

  • Requires GPU for best performance
  • More complex API than MarkItDown
  • Slower than PyMuPDF4LLM

Best for

Teams building document processing pipelines that need structured output.

Method 5: pdfplumber — Best for fine-grained control

pdfplumber does not produce Markdown directly, but gives you the tools to build custom extraction logic.

Installation

pip install pdfplumber

Code example

import pdfplumber
import re

def pdf_to_markdown(pdf_path):
    markdown_lines = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            text = page.extract_text()
            if text:
                # Basic cleanup: join broken lines
                text = re.sub(r'-\n', '', text)  # Join hyphenated words
                text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)  # Join single line breaks, keep blank lines
                markdown_lines.append(text)
            
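            # Extract tables (note: extract_text() above also returns the table text)
            tables = page.extract_tables()
            for table in tables:
                if table:
                    # Cells can be None or contain newlines and pipes; normalize them first
                    cleaned = [
                        [str(c or "").replace("\n", " ").replace("|", "\\|") for c in row]
                        for row in table
                    ]
                    header, rows = cleaned[0], cleaned[1:]
                    # Convert to Markdown table
                    md_table = "| " + " | ".join(header) + " |\n"
                    md_table += "| " + " | ".join(["---"] * len(header)) + " |\n"
                    for row in rows:
                        md_table += "| " + " | ".join(row) + " |\n"
                    markdown_lines.append(md_table)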
            
            markdown_lines.append("\n---\n")  # Page separator
    
    return "\n".join(markdown_lines)

markdown_text = pdf_to_markdown("research_paper.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

Pros

  • Full control over extraction logic
  • Good table extraction
  • No GPU required
  • Lightweight

Cons

  • No built-in Markdown output — you build it yourself
  • No heading detection
  • No AI cleanup
  • More code to write and maintain

Best for

Custom extraction pipelines where you need fine-grained control.
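Heading detection is the kind of custom logic pdfplumber leaves to you. A common heuristic is font size: treat the size that covers the most text as body text, and promote larger sizes to headings. The sketch below works on `(text, font_size)` pairs so it stays independent of how you collect them. With pdfplumber, each entry in `page.chars` carries a `size` value, but grouping characters into lines is not shown here. `lines_to_markdown` and `max_levels` are illustrative names.

```python
from collections import Counter
from typing import List, Tuple


def lines_to_markdown(lines: List[Tuple[str, float]], max_levels: int = 3) -> str:
    """Turn (text, font_size) pairs into Markdown, promoting larger text to headings."""
    if not lines:
        return ""
    # Body size = the font size that covers the most characters.
    weight = Counter()
    for text, size in lines:
        weight[round(size, 1)] += len(text)
    body_size = weight.most_common(1)[0][0]

    # Distinct sizes above body size, largest first, map to #, ##, ###.
    bigger = sorted((s for s in weight if s > body_size), reverse=True)[:max_levels]
    level_for = {size: i + 1 for i, size in enumerate(bigger)}

    out = []
    for text, size in lines:
        level = level_for.get(round(size, 1))
        out.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(out)
```

Font size alone misses headings set in bold at body size, so treat this as a starting point.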

When to use an online tool instead

Python libraries are powerful but have trade-offs:

| Factor | Python Libraries | pdftomd.xyz |
| --- | --- | --- |
| Setup | Install Python, dependencies | None (browser-based) |
| GPU | Recommended for Marker and docling | Not needed |
| AI cleanup | Varies by library | Always included |
| RAG output | Build yourself | Built-in JSON + chunks |
| Batch | Write your own script | Built-in batch mode |
| Obsidian frontmatter | Build yourself | Built-in |
| Cost | Free (open source) | Free preview, $9/mo for full |
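For reference, the "write your own script" row for batch conversion can be fairly short. The following is a minimal sketch under stated assumptions: `convert_folder` is an illustrative name, `convert` is any function that takes a PDF path and returns Markdown text, and only PDFs directly inside the source folder are processed.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(src: str, dst: str, convert: Callable[[str], str]) -> List[Path]:
    """Convert every PDF directly inside src and write one .md file per PDF into dst."""
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".md")
        try:
            target.write_text(convert(str(pdf)), encoding="utf-8")
            written.append(target)
        except Exception as exc:
            # Keep going: one unreadable PDF should not stop the batch.
            print(f"skipped {pdf.name}: {exc}")
    return written
```

With PyMuPDF4LLM, the call would look like `convert_folder("pdfs", "markdown", pymupdf4llm.to_markdown)`.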

If you want clean Markdown without writing code, pdftomd.xyz offers AI-powered cleanup, multiple output modes, and a free preview — no Python required.

Recommendation

  • Best quality: Marker (if you have a GPU)
  • Best speed: PyMuPDF4LLM
  • Best for multi-format: MarkItDown
  • Best for pipelines: docling
  • Best for custom control: pdfplumber
  • Best no-code option: pdftomd.xyz

For most developers, PyMuPDF4LLM is the best starting point — it is fast, produces good structure, and has a simple API. Switch to Marker if you need maximum quality and have a GPU.

For RAG pipelines specifically, consider using a Python library for extraction and pdftomd.xyz RAG-ready mode for chunk-friendly output with JSON export.

FAQ

What is the best Python library for PDF to Markdown?

For most use cases, PyMuPDF4LLM offers the best balance of speed, quality, and ease of use. For maximum quality, Marker is the best choice if you have a GPU.

Can I convert PDF to Markdown in Python for free?

Yes. All five libraries covered (Marker, MarkItDown, PyMuPDF4LLM, docling, pdfplumber) are free and open source.

Which Python library preserves tables best?

Marker has the best table preservation, followed by docling and PyMuPDF4LLM. MarkItDown often loses or garbles tables, and pdfplumber extracts them well but leaves the Markdown formatting to you.

Can I use these libraries for RAG?

Yes, but you need to build the chunking logic yourself. PyMuPDF4LLM has a page_chunks parameter that helps. For built-in RAG output, use pdftomd.xyz RAG-ready mode.
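If you do build the chunking yourself, splitting at headings is a reasonable first pass. The sketch below is one simple approach, not a recommendation from any of the libraries: it splits Markdown into sections at headings, then splits oversized sections at paragraph breaks. `chunk_by_heading` and `max_chars` are illustrative names. A single paragraph longer than `max_chars` is kept whole, and `#` lines inside fenced code blocks are treated as headings.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown: str, max_chars: int = 1500) -> List[Dict[str, str]]:
    """Split Markdown at headings, then split long sections at paragraph breaks."""
    sections = []  # (heading, body) pairs
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))

    chunks = []
    for heading, text in sections:
        if not text:
            continue  # heading with no body, or empty preamble
        current = ""
        for paragraph in re.split(r"\n\s*\n", text):
            if current and len(current) + len(paragraph) + 2 > max_chars:
                chunks.append({"heading": heading, "text": current})
                current = paragraph
            else:
                current = f"{current}\n\n{paragraph}" if current else paragraph
        chunks.append({"heading": heading, "text": current})
    return chunks
```

Keeping the heading next to each chunk gives the retriever some context about where the text came from.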

Do I need a GPU to convert PDF to Markdown in Python?

Only Marker and docling benefit significantly from a GPU. PyMuPDF4LLM, MarkItDown, and pdfplumber run fine on CPU.
