PDF to Markdown Python: 5 Methods Compared (with Code)
A practical comparison of 5 Python libraries for converting PDF to Markdown — Marker, MarkItDown, PyMuPDF4LLM, docling, and pdfplumber. With code examples, pros, and cons.
Python has no shortage of PDF to Markdown libraries. But which one actually produces clean, structured Markdown from a real PDF?
We tested 5 popular Python libraries with a 20-page research paper containing tables, citations, and multi-level headings. Here are the results.
Quick comparison
| Library | Tables | Headings | AI Cleanup | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Marker | ✅ Excellent | ✅ | ✅ | Slow (GPU) | Best overall quality |
| MarkItDown | ❌ Poor | ⚠️ | ❌ | Fast | Quick text extraction |
| PyMuPDF4LLM | ✅ Good | ✅ | ❌ | Very fast | Speed + structure |
| docling | ✅ Good | ✅ | ✅ | Medium | Document pipelines |
| pdfplumber | ⚠️ Manual | ❌ | ❌ | Medium | Fine-grained control |
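The Speed column is easiest to verify on your own documents. The harness below is a minimal sketch, not the benchmark used for this article: `time_converters` is a hypothetical helper, and each converter is assumed to be a callable that takes a PDF path and returns a Markdown string.

```python
import time
from typing import Callable, Dict


def time_converters(
    converters: Dict[str, Callable[[str], str]], pdf_path: str
) -> Dict[str, dict]:
    """Run each converter once and record wall-clock time and output size."""
    results = {}
    for name, convert in converters.items():
        start = time.perf_counter()
        try:
            markdown = convert(pdf_path)
            error = None
        except Exception as exc:  # keep going if one library fails
            markdown = ""
            error = repr(exc)
        results[name] = {
            "seconds": time.perf_counter() - start,
            "chars": len(markdown),
            "error": error,
        }
    return results


if __name__ == "__main__":
    # Stand-in converter so the sketch runs without any PDF library installed.
    demo = {"dummy": lambda path: f"# Converted {path}\n"}
    for name, stats in time_converters(demo, "research_paper.pdf").items():
        print(f"{name}: {stats['seconds']:.3f}s, {stats['chars']} chars")
```

To compare real libraries, pass thin wrappers around the calls shown in each method section. For model-based tools, build the converter object outside the wrapper if you want to exclude model loading from the timing.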
Method 1: Marker — Best overall quality
Marker uses deep learning models to detect layout, tables, and reading order. In our test it produced the highest-quality Markdown of the five libraries.
Installation
pip install marker-pdf
Code example
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("research_paper.pdf")
# Get Markdown output
markdown_text = rendered.markdown
print(markdown_text[:500])
# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)
Pros
- Excellent table preservation
- Detects heading hierarchy accurately
- Repairs reading order for multi-column layouts
- Active community
Cons
- Requires GPU for reasonable speed
- First run downloads large model files (~2GB)
- Slower than other libraries
Best for
Projects where output quality matters more than speed.
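When conversion is the slow step, it can help to avoid repeating it. The following is a minimal caching sketch, assuming a hypothetical `convert` callable that takes a PDF path and returns Markdown (for example, a small wrapper around the Marker code above). The `cached_convert` name and the `.md_cache` directory are illustrative choices, not part of Marker.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_convert(
    pdf_path: str, convert: Callable[[str], str], cache_dir: str = ".md_cache"
) -> str:
    """Return cached Markdown for this exact file content, converting on a miss."""
    # Key the cache on file content so a renamed or edited PDF is handled correctly.
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert(pdf_path)  # the slow step
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Hashing the file content rather than the path means an edited PDF triggers a fresh conversion, at the cost of reading the file once per call.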
Method 2: MarkItDown — Best for multi-format conversion
MarkItDown from Microsoft supports PDF, Word, Excel, PowerPoint, and more. It is simple to use, but its PDF handling is basic.
Installation
pip install markitdown
Code example
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("research_paper.pdf")
markdown_text = result.text_content
print(markdown_text[:500])
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)
Pros
- Supports many file formats
- Very simple API
- No GPU required
- Backed by Microsoft
Cons
- Tables often lost or garbled
- No line wrap repair
- No heading hierarchy detection
- No page noise removal
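Some of these gaps can be narrowed with a post-processing pass over the extracted text. The sketch below is a rough heuristic rather than a MarkItDown feature: `clean_extracted_text` is a hypothetical helper that drops lines containing only a page number and rejoins hard-wrapped paragraph lines. It will also drop any legitimate line that is only a number, and it removes the hyphen from every line-ending hyphen, including real compound words.

```python
import re

# Lines that are only "12", "Page 12", or "Page 12 of 20".
NOISE = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)
# Line prefixes that should never be merged into a paragraph.
STRUCTURAL = ("#", "|", "- ", "* ")


def clean_extracted_text(text: str) -> str:
    """Drop bare page-number lines and rejoin hard-wrapped paragraph lines."""
    out, current = [], ""

    def flush():
        nonlocal current
        if current:
            out.append(current)
            current = ""

    for line in text.splitlines():
        stripped = line.strip()
        if NOISE.match(stripped):
            continue
        if not stripped:
            flush()
            if out and out[-1] != "":
                out.append("")  # keep a single blank line between blocks
        elif stripped.startswith(STRUCTURAL):
            flush()
            out.append(stripped)  # headings, table rows, list items stay as-is
        elif current.endswith("-"):
            current = current[:-1] + stripped  # rejoin a word split by a hyphen
        else:
            current = f"{current} {stripped}".strip()
    flush()
    return "\n".join(out).strip()
```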
Best for
Quick text extraction when structure is not critical, or when you need to convert multiple file types.
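For mixed folders of documents, a small batch loop is usually enough. This is a sketch under stated assumptions: `batch_convert` is a hypothetical helper, the extension list is only an example, and `convert` is any callable that takes a file path and returns Markdown.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def batch_convert(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], str],
    extensions: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> List[Path]:
    """Convert every matching file under src_dir and write <filename>.md to out_dir."""
    wanted = {ext.lower() for ext in extensions}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            # Keep the original extension in the name so a.pdf and a.docx do not collide.
            target = out / f"{path.name}.md"
            target.write_text(convert(str(path)), encoding="utf-8")
            written.append(target)
    return written
```

With the MarkItDown example above, the callable could be `lambda p: md.convert(p).text_content`. Files with the same name in different subfolders would still overwrite each other in this sketch.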
Method 3: PyMuPDF4LLM — Best for speed
PyMuPDF4LLM is the fastest option and produces decent Markdown structure.
Installation
pip install pymupdf4llm
Code example
import pymupdf4llm
# Basic conversion
markdown_text = pymupdf4llm.to_markdown("research_paper.pdf")
print(markdown_text[:500])
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)
# With page chunks for RAG
chunks = pymupdf4llm.to_markdown("research_paper.pdf", page_chunks=True)
for chunk in chunks:
    print(f"Page {chunk['metadata']['page']}: {len(chunk['text'])} chars")
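A full page is often larger than you want for retrieval. The sketch below splits each page chunk further on blank lines while keeping the page number attached. It assumes only the dict shape used in the loop above (`text` and `metadata["page"]`); `split_page_chunks` and `max_chars` are hypothetical names, and a single paragraph longer than `max_chars` is kept whole rather than cut.

```python
from typing import Dict, List


def split_page_chunks(pages: List[dict], max_chars: int = 1000) -> List[Dict]:
    """Split each page's Markdown on blank lines into pieces of roughly
    max_chars, keeping the source page number on every piece."""
    records = []
    for page in pages:
        page_no = page["metadata"]["page"]
        buffer = ""
        for para in page["text"].split("\n\n"):
            para = para.strip()
            if not para:
                continue
            # Start a new piece when adding this paragraph would overflow.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                records.append({"page": page_no, "text": buffer})
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        if buffer:
            records.append({"page": page_no, "text": buffer})
    return records
```

Keeping the page number on each record lets a RAG answer cite where in the PDF a passage came from.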
Pros
- Very fast — processes 20 pages in seconds
- Good table detection
- Page chunk output for RAG
- No GPU required
- Well-documented
Cons
- No AI cleanup for line wraps
- No heading normalization
- Limited handling of complex layouts
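Heading normalization can be added as a post-processing step on the Markdown string. The sketch below is one possible approach, not something PyMuPDF4LLM provides: the hypothetical `normalize_headings` shifts ATX headings so the shallowest one becomes `top_level`, and prevents any heading from jumping more than one level deeper than the one before it. Lines inside fenced code blocks are left alone.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _heading_matches(lines):
    """Yield (index, match) for heading lines outside fenced code blocks."""
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        elif not in_fence and (m := HEADING.match(line)):
            yield i, m


def normalize_headings(markdown: str, top_level: int = 1) -> str:
    """Renumber headings so the shallowest is top_level and no heading is
    more than one level deeper than the heading before it."""
    lines = markdown.splitlines()
    found = list(_heading_matches(lines))
    if not found:
        return markdown
    shift = top_level - min(len(m.group(1)) for _, m in found)
    previous = top_level - 1
    for i, m in found:
        level = len(m.group(1)) + shift
        level = max(top_level, min(level, previous + 1, 6))
        previous = level
        lines[i] = "#" * level + " " + m.group(2).strip()
    return "\n".join(lines)
```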
Best for
Speed-sensitive applications and developers who need page-level chunks.
Method 4: docling — Best for document understanding pipelines
docling from IBM focuses on document understanding with strong layout analysis.
Installation
pip install docling
Code example
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("research_paper.pdf")
# Get Markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)
# Get structured data (JSON)
json_data = result.document.export_to_dict()
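The exported dict still has to be serialized before it becomes a JSON file. A minimal sketch, assuming the dict contains only JSON-serializable values; `save_json` is a hypothetical helper, not part of docling:

```python
import json
from pathlib import Path


def save_json(data: dict, path: str) -> Path:
    """Write a dict as pretty-printed UTF-8 JSON and return the output path."""
    target = Path(path)
    # ensure_ascii=False keeps accented and non-Latin text readable in the file.
    target.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target
```

With the example above, this would be called as `save_json(json_data, "output.json")`.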
Pros
- Strong layout analysis
- Good table extraction
- Exports to Markdown, JSON, and DocTags
- Active development from IBM
Cons
- Requires GPU for best performance
- More complex API than MarkItDown
- Slower than PyMuPDF4LLM
Best for
Teams building document processing pipelines that need structured output.
Method 5: pdfplumber — Best for fine-grained control
pdfplumber does not produce Markdown directly, but gives you the tools to build custom extraction logic.
Installation
pip install pdfplumber
Code example
import pdfplumber
import re
def cell_text(cell):
    # Empty cells come back as None; newlines inside a cell would break the row
    return "" if cell is None else str(cell).replace("\n", " ")

def pdf_to_markdown(pdf_path):
    markdown_lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text (note: this also includes the text of any tables)
            text = page.extract_text()
            if text:
                # Basic cleanup: join broken lines
                text = re.sub(r'-\n', '', text)  # Join hyphenated words
                text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)  # Join single line breaks, keep blank lines
                markdown_lines.append(text)
            # Extract tables
            tables = page.extract_tables()
            for table in tables:
                if table:
                    # Convert to Markdown table, treating the first row as the header
                    header = [cell_text(c) for c in table[0]]
                    rows = table[1:]
                    md_table = "| " + " | ".join(header) + " |\n"
                    md_table += "| " + " | ".join(["---"] * len(header)) + " |\n"
                    for row in rows:
                        md_table += "| " + " | ".join(cell_text(c) for c in row) + " |\n"
                    markdown_lines.append(md_table)
            markdown_lines.append("\n---\n")  # Page separator
    return "\n".join(markdown_lines)

markdown_text = pdf_to_markdown("research_paper.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)
Pros
- Full control over extraction logic
- Good table extraction
- No GPU required
- Lightweight
Cons
- No built-in Markdown output — you build it yourself
- No heading detection
- No AI cleanup
- More code to write and maintain
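Heading detection is one of the pieces you would have to build. A common heuristic is to compare font sizes: text noticeably larger than the body size is probably a heading. The sketch below works on plain dicts so it runs without a PDF. It assumes each line has already been reduced to `{"text": ..., "size": ...}`; one way to get sizes, assuming your pdfplumber version supports it, is `page.extract_words(extra_attrs=["size"])` followed by grouping words into lines. The `mark_headings` name and the 1.2 ratio are illustrative.

```python
from collections import Counter
from typing import List


def mark_headings(lines: List[dict], ratio: float = 1.2) -> List[str]:
    """Prefix lines whose font size is clearly above the body size with '#'.

    Each item is assumed to look like {"text": str, "size": float}.
    """
    if not lines:
        return []
    sizes = [round(line["size"], 1) for line in lines]
    # Treat the most common size as body text.
    body_size = Counter(sizes).most_common(1)[0][0]
    # Larger sizes map to shallower headings: biggest -> '#', next -> '##', ...
    heading_sizes = sorted({s for s in sizes if s >= body_size * ratio}, reverse=True)
    out = []
    for line, size in zip(lines, sizes):
        if size in heading_sizes:
            level = min(heading_sizes.index(size) + 1, 6)
            out.append("#" * level + " " + line["text"])
        else:
            out.append(line["text"])
    return out
```

Font size alone misses bold same-size headings and can misfire on large captions, so treat the output as a first pass.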
Best for
Custom extraction pipelines where you need fine-grained control.
When to use an online tool instead
Python libraries are powerful but have trade-offs:
| Factor | Python Libraries | pdftomd.xyz |
| --- | --- | --- |
| Setup | Install Python, dependencies | None (browser-based) |
| GPU | Often required | Not needed |
| AI cleanup | Varies by library | Always included |
| RAG output | Build yourself | Built-in JSON + chunks |
| Batch | Write your own script | Built-in batch mode |
| Obsidian frontmatter | Build yourself | Built-in |
| Cost | Free (open source) | Free preview, $9/mo for full |
If you want clean Markdown without writing code, pdftomd.xyz offers AI-powered cleanup, multiple output modes, and a free preview — no Python required.
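If you stay with Python, the "build yourself" rows are mostly small scripts. As one example, here is a minimal frontmatter sketch. The field names (`title`, `source`, `tags`) and the `add_frontmatter` helper are assumptions chosen for illustration, not a required Obsidian schema. Strings are quoted with `json.dumps`, which produces valid YAML double-quoted scalars.

```python
import json
from typing import Sequence


def add_frontmatter(
    markdown: str, title: str, source: str, tags: Sequence[str] = ()
) -> str:
    """Prepend a YAML frontmatter block to a Markdown string."""
    lines = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"source: {json.dumps(source, ensure_ascii=False)}",
    ]
    if tags:
        lines.append("tags:")
        lines.extend(f"  - {json.dumps(tag, ensure_ascii=False)}" for tag in tags)
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown
```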
Recommendation
- Best quality: Marker (if you have a GPU)
- Best speed: PyMuPDF4LLM
- Best for multi-format: MarkItDown
- Best for pipelines: docling
- Best for custom control: pdfplumber
- Best no-code option: pdftomd.xyz
For most developers, PyMuPDF4LLM is the best starting point — it is fast, produces good structure, and has a simple API. Switch to Marker if you need maximum quality and have a GPU.
For RAG pipelines specifically, consider using a Python library for extraction and pdftomd.xyz RAG-ready mode for chunk-friendly output with JSON export.
FAQ
What is the best Python library for PDF to Markdown?
For most use cases, PyMuPDF4LLM offers the best balance of speed, quality, and ease of use. For maximum quality, Marker is the best choice if you have a GPU.
Can I convert PDF to Markdown in Python for free?
Yes. All five libraries covered (Marker, MarkItDown, PyMuPDF4LLM, docling, pdfplumber) are free to install and open source. License terms differ between projects, so check them before commercial use.
Which Python library preserves tables best?
Marker has the best table preservation, followed by docling and PyMuPDF4LLM. MarkItDown and pdfplumber have limited or manual table support.
Can I use these libraries for RAG?
Yes, but you need to build the chunking logic yourself. PyMuPDF4LLM has a page_chunks parameter that helps. For built-in RAG output, use pdftomd.xyz RAG-ready mode.
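As a starting point for your own chunking logic, the sketch below splits a Markdown string at headings and records each chunk's heading path, which works on the output of any of the five libraries. `chunk_by_heading` is a hypothetical helper. It does not limit chunk size, so long sections may need a second split.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict]:
    """Split Markdown into one chunk per heading, recording the heading path."""
    chunks, stack, body = [], [], []  # stack holds (level, title) of open headings
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in stack)
            chunks.append({"path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        m = None if in_fence else HEADING.match(line)
        if m:
            flush()  # close the previous section before changing the path
            level = len(m.group(1))
            # Drop headings at the same or a deeper level, then open this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path with each chunk gives the retriever some context about where a passage sits in the document.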
Do I need a GPU to convert PDF to Markdown in Python?
Only Marker and docling benefit significantly from a GPU. PyMuPDF4LLM, MarkItDown, and pdfplumber run fine on CPU.