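PDF to Markdown in Python: 5 Approaches Compared (with Code)

A practical comparison of five Python libraries for converting PDF to Markdown: Marker, MarkItDown, PyMuPDF4LLM, docling, and pdfplumber. Includes code examples and an analysis of pros and cons.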

July 5, 2026, by PDF to MD Team

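Python has no shortage of PDF-to-Markdown libraries. But which ones produce clean, structured Markdown from real-world PDFs?

We tested five popular Python libraries on a 20-page paper containing tables, citations, and multi-level headings. Here are the results.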

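Quick comparison

| Library | Tables | Headings | AI cleanup | Speed | Best for |
| --- | --- | --- | --- | --- | --- |
| Marker | ✅ Excellent | ✅ | ✅ | Slow (needs GPU) | Best overall quality |
| MarkItDown | ❌ Poor | ⚠️ | ❌ | Fast | Quick text extraction |
| PyMuPDF4LLM | ✅ Good | ✅ | ❌ | Very fast | Speed plus structure |
| docling | ✅ Good | ✅ | ✅ | Medium | Document pipelines |
| pdfplumber | ⚠️ Manual | ❌ | ❌ | Medium | Fine-grained control |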

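Method 1: Marker (best overall quality)

Marker uses deep learning models to detect layout, tables, and reading order. In our test it produced the highest-quality Markdown of the five libraries.

Install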

pip install marker-pdf

Code example

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("research_paper.pdf")

# Get the Markdown output
markdown_text = rendered.markdown
print(markdown_text[:500])

# Save to a file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

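Pros

Excellent table preservation
Accurate heading-level detection
Fixes reading order in multi-column layouts
Active community

Cons

Needs a GPU for reasonable speed
Downloads large model files (about 2 GB) on first run
Slower than the other libraries

Best for

Projects where output quality matters more than speed.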

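Method 2: MarkItDown (multi-format conversion)

MarkItDown comes from Microsoft and supports PDF, Word, Excel, PowerPoint, and more. It is simple to use, but its PDF handling is fairly basic.

Install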

pip install 'markitdown[pdf]'  # PDF support is an optional extra since the 0.1 releases

Code example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")

markdown_text = result.text_content
print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

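Pros

Supports many file formats
Very simple API
No GPU required
Backed by Microsoft

Cons

Tables are often lost or garbled
No line-break repair
No heading-level detection
No removal of page noise

Best for

Quick text extraction where structure does not matter, or when you need to convert many file types.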

Method 3: PyMuPDF4LLM (speed first)

PyMuPDF4LLM is the fastest option and produces decent Markdown structure.

Install

pip install pymupdf4llm

Code example

import pymupdf4llm

# Basic conversion
markdown_text = pymupdf4llm.to_markdown("research_paper.pdf")

print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

# Per-page chunks for RAG
chunks = pymupdf4llm.to_markdown("research_paper.pdf", page_chunks=True)
for i, chunk in enumerate(chunks):
    print(f"Page {chunk['metadata']['page']}: {len(chunk['text'])} chars")

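Pros

Very fast: 20 pages processed in a few seconds
Good table detection
Per-page chunk output supports RAG
No GPU required
Well documented

Cons

No AI line-break repair
No heading normalization
Limited handling of complex layouts

Best for

Speed-sensitive applications and developers who need page-level chunks.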
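Heading normalization, listed above as missing, can be approximated in a post-processing pass. The sketch below is not part of PyMuPDF4LLM; `normalize_headings` is a hypothetical helper that assumes ATX-style (`#`) headings and renumbers them by nesting depth, so the shallowest heading becomes `#` and no level is skipped.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE_RE = re.compile(r"^\s*(```|~~~)")

def normalize_headings(markdown: str) -> str:
    """Renumber ATX headings by nesting depth (hypothetical helper)."""
    out = []
    stack = []  # original levels of the currently open headings, strictly increasing
    in_fence = False
    for line in markdown.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence  # do not touch '#' lines inside code fences
        m = None if in_fence else HEADING_RE.match(line)
        if m is None:
            out.append(line)
            continue
        level = len(m.group(1))
        # Close every open heading at the same depth or deeper than this one.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        out.append("#" * len(stack) + " " + m.group(2))
    return "\n".join(out)
```

It can be applied directly to the string returned by `pymupdf4llm.to_markdown(...)`. Because it only looks at relative depth, a document whose first heading is `###` comes out starting at `#`.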

Method 4: docling (document-understanding pipeline)

docling comes from IBM and focuses on document understanding, with strong layout analysis.

Install

pip install docling

Code example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("research_paper.pdf")

# Get the Markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

# Get structured data (JSON-serializable dict)
json_data = result.document.export_to_dict()

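Pros

Strong layout analysis
Good table extraction
Exports Markdown, JSON, and DocTags
Actively developed by IBM

Cons

Needs a GPU for best performance
API is more complex than MarkItDown's
Slower than PyMuPDF4LLM

Best for

Teams building document-processing pipelines that need structured output.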

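Method 5: pdfplumber (fine-grained control)

pdfplumber does not generate Markdown directly, but it provides the tools to build your own extraction logic.

Install

pip install pdfplumber

Code example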

import pdfplumber
import re

def cell_text(cell):
    # pdfplumber returns None for empty cells; newlines would break a Markdown row
    return "" if cell is None else str(cell).replace("\n", " ")

def pdf_to_markdown(pdf_path):
    markdown_lines = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            text = page.extract_text()
            if text:
                # Basic cleanup: merge broken lines
                text = re.sub(r'-\n', '', text)  # rejoin words hyphenated across lines
                text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)  # merge single line breaks, keep blank lines
                markdown_lines.append(text)

            # Extract tables
            tables = page.extract_tables()
            for table in tables:
                if table:
                    # Convert to a Markdown table, treating the first row as the header
                    header = [cell_text(c) for c in table[0]]
                    rows = table[1:]
                    md_table = "| " + " | ".join(header) + " |\n"
                    md_table += "| " + " | ".join(["---"] * len(header)) + " |\n"
                    for row in rows:
                        md_table += "| " + " | ".join(cell_text(c) for c in row) + " |\n"
                    markdown_lines.append(md_table)

            markdown_lines.append("\n---\n")  # page separator

    return "\n".join(markdown_lines)

markdown_text = pdf_to_markdown("research_paper.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

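Pros

Full control over the extraction logic
Good table extraction
No GPU required
Lightweight

Cons

No built-in Markdown output; you build it yourself
No heading detection
No AI cleanup
Requires more code

Best for

Custom extraction pipelines that need fine-grained control.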
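Since pdfplumber has no heading detection, one common workaround is a font-size heuristic. The sketch below is an illustration under stated assumptions, not a pdfplumber feature: `tag_headings` is a hypothetical helper that takes `(text, font_size)` pairs in reading order, treats the most common size as body text, and maps larger sizes to heading levels. One way to obtain sizes is `page.extract_words(extra_attrs=["size"])`, after which you group the words into lines yourself.

```python
from collections import Counter

def tag_headings(lines, max_levels=3):
    """Turn (text, font_size) pairs into Markdown lines (hypothetical helper).

    Sizes larger than the body size become headings; the largest size maps to '#'.
    """
    if not lines:
        return []
    # Assumption: the most frequent (rounded) font size is body text.
    sizes = [round(size, 1) for _, size in lines]
    body_size = Counter(sizes).most_common(1)[0][0]
    # Keep the largest sizes above body text, biggest first: '#', '##', '###'.
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)[:max_levels]
    level_of = {s: i + 1 for i, s in enumerate(heading_sizes)}
    out = []
    for (text, _), size in zip(lines, sizes):
        level = level_of.get(size)
        out.append(f"{'#' * level} {text}" if level else text)
    return out
```

The heuristic breaks down on PDFs that mark headings with bold weight rather than size, so treat it as a starting point and check the output against a few real pages.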

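When to use an online tool

Python libraries are powerful, but they come with trade-offs:

| Factor | Python libraries | pdftomd.xyz |
| --- | --- | --- |
| Setup | Requires Python and dependencies | None; runs in the browser |
| GPU | Often needed | Not needed |
| AI cleanup | Varies by library | Always included |
| RAG output | Build it yourself | Built-in JSON + chunks |
| Batch processing | Write your own script | Built-in batch mode |
| Obsidian frontmatter | Build it yourself | Built in |
| Cost | Free (open source) | Free preview, ¥65/month for full access |

If you want clean Markdown without writing code, pdftomd.xyz offers AI-driven cleanup, multiple output modes, and a free preview, with no Python required.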
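For readers staying with the Python route, the "build it yourself" row for Obsidian frontmatter is a small job. The following is a minimal sketch: `with_frontmatter` and its field names (`title`, `source`, `created`, `tags`) are illustrative choices, not a fixed Obsidian schema. It relies on YAML accepting JSON-style quoted strings and lists, so `json.dumps` handles the escaping.

```python
import json
from datetime import date

def with_frontmatter(markdown, title, source, tags=(), created=None):
    """Prepend a YAML frontmatter block to Markdown text (hypothetical helper)."""
    created = created or date.today()
    lines = [
        "---",
        # json.dumps quotes and escapes the values in a YAML-compatible way.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"source: {json.dumps(source, ensure_ascii=False)}",
        f"created: {created.isoformat()}",
        f"tags: {json.dumps(list(tags), ensure_ascii=False)}",
        "---",
        "",
    ]
    return "\n".join(lines) + markdown
```

Call it on the Markdown string produced by any of the five libraries before writing the file into your vault.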

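FAQ

What is the best Python library for converting PDF to Markdown?

For most use cases, PyMuPDF4LLM offers the best balance of speed, quality, and ease of use. If you need the highest quality and have a GPU, Marker is the best choice.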

Can I convert PDF to Markdown with Python for free?

Yes. All five libraries covered here (Marker, MarkItDown, PyMuPDF4LLM, docling, pdfplumber) are free and open source.

Which Python library preserves tables best?

Marker preserves tables best, followed by docling and PyMuPDF4LLM. MarkItDown and pdfplumber have limited table support or require manual handling.

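Can these libraries be used for RAG?

Yes, but you need to build the chunking logic yourself. PyMuPDF4LLM has a page_chunks parameter. For built-in RAG output, use the RAG-ready mode on pdftomd.xyz.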
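As a rough idea of what "build the chunking logic yourself" involves, here is a minimal sketch of heading-aware chunking. `chunk_markdown` and the `max_chars` default are assumptions for illustration. It starts a new chunk at every `#` heading and then splits oversized sections on blank lines. A single paragraph longer than `max_chars` is left whole, and `#` lines inside fenced code blocks are not handled specially.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}\s")

def chunk_markdown(markdown: str, max_chars: int = 1500):
    """Split Markdown into heading-aligned chunks of roughly max_chars (hypothetical helper)."""
    # First pass: cut the text into sections at every heading line.
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING_RE.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Second pass: split oversized sections on paragraph boundaries.
    chunks = []
    for section in sections:
        if not section:
            continue
        buf = ""
        for para in section.split("\n\n"):
            candidate = f"{buf}\n\n{para}" if buf else para
            if buf and len(candidate) > max_chars:
                chunks.append(buf)
                buf = para
            else:
                buf = candidate
        chunks.append(buf)
    return chunks
```

The same function works on whole-document Markdown or on the `text` field of each PyMuPDF4LLM page chunk, depending on whether you want chunks to cross page boundaries.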

Do I need a GPU to convert PDFs?

Only Marker and docling benefit significantly from a GPU. PyMuPDF4LLM, MarkItDown, and pdfplumber run well on a CPU.

Want clean Markdown without writing code? Try pdftomd.xyz: free preview, no sign-up required.

