diff --git a/README.md b/README.md
index 848e1cf4ba9c3ea9137d439524916a2ae0d9ec3b..9d286916c5ab57140c10450af9d3b69f29e88f04 100644
--- a/README.md
+++ b/README.md
@@ -1,633 +1,199 @@
-
- -

- -

- - - -[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) -[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) -[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) -[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) -[![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/) -[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/) -[![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru) -[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru) -[![OpenDataLab](https://img.shields.io/badge/webapp_on_mineru.net-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github) -[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) -[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) -[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb) -[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839) -[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU) - - -opendatalab%2FMinerU | Trendshift - - - -[English](README.md) | [简体中文](README_zh-CN.md) - - - -

-🚀Access MinerU Now→✅ Zero-Install Web Version ✅ Full-Featured Desktop Client ✅ Instant API Access; Skip deployment headaches – get all product formats in one click. Developers, dive in! -

- - - -

- 👋 join us on Discord and WeChat -

- -
- -# Changelog - -- 2025/07/16 2.1.1 Released - - Bug fixes - - Fixed text block content loss issue that could occur in certain `pipeline` scenarios #3005 - - Fixed issue where `sglang-client` required unnecessary packages like `torch` #2968 - - Updated `dockerfile` to fix incomplete text content parsing due to missing fonts in Linux #2915 - - Usability improvements - - Updated `compose.yaml` to facilitate direct startup of `sglang-server`, `mineru-api`, and `mineru-gradio` services - - Launched brand new [online documentation site](https://opendatalab.github.io/MinerU/), simplified readme, providing better documentation experience -- 2025/07/05 Version 2.1.0 Released - - This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The detailed update contents are as follows: - - **Performance Optimizations:** - - Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side). - - Greatly enhanced post-processing speed when the `pipeline` backend handles batch processing of documents with fewer pages (<10 pages). - - Layout analysis speed of the `pipeline` backend has been increased by approximately 20%. - - **Experience Enhancements:** - - Built-in ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](#3-api-calls-or-visual-invocation). - - Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer). - - Added transparent parameter passing for all commands related to `sglang`, allowing the `sglang-engine` backend to receive all `sglang` parameters consistently with the `sglang-server`. 
- - Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](#4-extending-mineru-functionality-through-configuration-files). - - **New Features:** - - Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html) - - Introduced limited support for vertical text layout in the `pipeline` backend. - -
- History Log -
- 2025/06/20 2.0.6 Released - -
- -
- 2025/06/17 2.0.5 Released - -
- -
- 2025/06/15 2.0.3 released - -
- -
- 2025/06/13 2.0.0 Released - -
-
- 2025/05/24 Release 1.3.12 - -
- -
- 2025/04/29 Release 1.3.10 - -
- -
- 2025/04/27 Release 1.3.9 - -
- -
- 2025/04/23 Release 1.3.8 - -
- -
- 2025/04/22 Release 1.3.7 - -
- -
- 2025/04/16 Release 1.3.4 - -
- -
- 2025/04/12 Release 1.3.2 - -
- -
- 2025/04/08 Release 1.3.1 - -
- -
- 2025/04/03 Release 1.3.0 - -
-
- 2025/03/03 1.2.1 released - -
- -
- 2025/02/24 1.2.0 released -

This version includes several fixes and improvements to enhance parsing efficiency and accuracy:

- -
- -
- 2025/01/22 1.1.0 released -

In this version we have focused on improving parsing accuracy and efficiency:

- -
- -
- 2025/01/10 1.0.1 released -

This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:

- -
- -
- 2024/11/22 0.10.0 released -

Introducing hybrid OCR text extraction capabilities:

- -
- -
- 2024/11/15 0.9.3 released -

Integrated RapidTable for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.

-
- -
- 2024/11/06 0.9.2 released -

Integrated the StructTable-InternVL2-1B model for table recognition functionality.

-
- -
- 2024/10/31 0.9.0 released -

This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:

- -
- -
- 2024/09/27 Version 0.8.1 released -

Fixed some bugs, and providing a localized deployment version of the online demo and the front-end interface.

-
- -
- 2024/09/09 Version 0.8.0 released -

Supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.

-
- -
- 2024/08/30 Version 0.7.1 released -

Add paddle tablemaster table recognition option

-
- -
- 2024/08/09 Version 0.7.0b1 released -

Simplified installation process, added table recognition functionality

-
- -
- 2024/08/01 Version 0.6.2b1 released -

Optimized dependency conflict issues and installation documentation

-
- -
- 2024/07/05 Initial open-source release -
-
+ # MinerU -## Project Introduction - -MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format. -MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models. -Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue on [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**. - -https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c - -## Key Features - -- Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence. -- Output text in human-readable order, suitable for single-column, multi-column, and complex layouts. -- Preserve the structure of the original document, including headings, paragraphs, lists, etc. -- Extract images, image descriptions, tables, table titles, and footnotes. -- Automatically recognize and convert formulas in the document to LaTeX format. -- Automatically recognize and convert tables in the document to HTML format. -- Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality. -- OCR supports detection and recognition of 84 languages. -- Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats. -- Supports various visualization results, including layout visualization and span visualization, for efficient confirmation of output quality. -- Supports running in a pure CPU environment, and also supports GPU(CUDA)/NPU(CANN)/MPS acceleration -- Compatible with Windows, Linux, and Mac platforms. - -# Quick Start - -If you encounter any installation issues, please first consult the FAQ.
-If the parsing results are not as expected, refer to the Known Issues.
- -## Online Experience - -### Official online web application -The official online version has the same functionality as the client, with a beautiful interface and rich features, requires login to use - -- [![OpenDataLab](https://img.shields.io/badge/webapp_on_mineru.net-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github) - -### Gradio-based online demo -A WebUI developed based on Gradio, with a simple interface and only core parsing functionality, no login required - -- [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) -- [![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) - -## Local Deployment - - -> [!WARNING] -> **Pre-installation Notice—Hardware and Software Environment Support** -> -> To ensure the stability and reliability of the project, we only optimize and test for specific hardware and software environments during development. This ensures that users deploying and running the project on recommended system configurations will get the best performance with the fewest compatibility issues. -> -> By focusing resources on the mainline environment, our team can more efficiently resolve potential bugs and develop new features. -> -> In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Parsing Backend | pipeline | vlm-transformers | vlm-sglang
Operating System | Linux / Windows / macOS | Linux / Windows | Linux / Windows (via WSL2)
CPU Inference Support
GPU Requirements | Turing architecture and later, 6GB+ VRAM or Apple Silicon | Turing architecture and later, 8GB+ VRAM
Memory Requirements | Minimum 16GB+, recommended 32GB+
Disk Space Requirements | 20GB+, SSD recommended
Python Version | 3.10-3.13
-
-### Install MinerU
-
-#### Install MinerU using pip or uv
+MinerU is a practical tool for converting files (such as PDFs) to Markdown, with support for the command line, API, WebUI �://opendatalab.github.io/MinerU/zh/quick_start/docker_deployment/).
+
+## Install MinerU
+
+### Install MinerU using pip or uv
+
+We recommend installing MinerU with `pip` or `uv`. Make sure your Python version is within the 3.10-3.13 range.
+
 ```bash
 pip install --upgrade pip
 pip install uv
-uv pip install -U "mineru[core]"
+uv pip install mineru
 ```
-#### Install MinerU from source code
+### Install MinerU from source code
+
+To install from source, use the following commands:
+
 ```bash
-git clone https://github.com/opendatalab/MinerU.git
+git clone https://gitee.com/open-data-lab/MinerU.git
 cd MinerU
-uv pip install -e .[core]
+pip install -e .
 ```
-> [!TIP]
-> `mineru[core]` includes all core features except `sglang` acceleration, compatible with Windows / Linux / macOS systems, suitable for most users.
-> If you need to use `sglang` acceleration for VLM model inference or install a lightweight client on edge devices, please refer to the documentation [Extension Modules Installation Guide](https://opendatalab.github.io/MinerU/quick_start/extension_modules/).
+### Deploy MinerU using Docker
+
+MinerU provides a convenient Docker deployment method that quickly sets up the environment and resolves some complex dependency issues. For details, see the [Docker deployment documentation](https://opendatalab.github.io/MinerU/zh/quick_start/docker_deployment/).
+
+## Using MinerU
----
-
-#### Deploy MinerU using Docker
-MinerU provides a convenient Docker deployment method, which helps quickly set up the environment and solve some tricky environment compatibility issues.
-You can get the [Docker Deployment Instructions](https://opendatalab.github.io/MinerU/quick_start/docker_deployment/) in the documentation.
+MinerU can be used in several ways, including the command line, API, and WebUI.
----
+### Quick Usage Example
-### Using MinerU
+The simplest command line invocation is:
-The simplest command line invocation is:
 ```bash
 mineru -p -o
 ```
-You can use MinerU for PDF parsing through various methods such as command line, API, and WebUI. For detailed instructions, please refer to the [Usage Guide](https://opendatalab.github.io/MinerU/usage/).
-
-# TODO
-
-- [x] Reading order based on the model
-- [x] Recognition of `index` and `list` in the main text
-- [x] Table recognition
-- [x] Heading Classification
-- [x] Handwritten Text Recognition
-- [x] Vertical Text Recognition
-- [x] Latin Accent Mark Recognition
-- [ ] Code block recognition in the main text
-- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
-- [ ] Geometric shape recognition
-
-# Known Issues
-
-- Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
-- Limited support for vertical text.
-- Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
-- Code blocks are not yet supported in the layout model.
-- Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
-- Table recognition may result in row/column recognition errors in complex tables.
-- OCR recognition may produce inaccurate characters in PDFs of lesser-known languages (e.g., diacritical marks in Latin script, easily confused characters in Arabic script).
-- Some formulas may not render correctly in Markdown.
-
-# FAQ
-
-- If you encounter any issues during usage, you can first check the [FAQ](https://opendatalab.github.io/MinerU/faq/) for solutions.
-- If your issue remains unresolved, you may also use [DeepWiki](https://deepwiki.com/opendatalab/MinerU) to interact with an AI assistant, which can address most common problems.
-- If you still cannot resolve the issue, you are welcome to join our community via [Discord](https://discord.gg/Tdedn9GTXq) or [WeChat](http://mineru.space/s/V85Yl) to discuss with other users and developers.
-
-# All Thanks To Our Contributors
-
-
-
-
-
-# License Information
-
-[LICENSE.md](LICENSE.md)
-
-Currently, some models in this project are trained based on YOLO.
However, since YOLO follows the AGPL license, it may impose restrictions on certain use cases. In future iterations, we plan to explore and replace these with models under more permissive licenses to enhance user-friendliness and flexibility.
-
-# Acknowledgments
-
-- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
-- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
-- [UniMERNet](https://github.com/opendatalab/UniMERNet)
-- [RapidTable](https://github.com/RapidAI/RapidTable)
-- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
-- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
-- [layoutreader](https://github.com/ppaanngggg/layoutreader)
-- [xy-cut](https://github.com/Sanster/xy-cut)
-- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
-- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
-- [pdftext](https://github.com/datalab-to/pdftext)
-- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
-- [pypdf](https://github.com/py-pdf/pypdf)
-
-# Citation
-
-```bibtex
-@misc{wang2024mineruopensourcesolutionprecise,
- title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
- author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
- year={2024},
- eprint={2409.18839},
- archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/2409.18839},
-}
-
-@article{he2024opendatalab,
- title={Opendatalab: Empowering general artificial intelligence with open datasets},
- author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
- journal={arXiv preprint arXiv:2407.13773},
- year={2024}
-}
+For example, to parse a PDF with the `pipeline` backend:
+
+```bash
+mineru -p ./example.pdf -o ./output/
+```
+
+The following options allow more detailed configuration:
+
+- `-m`, `--method`: Parsing method; supports
`auto`, `txt`, `ocr`.
+- `-b`, `--backend`: Parsing backend; supports `pipeline`, `vlm-transformers`, `vlm-sglang-engine`, `vlm-sglang-client`.
+- `-l`, `--lang`: Document language, such as `ch` (Chinese) or `en` (English).
+- `-u`, `--url`: Server address, required when using the `vlm-sglang-client` backend, e.g. `http://127.0.0.1:30000`.
+- `-f`, `--formula`: Enable formula recognition (default: `True`).
+- `-t`, `--table`: Enable table recognition (default: `True`).
+
+### Advanced Usage
+
+#### Accelerating VLM Model Inference with SGLang
+
+MinerU supports using `sglang` to accelerate VLM model inference. Enable it as follows:
+
+```bash
+mineru -p ./example.pdf -o ./output/ --backend vlm-sglang-engine --server-url http://127.0.0.1:30000
 ```
-# Star History
-
-
-
-
-
- Star History Chart
-
-
-
-
-# Links
-- [Easy Data Preparation with latest LLMs-based Operators and Pipelines](https://github.com/OpenDCAI/DataFlow)
-- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
-- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
-- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
-- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
-- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
-- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
-- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc)
+#### Using Local Models
+
+To use local models instead of downloading them remotely, set the environment variable `MINERU_MODEL_SOURCE=local` and make sure the model files have already been downloaded to local storage.
+
+```bash
+export MINERU_MODEL_SOURCE=local
+mineru -p ./example.pdf -o ./output/
+```
+
+## Output Files
+
+After parsing a file, MinerU generates several output files, including:
+
+- **Markdown file** (`content.md`): the converted Markdown text.
+- **Intermediate JSON file** (`middle.json`): structured data from document parsing.
+- **Model output file** (`model_output.txt`): the raw model output.
+- **Visual debugging files** (`layout.pdf`, `spans.pdf`): visualization PDFs for debugging.
+
+## Quick Start
+
+### Local Deployment
+
+MinerU offers several deployment options: installation with pip, installation from source, and Docker deployment. Docker deployment is recommended because it keeps the runtime environment consistent and simplifies installation.
+
+####
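Deploy with Docker (Recommended)
+
+1. Build the Docker image:
+
+```bash
+docker build -t mineru -f docker/global/Dockerfile .
+```
+
+2. Start the Docker container:
+
+```bash
+docker run -d -v "$PWD":/app -p 8000:8000 --name mineru-container mineru
+```
+
+#### Using the CLI Tool
+
+MinerU provides a command line tool for quick document conversion:
+
+```bash
+mineru -p ./example.pdf -o ./output/
+```
+
+## Version History
+
+- **2024/08/01 - v0.6.2b1**: Resolved dependency conflicts and improved the installation documentation.
+- **2024/07/05 - Initial open-source release**: Provides basic PDF-to-Markdown conversion.
+
+## Supported Environments
+
+- **Python version**: 3.10-3.13
+- **Supported backends**:
+  - `pipeline`: general-purpose parsing mode
+  - `vlm-transformers`: general-purpose VLM mode
+  - `vlm-sglang-engine`: fast inference (requires an `sglang` environment)
+  - `vlm-sglang-client`: client mode (requires a connection to `sglang-server`)
+
+## FAQ
+
+### 1. How do I switch the model source?
+
+The model source can be switched with an environment variable or a command line option:
+
+```bash
+export MINERU_MODEL_SOURCE=modelscope
+```
+
+Or use the command line option:
+
+```bash
+mineru -p ./example.pdf -o ./output/ --source modelscope
+```
+
+### 2. How do I enable OCR?
+
+If the document is an image-based PDF, enable OCR:
+
+```bash
+mineru -p ./example.pdf -o ./output/ --method ocr
+```
+
+### 3. Is Docker deployment recommended?
+
+Yes. Docker deployment ensures the same runtime environment across platforms and simplifies dependency management.
+
+### 4. How do I use MinerU on older Linux systems?
+
+MinerU provides a `pipeline_old_linux` mode for older Linux systems:
+
+```bash
+uv pip install "mineru[pipeline_old_linux]"
+```
+
+## Project Support
+
+MinerU provides several sub-projects to support different usage scenarios, including:
+
+- **multi_gpu_v2**: multi-GPU parallel processing based on LitServe.
+- **mcp**: a document-to-Markdown service based on FastMCP.
+
+For more information, see the [project list](https://gitee.com/open-data-lab/MinerU/projects).
+
+## License Information
+
+MinerU follows the [MinerU Contributor License Agreement](MinerU_CLA.md). All contributors must sign this agreement to ensure the legality and sustainability of the open-source community.
+
+## Thanks to Contributors
+
+Thanks to all the developers and testers who have contributed to MinerU. Your efforts keep the project improving.
+
+## Citation
+
+If you use MinerU in your research or product, please cite the following (if applicable):
+
+> MinerU: A Practical Tool for PDF to Markdown Conversion, OpenDataLab.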
https://gitee.com/open-data-lab/MinerU
+
+For detailed citation formats, see the project documentation.
+
+## Star History
+
+The project's star count reflects the community's continued support for the tool. You can view the star history on the [Gitee project page](https://gitee.com/open-data-lab/MinerU).
+
+## Contact and Support
+
+- **Official website**: [https://mineru.net](https://mineru.net)
+- **Source repository**: [https://gitee.com/open-data-lab/MinerU](https://gitee.com/open-data-lab/MinerU)
+- **Community support**: If you run into problems, please open an issue on [Gitee Issues](https://gitee.com/open-data-lab/MinerU/issues).
\ No newline at end of file
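
Review note: the option list added in this diff (`-p`, `-o`, `-m`, `-b`, `-l`, `-u`) can be sketched as a small Python helper that assembles the command line. This is a minimal sketch and not part of MinerU. `build_mineru_command` is a hypothetical name, the flag spellings and allowed values are copied from the README text in the diff and have not been checked against an installed `mineru` CLI, and the rule that `vlm-sglang-client` needs `-u` comes from the same option list.

```python
import shlex
from typing import List, Optional

# Allowed values copied from the README option list in this diff (assumption:
# they match the installed CLI; this sketch does not verify that).
METHODS = {"auto", "txt", "ocr"}
BACKENDS = {"pipeline", "vlm-transformers", "vlm-sglang-engine", "vlm-sglang-client"}


def build_mineru_command(
    input_path: str,
    output_dir: str,
    method: Optional[str] = None,
    backend: Optional[str] = None,
    lang: Optional[str] = None,
    url: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argv list for the `mineru` CLI."""
    if method is not None and method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if backend is not None and backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # The README says the client backend must be given a server address.
    if backend == "vlm-sglang-client" and not url:
        raise ValueError("vlm-sglang-client requires a server URL (-u)")

    argv = ["mineru", "-p", input_path, "-o", output_dir]
    if method is not None:
        argv += ["-m", method]
    if backend is not None:
        argv += ["-b", backend]
    if lang is not None:
        argv += ["-l", lang]
    if url is not None:
        argv += ["-u", url]
    return argv


if __name__ == "__main__":
    cmd = build_mineru_command(
        "./example.pdf",
        "./output/",
        backend="vlm-sglang-client",
        url="http://127.0.0.1:30000",
    )
    # Print a shell-quoted form of the command for inspection.
    print(shlex.join(cmd))
```

The helper only builds the argument list. Running it (for example with `subprocess.run(cmd, check=True)`) is left to the caller, so the validation can be exercised on a machine without MinerU installed.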