From 7b37943e28c793a47284970fbe824e1b0e23da8b Mon Sep 17 00:00:00 2001
From: gitee-bot
Date: Thu, 17 Jul 2025 16:30:25 +0000
Subject: [PATCH] Update README.md

---
 README.md | 788 ++++++++++++------------------------------------------
 1 file changed, 177 insertions(+), 611 deletions(-)

diff --git a/README.md b/README.md
index 848e1cf4..9d286916 100644
--- a/README.md
+++ b/README.md
@@ -1,633 +1,199 @@
-
- -

- -

-
-
-
-[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
-[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
-[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/)
-[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/)
-[![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru)
-[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru)
-[![OpenDataLab](https://img.shields.io/badge/webapp_on_mineru.net-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
-[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
-[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
-[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
-[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839)
-[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU)
-
-
-opendatalab%2FMinerU | Trendshift
-
-
-
-[English](README.md) | [简体中文](README_zh-CN.md)
-
-
-
-

-🚀Access MinerU Now→✅ Zero-Install Web Version ✅ Full-Featured Desktop Client ✅ Instant API Access; Skip deployment headaches – get all product formats in one click. Developers, dive in! -

- - - -

- 👋 join us on Discord and WeChat -

- -
- -# Changelog - -- 2025/07/16 2.1.1 Released - - Bug fixes - - Fixed text block content loss issue that could occur in certain `pipeline` scenarios #3005 - - Fixed issue where `sglang-client` required unnecessary packages like `torch` #2968 - - Updated `dockerfile` to fix incomplete text content parsing due to missing fonts in Linux #2915 - - Usability improvements - - Updated `compose.yaml` to facilitate direct startup of `sglang-server`, `mineru-api`, and `mineru-gradio` services - - Launched brand new [online documentation site](https://opendatalab.github.io/MinerU/), simplified readme, providing better documentation experience -- 2025/07/05 Version 2.1.0 Released - - This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The detailed update contents are as follows: - - **Performance Optimizations:** - - Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side). - - Greatly enhanced post-processing speed when the `pipeline` backend handles batch processing of documents with fewer pages (<10 pages). - - Layout analysis speed of the `pipeline` backend has been increased by approximately 20%. - - **Experience Enhancements:** - - Built-in ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](#3-api-calls-or-visual-invocation). - - Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer). - - Added transparent parameter passing for all commands related to `sglang`, allowing the `sglang-engine` backend to receive all `sglang` parameters consistently with the `sglang-server`. 
- - Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](#4-extending-mineru-functionality-through-configuration-files). - - **New Features:** - - Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html) - - Introduced limited support for vertical text layout in the `pipeline` backend. - -
- History Log -
- 2025/06/20 2.0.6 Released -
    -
  • Fixed occasional parsing interruptions caused by invalid block content in vlm mode
  • -
  • Fixed parsing interruptions caused by incomplete table structures in vlm mode
  • -
-
- -
- 2025/06/17 2.0.5 Released -
    -
  • Fixed the issue where models were still required to be downloaded in the sglang-client mode
  • -
  • Fixed the issue where the sglang-client mode unnecessarily depended on packages like torch during runtime.
  • -
  • Fixed the issue where only the first instance would take effect when attempting to launch multiple sglang-client instances via multiple URLs within the same process
  • -
-
- -
- 2025/06/15 2.0.3 released -
    -
  • Fixed a configuration file key-value update error that occurred when downloading model type was set to all
  • -
  • Fixed the issue where the formula and table feature toggle switches were not working in command line mode, causing the features to remain enabled.
  • -
  • Fixed compatibility issues with sglang version 0.4.7 in the sglang-engine mode.
  • -
  • Updated Dockerfile and installation documentation for deploying the full version of MinerU in sglang environment
  • -
-
- -
- 2025/06/13 2.0.0 Released -
    -
  • New Architecture: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility. -
      -
    • Removal of Third-party Dependency Limitations: Completely eliminated the dependency on pymupdf, moving the project toward a more open and compliant open-source direction.
    • -
    • Ready-to-use, Easy Configuration: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.
    • -
    • Automatic Model Management: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.
    • -
    • Offline Deployment Friendly: Provides built-in model download commands, supporting deployment requirements in completely offline environments.
    • -
    • Streamlined Code Structure: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.
    • -
    • Unified Intermediate Format Output: Adopted standardized middle_json format, compatible with most secondary development scenarios based on this format, ensuring seamless ecosystem business migration.
    • -
    -
  • -
  • New Model: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding. -
      -
    • Small Model, Big Capabilities: With parameters under 1B, yet surpassing traditional 72B-level vision-language models (VLMs) in parsing accuracy.
    • -
    • Multiple Functions in One: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.
    • -
    • Ultimate Inference Speed: Achieves peak throughput exceeding 10,000 tokens/s through sglang acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.
    • -
    • Online Experience: You can experience our brand-new VLM model on MinerU.net, Hugging Face, and ModelScope.
    • -
    -
  • -
  • Incompatible Changes Notice: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes: -
      -
    • Python package name changed from magic-pdf to mineru, and the command-line tool changed from magic-pdf to mineru. Please update your scripts and command calls accordingly.
    • -
    • For modular system design and ecosystem consistency considerations, MinerU 2.0 no longer includes the LibreOffice document conversion module. If you need to process Office documents, we recommend converting them to PDF format through an independently deployed LibreOffice service before proceeding with subsequent parsing operations.
    • -
    -
  • -
-
-
- 2025/05/24 Release 1.3.12 -
    -
  • Added support for PPOCRv5 models, updated ch_server model to PP-OCRv5_rec_server, and ch_lite model to PP-OCRv5_rec_mobile (model update required) -
      -
    • In testing, we found that PPOCRv5(server) has some improvement for handwritten documents, but has slightly lower accuracy than v4_server_doc for other document types, so the default ch model remains unchanged as PP-OCRv4_server_rec_doc.
    • -
    • Since PPOCRv5 has enhanced recognition capabilities for handwriting and special characters, you can manually choose the PPOCRv5 model for Japanese-Traditional Chinese mixed scenarios and handwritten documents
    • -
    • You can select the appropriate model through the lang parameter lang='ch_server' (Python API) or --lang ch_server (command line): -
        -
      • ch: PP-OCRv4_server_rec_doc (default) (Chinese/English/Japanese/Traditional Chinese mixed/15K dictionary)
      • -
      • ch_server: PP-OCRv5_rec_server (Chinese/English/Japanese/Traditional Chinese mixed + handwriting/18K dictionary)
      • -
      • ch_lite: PP-OCRv5_rec_mobile (Chinese/English/Japanese/Traditional Chinese mixed + handwriting/18K dictionary)
      • -
      • ch_server_v4: PP-OCRv4_rec_server (Chinese/English mixed/6K dictionary)
      • -
      • ch_lite_v4: PP-OCRv4_rec_mobile (Chinese/English mixed/6K dictionary)
      • -
      -
    • -
    -
  • -
  • Added support for handwritten documents through optimized layout recognition of handwritten text areas -
      -
    • This feature is supported by default, no additional configuration required
    • -
    • You can refer to the instructions above to manually select the PPOCRv5 model for better handwritten document parsing results
    • -
    -
  • -
  • The huggingface and modelscope demos have been updated to versions that support handwriting recognition and PPOCRv5 models, which you can experience online
  • -
-
- -
- 2025/04/29 Release 1.3.10 -
    -
  • Added support for custom formula delimiters, which can be configured by modifying the latex-delimiter-config section in the magic-pdf.json file in your user directory.
  • -
-
- -
- 2025/04/27 Release 1.3.9 -
    -
  • Optimized formula parsing functionality, improved formula rendering success rate
  • -
-
- -
- 2025/04/23 Release 1.3.8 -
    -
  • The default ocr model (ch) has been updated to PP-OCRv4_server_rec_doc (model update required) -
      -
    • PP-OCRv4_server_rec_doc is trained on a mixture of more Chinese document data and PP-OCR training data based on PP-OCRv4_server_rec, adding recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It can recognize over 15,000 characters and improves both document-specific and general text recognition abilities.
    • -
    • Performance comparison of PP-OCRv4_server_rec_doc/PP-OCRv4_server_rec/PP-OCRv4_mobile_rec
    • -
    • After verification, the PP-OCRv4_server_rec_doc model shows significant accuracy improvements in Chinese/English/Japanese/Traditional Chinese in both single language and mixed language scenarios, with comparable speed to PP-OCRv4_server_rec, making it suitable for most use cases.
    • -
    • In some pure English scenarios, PP-OCRv4_server_rec_doc may have word adhesion issues, while PP-OCRv4_server_rec performs better in these cases. Therefore, we've kept the PP-OCRv4_server_rec model, which users can access by adding the parameter lang='ch_server' (Python API) or --lang ch_server (command line).
    • -
    -
  • -
-
- -
- 2025/04/22 Release 1.3.7 -
    -
  • Fixed the issue where the lang parameter was ineffective during table parsing model initialization
  • -
  • Fixed the significant speed reduction of OCR and table parsing in cpu mode
  • -
-
- -
- 2025/04/16 Release 1.3.4 -
    -
  • Slightly improved OCR-det speed by removing some unnecessary blocks
  • -
  • Fixed page-internal sorting errors caused by footnotes in certain cases
  • -
-
- -
- 2025/04/12 Release 1.3.2 -
    -
  • Fixed dependency version incompatibility issues when installing on Windows with Python 3.13
  • -
  • Optimized memory usage during batch inference
  • -
  • Improved parsing of tables rotated 90 degrees
  • -
  • Enhanced parsing of oversized tables in financial report samples
  • -
  • Fixed the occasional word adhesion issue in English text areas when OCR language is not specified (model update required)
  • -
-
- -
- 2025/04/08 Release 1.3.1 -
    -
  • Fixed several compatibility issues -
      -
    • Added support for Python 3.13
    • -
    • Made final adaptations for outdated Linux systems (such as CentOS 7) with no guarantee of continued support in future versions, installation instructions
    • -
    -
  • -
-
- -
- 2025/04/03 Release 1.3.0 -
    -
  • Installation and compatibility optimizations -
      -
    • Resolved compatibility issues caused by detectron2 by removing layoutlmv3 usage in layout
    • -
    • Extended torch version compatibility to 2.2~2.6 (excluding 2.5)
    • -
    • Added CUDA compatibility for versions 11.8/12.4/12.6/12.8 (CUDA version determined by torch), solving compatibility issues for users with 50-series and H-series GPUs
    • -
    • Extended Python compatibility to versions 3.10~3.12, fixing the issue of automatic downgrade to version 0.6.1 when installing in non-3.10 environments
    • -
    • Optimized offline deployment process, eliminating the need to download any model files after successful deployment
    • -
    -
  • -
  • Performance optimizations -
      -
    • Enhanced parsing speed for batches of small files by supporting batch processing of multiple PDF files (script example), with formula parsing speed improved by up to 1400% and overall parsing speed improved by up to 500% compared to version 1.0.1
    • -
    • Reduced memory usage and improved parsing speed by optimizing MFR model loading and usage (requires re-running the model download process to get incremental updates to model files)
    • -
    • Optimized GPU memory usage, requiring only 6GB minimum to run this project
    • -
    • Improved running speed on MPS devices
    • -
    -
  • -
  • Parsing effect optimizations -
      -
    • Updated MFR model to unimernet(2503), fixing line break loss issues in multi-line formulas
    • -
    -
  • -
  • Usability optimizations -
      -
    • Completely replaced the paddle framework and paddleocr in the project by using paddleocr2torch, resolving conflicts between paddle and torch, as well as thread safety issues caused by the paddle framework
    • -
    • Added real-time progress bar display during parsing, allowing precise tracking of parsing progress and making the waiting process more bearable
    • -
    -
  • -
-
-
- 2025/03/03 1.2.1 released -
    -
  • Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers
  • -
  • Fixed caption matching inaccuracies in certain scenarios
  • -
  • Fixed formula span loss issues in certain scenarios
  • -
-
- -
- 2025/02/24 1.2.0 released -

This version includes several fixes and improvements to enhance parsing efficiency and accuracy:

-
    -
  • Performance Optimization -
      -
    • Increased classification speed for PDF documents in auto mode.
    • -
    -
  • -
  • Parsing Optimization -
      -
    • Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.
    • -
    • Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.
    • -
    -
  • -
  • Bug Fixes -
      -
    • Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.
    • -
    • Resolved an issue where title blocks were empty in some cases.
    • -
    -
  • -
-
- -
- 2025/01/22 1.1.0 released -

In this version we have focused on improving parsing accuracy and efficiency:

-
    -
  • Model capability upgrade (requires re-executing the model download process to obtain incremental updates of model files) -
      -
    • The layout recognition model has been upgraded to the latest doclayout_yolo(2501) model, improving layout recognition accuracy.
    • -
    • The formula parsing model has been upgraded to the latest unimernet(2501) model, improving formula recognition accuracy.
    • -
    -
  • -
  • Performance optimization -
      -
    • On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.
    • -
    -
  • -
  • Parsing effect optimization -
      -
    • Added a new heading classification feature (testing version, enabled by default) to the online demo (mineru.net/huggingface/modelscope), which supports hierarchical classification of headings, thereby enhancing document structuring.
    • -
    -
  • -
-
- -
- 2025/01/10 1.0.1 released -

This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:

-
    -
  • New API Interface -
      -
    • For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.
    • -
    • For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.
    • -
    -
  • -
  • Enhanced Compatibility -
      -
    • By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.
    • -
    • We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. Ascend NPU Acceleration
    • -
    -
  • -
  • Automatic Language Identification -
      -
    • By introducing a new language recognition model, setting the lang configuration to auto during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.
    • -
    -
  • -
-
- -
- 2024/11/22 0.10.0 released -

Introducing hybrid OCR text extraction capabilities:

-
    -
  • Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
  • -
  • Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
  • -
-
- -
- 2024/11/15 0.9.3 released -

Integrated RapidTable for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.

-
- -
- 2024/11/06 0.9.2 released -

Integrated the StructTable-InternVL2-1B model for table recognition functionality.

-
- -
- 2024/10/31 0.9.0 released -

This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:

-
    -
  • Refactored the sorting module code to use layoutreader for reading order sorting, ensuring high accuracy in various layouts.
  • -
  • Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.
  • -
  • Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.
  • -
  • Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero.
  • -
  • Added multi-language support for OCR, supporting detection and recognition of 84 languages. For the list of supported languages, see OCR Language Support List.
  • -
  • Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.
  • -
  • Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.
  • -
  • Integrated PDF-Extract-Kit 1.0: -
      -
    • Added the self-developed doclayout_yolo model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with layoutlmv3 via the configuration file.
    • -
    • Upgraded formula parsing to unimernet 0.2.1, improving formula parsing accuracy while significantly reducing memory usage.
    • -
    • Due to the repository change for PDF-Extract-Kit 1.0, you need to re-download the model. Please refer to How to Download Models for detailed steps.
    • -
    -
  • -
-
- -
- 2024/09/27 Version 0.8.1 released -

Fixed some bugs, and providing a localized deployment version of the online demo and the front-end interface.

-
- -
- 2024/09/09 Version 0.8.0 released -

Supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.

-
- -
- 2024/08/30 Version 0.7.1 released -

Add paddle tablemaster table recognition option

-
- -
- 2024/08/09 Version 0.7.0b1 released -

Simplified installation process, added table recognition functionality

-
- -
- 2024/08/01 Version 0.6.2b1 released -

Optimized dependency conflict issues and installation documentation

-
- -
- 2024/07/05 Initial open-source release -
-
+ # MinerU -## Project Introduction - -MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format. -MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models. -Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue on [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**. - -https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c - -## Key Features - -- Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence. -- Output text in human-readable order, suitable for single-column, multi-column, and complex layouts. -- Preserve the structure of the original document, including headings, paragraphs, lists, etc. -- Extract images, image descriptions, tables, table titles, and footnotes. -- Automatically recognize and convert formulas in the document to LaTeX format. -- Automatically recognize and convert tables in the document to HTML format. -- Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality. -- OCR supports detection and recognition of 84 languages. -- Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats. -- Supports various visualization results, including layout visualization and span visualization, for efficient confirmation of output quality. -- Supports running in a pure CPU environment, and also supports GPU(CUDA)/NPU(CANN)/MPS acceleration -- Compatible with Windows, Linux, and Mac platforms. - -# Quick Start - -If you encounter any installation issues, please first consult the FAQ.
-If the parsing results are not as expected, refer to the Known Issues.
- -## Online Experience - -### Official online web application -The official online version has the same functionality as the client, with a beautiful interface and rich features, requires login to use - -- [![OpenDataLab](https://img.shields.io/badge/webapp_on_mineru.net-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github) - -### Gradio-based online demo -A WebUI developed based on Gradio, with a simple interface and only core parsing functionality, no login required - -- [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) -- [![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) - -## Local Deployment - - -> [!WARNING] -> **Pre-installation Notice—Hardware and Software Environment Support** -> -> To ensure the stability and reliability of the project, we only optimize and test for specific hardware and software environments during development. This ensures that users deploying and running the project on recommended system configurations will get the best performance with the fewest compatibility issues. -> -> By focusing resources on the mainline environment, our team can more efficiently resolve potential bugs and develop new features. -> -> In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Parsing Backendpipelinevlm-transformersvlm-sglang
Operating SystemLinux / Windows / macOSLinux / WindowsLinux / Windows (via WSL2)
CPU Inference Support
GPU RequirementsTuring architecture and later, 6GB+ VRAM or Apple SiliconTuring architecture and later, 8GB+ VRAM
Memory RequirementsMinimum 16GB+, recommended 32GB+
Disk Space Requirements20GB+, SSD recommended
Python Version3.10-3.13
-
-### Install MinerU
-
-#### Install MinerU using pip or uv
+MinerU is a utility that converts files such as PDFs to Markdown, and can be used through the command line, API, or WebUI.
+
+## Installing MinerU
+
+### Install MinerU using pip or uv
+
+`pip` or `uv` is the recommended way to install MinerU. Make sure your Python version is in the 3.10-3.13 range.
+
 ```bash
 pip install --upgrade pip
 pip install uv
-uv pip install -U "mineru[core]"
+uv pip install mineru
 ```
 
-#### Install MinerU from source code
+### Install MinerU from source
+
+To install from source, run the following commands:
+
 ```bash
-git clone https://github.com/opendatalab/MinerU.git
+git clone https://gitee.com/open-data-lab/MinerU.git
 cd MinerU
-uv pip install -e .[core]
+pip install -e .
 ```
 
-> [!TIP]
-> `mineru[core]` includes all core features except `sglang` acceleration, compatible with Windows / Linux / macOS systems, suitable for most users.
-> If you need to use `sglang` acceleration for VLM model inference or install a lightweight client on edge devices, please refer to the documentation [Extension Modules Installation Guide](https://opendatalab.github.io/MinerU/quick_start/extension_modules/).
+### Deploy MinerU using Docker
+
+MinerU provides a convenient Docker deployment method that sets up the environment quickly and sidesteps some complex dependency issues. For details, see the [Docker deployment documentation](https://opendatalab.github.io/MinerU/zh/quick_start/docker_deployment/).
+
+## Using MinerU
 
----
-
-#### Deploy MinerU using Docker
-MinerU provides a convenient Docker deployment method, which helps quickly set up the environment and solve some tricky environment compatibility issues.
-You can get the [Docker Deployment Instructions](https://opendatalab.github.io/MinerU/quick_start/docker_deployment/) in the documentation.
+MinerU can be used in several ways, including the command line, API, and WebUI.
 
----
+### Quick usage example
 
-### Using MinerU
+The simplest command-line invocation is:
 
-The simplest command line invocation is:
 ```bash
 mineru -p -o
 ```
 
-You can use MinerU for PDF parsing through various methods such as command line, API, and WebUI. For detailed instructions, please refer to the [Usage Guide](https://opendatalab.github.io/MinerU/usage/).
-
-# TODO
-
-- [x] Reading order based on the model
-- [x] Recognition of `index` and `list` in the main text
-- [x] Table recognition
-- [x] Heading Classification
-- [x] Handwritten Text Recognition
-- [x] Vertical Text Recognition
-- [x] Latin Accent Mark Recognition
-- [ ] Code block recognition in the main text
-- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
-- [ ] Geometric shape recognition
-
-# Known Issues
-
-- Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
-- Limited support for vertical text.
-- Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
-- Code blocks are not yet supported in the layout model.
-- Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
-- Table recognition may result in row/column recognition errors in complex tables.
-- OCR recognition may produce inaccurate characters in PDFs of lesser-known languages (e.g., diacritical marks in Latin script, easily confused characters in Arabic script).
-- Some formulas may not render correctly in Markdown.
-
-# FAQ
-
-- If you encounter any issues during usage, you can first check the [FAQ](https://opendatalab.github.io/MinerU/faq/) for solutions.
-- If your issue remains unresolved, you may also use [DeepWiki](https://deepwiki.com/opendatalab/MinerU) to interact with an AI assistant, which can address most common problems.
-- If you still cannot resolve the issue, you are welcome to join our community via [Discord](https://discord.gg/Tdedn9GTXq) or [WeChat](http://mineru.space/s/V85Yl) to discuss with other users and developers.
-
-# All Thanks To Our Contributors
-
-
-
-
-
-# License Information
-
-[LICENSE.md](LICENSE.md)
-
-Currently, some models in this project are trained based on YOLO. However, since YOLO follows the AGPL license, it may impose restrictions on certain use cases. In future iterations, we plan to explore and replace these with models under more permissive licenses to enhance user-friendliness and flexibility.
-
-# Acknowledgments
-
-- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
-- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
-- [UniMERNet](https://github.com/opendatalab/UniMERNet)
-- [RapidTable](https://github.com/RapidAI/RapidTable)
-- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
-- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
-- [layoutreader](https://github.com/ppaanngggg/layoutreader)
-- [xy-cut](https://github.com/Sanster/xy-cut)
-- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
-- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
-- [pdftext](https://github.com/datalab-to/pdftext)
-- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
-- [pypdf](https://github.com/py-pdf/pypdf)
-
-# Citation
-
-```bibtex
-@misc{wang2024mineruopensourcesolutionprecise,
-  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
-  author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
-  year={2024},
-  eprint={2409.18839},
-  archivePrefix={arXiv},
-  primaryClass={cs.CV},
-  url={https://arxiv.org/abs/2409.18839},
-}
-
-@article{he2024opendatalab,
-  title={Opendatalab: Empowering general artificial intelligence with open datasets},
-  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
-  journal={arXiv preprint arXiv:2407.13773},
-  year={2024}
-}
+For example, to parse a PDF with the `pipeline` backend:
+
+```bash
+mineru -p ./example.pdf -o ./output/
+```
+
+You can configure parsing in more detail with the following options:
+
+- `-m`, `--method`: parsing method; one of `auto`, `txt`, `ocr`.
+- `-b`, `--backend`: parsing backend; one of `pipeline`, `vlm-transformers`, `vlm-sglang-engine`, `vlm-sglang-client`.
+- `-l`, `--lang`: document language, for example `ch` (Chinese) or `en` (English).
+- `-u`, `--url`: server address, required when using the `vlm-sglang-client` backend, for example `http://127.0.0.1:30000`.
+- `-f`, `--formula`: enable formula recognition (default `True`).
+- `-t`, `--table`: enable table recognition (default `True`).
+
+### Advanced usage
+
+#### Accelerate VLM inference with SGLang
+
+MinerU can use `sglang` to accelerate VLM model inference. Enable it as follows:
+
+```bash
+mineru -p ./example.pdf -o ./output/ --backend vlm-sglang-engine
 ```
 
-# Star History
-
-
-
-
-
- Star History Chart
-
-
-
-# Links
-- [Easy Data Preparation with latest LLMs-based Operators and Pipelines](https://github.com/OpenDCAI/DataFlow)
-- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
-- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
-- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
-- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
-- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
-- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
-- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc)
+#### Use local models
+
+To use local models instead of downloading them remotely, set the environment variable `MINERU_MODEL_SOURCE=local` and make sure the model files have already been downloaded to local storage.
+
+```bash
+export MINERU_MODEL_SOURCE=local
+mineru -p ./example.pdf -o ./output/
+```
+
+## Output files
+
+After parsing a file, MinerU generates several output files, including:
+
+- **Markdown file** (`content.md`): the converted Markdown text.
+- **Intermediate JSON file** (`middle.json`): structured data from document parsing.
+- **Model output file** (`model_output.txt`): the model's raw output.
+- **Visual debugging files** (`layout.pdf`, `spans.pdf`): visualization PDFs for debugging.
+
+## Quick start
+
+### Local deployment
+
+MinerU can be deployed in several ways: pip installation, source installation, and Docker. Docker deployment is recommended because it keeps the runtime environment consistent and simplifies installation.
+
+#### Deploy with Docker (recommended)
+
+1. Build the Docker image:
+
+```bash
+docker build -t mineru -f docker/global/Dockerfile .
+```
+
+2. Start the Docker container:
+
+```bash
+docker run -d -v $PWD:/app -p 8000:8000 --name mineru-container mineru
+```
+
+#### Use the CLI tool
+
+MinerU provides a command-line tool for quick document conversion:
+
+```bash
+mineru -p ./example.pdf -o ./output/
+```
+
+## Version history
+
+- **2024/08/01 - v0.6.2b1**: Resolved dependency conflicts and improved the installation documentation.
+- **2024/07/05 - Initial open-source release**: Basic PDF-to-Markdown conversion.
+
+## Supported environments
+
+- **Python version**: 3.10-3.13
+- **Supported backends**:
+  - `pipeline`: general-purpose parsing mode
+  - `vlm-transformers`: general-purpose VLM mode
+  - `vlm-sglang-engine`: fast inference (requires an `sglang` environment)
+  - `vlm-sglang-client`: client mode (requires a connection to `sglang-server`)
+
+## FAQ
+
+### 1. How do I switch the model source?
+
+Switch the model source with an environment variable or a command-line option:
+
+```bash
+export MINERU_MODEL_SOURCE=modelscope
+```
+
+Or use the command-line option:
+
+```bash
+mineru -p ./example.pdf -o ./output/ --source modelscope
+```
+
+### 2. How do I enable OCR?
+
+If the document is an image-based PDF, enable OCR:
+
+```bash
+mineru -p ./example.pdf -o ./output/ --method ocr
+```
+
+### 3. Is Docker deployment recommended?
+
+Yes. Docker deployment gives you the same runtime environment on every platform and simplifies dependency management.
+
+### 4. How do I use MinerU on older Linux systems?
+
+MinerU provides a `pipeline_old_linux` mode for older Linux systems:
+
+```bash
+uv pip install "mineru[pipeline_old_linux]"
+```
+
+## Related projects
+
+MinerU includes several sub-projects for different usage scenarios:
+
+- **multi_gpu_v2**: multi-GPU parallel processing based on LitServe.
+- **mcp**: a document-to-Markdown service based on FastMCP.
+
+For more information, see the [project list](https://gitee.com/open-data-lab/MinerU/projects).
+
+## License information
+
+MinerU follows the [MinerU Contributor License Agreement](MinerU_CLA.md). All contributors must sign it to ensure the legality and sustainability of the open-source community.
+
+## Thanks to contributors
+
+Thanks to all the developers and testers who have contributed to MinerU. Your work keeps the project improving.
+
+## Citation
+
+If you use MinerU in research or a product, please cite the following (where applicable):
+
+> MinerU: A Practical Tool for PDF to Markdown Conversion, OpenDataLab. https://gitee.com/open-data-lab/MinerU
+
+For detailed citation formats, see the project documentation.
+
+## Star history
+
+The project's star count reflects the community's continued support. You can view the star history on the [Gitee project page](https://gitee.com/open-data-lab/MinerU).
+
+## Contact and support
+
+- **Official website**: [https://mineru.net](https://mineru.net)
+- **Source repository**: [https://gitee.com/open-data-lab/MinerU](https://gitee.com/open-data-lab/MinerU)
+- **Community support**: If you run into a problem, please open an issue at [Gitee Issues](https://gitee.com/open-data-lab/MinerU/issues).
\ No newline at end of file
-- 
Gitee
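
The options documented in the patched README (`-p`, `-o`, `-b`) can be combined to process a whole directory of PDFs. The following is a minimal sketch, assuming the `mineru` CLI is installed and on `PATH`; `convert_all` is a hypothetical helper name introduced here for illustration and is not part of MinerU.

```shell
#!/usr/bin/env bash
# Sketch: run the `mineru` CLI once per PDF found in a directory.
# `convert_all` is a hypothetical helper, not a MinerU command.
# Usage: convert_all <input_dir> <output_dir> [backend]
convert_all() {
  local in_dir="$1" out_dir="$2" backend="${3:-pipeline}"
  local pdf
  mkdir -p "$out_dir" || return 1
  for pdf in "$in_dir"/*.pdf; do
    # If the glob matched nothing, the literal pattern is left; skip it.
    [ -e "$pdf" ] || continue
    # Only flags listed in the README are used: -p, -o, -b.
    mineru -p "$pdf" -o "$out_dir" -b "$backend" || echo "failed: $pdf" >&2
  done
}
```

Calling `convert_all ./pdfs ./output` would use the `pipeline` backend; a third argument such as `vlm-transformers` selects another backend from the README's list. A failed file is reported on stderr and the loop continues with the next one.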