
PyMuPDF

PyMuPDF is a Python binding for the MuPDF library, which is a lightweight PDF, XPS and e-book viewer.

With version 1.23.0, PyMuPDF has added table recognition and extraction facilities to its rich set of features.

We find it tends to work well on dense tables with a relatively simple structure. It is also very fast.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| parsing_algorithm | Literal['pymupdf'] | The library used for parsing, in this case, pymupdf. | None |
| min_table_confidence | float | The minimum confidence score for a table to be extracted. | 0.75 |
| table_output_format | Literal['html\|markdown'] | The format of the extracted tables. Currently only html and markdown are supported. | 'html' |
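
The parameters listed above are passed as a plain dict, so a typo in a key or value may only surface at parse time. The following is a minimal sketch of a helper that checks the values before the dict is handed to the parser. The function name `build_pymupdf_table_args` and the assumption that confidence scores lie between 0 and 1 are illustrative and not part of openparse.

```python
from typing import Any, Dict

# Output formats listed in the parameter table above
ALLOWED_FORMATS = {"html", "markdown"}


def build_pymupdf_table_args(
    table_output_format: str = "html",
    min_table_confidence: float = 0.75,
) -> Dict[str, Any]:
    """Hypothetical helper: validate the options and return a table_args dict."""
    if table_output_format not in ALLOWED_FORMATS:
        raise ValueError(
            f"table_output_format must be one of {sorted(ALLOWED_FORMATS)}, "
            f"got {table_output_format!r}"
        )
    # Assumption: confidence scores are expressed on a 0 to 1 scale
    if not 0.0 <= min_table_confidence <= 1.0:
        raise ValueError("min_table_confidence must be between 0 and 1")
    return {
        "parsing_algorithm": "pymupdf",
        "min_table_confidence": min_table_confidence,
        "table_output_format": table_output_format,
    }


table_args = build_pymupdf_table_args(table_output_format="markdown")
```

The resulting `table_args` dict has the same shape as the one used in the example on this page.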

Example

In the following example, we parse a 10-K document and extract the tables in markdown format.

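import openparse

# Path to the PDF to parse; this is a placeholder, replace it with your own 10-K file
meta10k_path = "path/to/10-K.pdf"

# Define the parser (table_args is a dict)
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",
    }
)

# Ingest the document
parsed_10k = parser.parse(meta10k_path)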
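
Once tables are extracted as markdown, it can be convenient to turn a table string back into rows for further processing. The sketch below assumes a simple pipe-delimited table with a header line followed by a separator line, and it does not handle escaped pipes inside cells. The function name `markdown_table_to_rows` and the sample table are made up for illustration and are not part of openparse or PyMuPDF.

```python
from typing import Dict, List


def markdown_table_to_rows(table_md: str) -> List[Dict[str, str]]:
    """Convert a simple markdown pipe table into a list of row dicts."""
    lines = [ln.strip() for ln in table_md.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return []  # need at least a header line and a separator line

    def split_row(line: str) -> List[str]:
        # Drop the outer pipes, then split the remaining text into cells
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the separator row (for example |---|---|), so skip it
    return [dict(zip(header, split_row(line))) for line in lines[2:]]


# Made-up sample table, not taken from any real filing
sample = """
| Item | Value |
| --- | --- |
| A | 1 |
| B | 2 |
"""
rows = markdown_table_to_rows(sample)
```

For anything beyond simple tables, setting `table_output_format` to `'html'` and using an HTML parser is likely the more robust route.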