Unitable
Unitable is a deep learning approach to table detection and extraction. It achieves state-of-the-art (SOTA) performance on four of the largest table recognition (TR) datasets. If table accuracy is your primary concern, this is the method to use.
Full credit goes to ShengYun (Anthony) Peng and his team for open-sourcing their research in a reproducible manner. You can find the original repository with full training code here. We use a small subset of their package directly, along with their pre-trained weights.
Installation
ML Dependencies Required
To use this method, you will need to install the ML dependencies by running pip install "openparse[ml]".
Once you have installed openparse with pip, you will need to download the model weights separately by running the following command.
This downloads the weights, which are about 1.5 GB in size.
Parameters
Name | Type | Description | Default |
---|---|---|---|
parsing_algorithm | Literal["unitable"] | The library used for parsing, in this case, unitable. | None |
min_table_confidence | float | The minimum confidence score for a table to be extracted. | 0.75 |
table_output_format | Literal["html"] | The format of the extracted tables. Currently only html is supported. | None |
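To make the effect of min_table_confidence concrete, here is a minimal sketch of threshold filtering. The Detection class and filter_by_confidence function are hypothetical names used only for illustration and are not part of openparse; whether the library compares with >= or > internally is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one detected table region."""
    page: int
    score: float  # detector confidence in [0, 1]


def filter_by_confidence(detections, min_table_confidence=0.75):
    """Keep detections whose score meets the threshold (>= is an assumption)."""
    return [d for d in detections if d.score >= min_table_confidence]


# Made-up scores for illustration only.
detections = [Detection(0, 0.97), Detection(1, 0.78), Detection(2, 0.40)]

kept_default = filter_by_confidence(detections)       # default threshold of 0.75
kept_strict = filter_by_confidence(detections, 0.8)   # stricter threshold
print(len(kept_default), len(kept_strict))
```

Raising the threshold trades recall for precision: fewer spurious regions are passed on to the table extractor, but a borderline table such as the one scored 0.78 above is dropped.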
Example
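import openparse

# Placeholder path; point this at your own PDF.
doc_with_tables_path = "path/to/document.pdf"

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.8,
    }
)
parsed_doc = parser.parse(doc_with_tables_path)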
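Since tables are returned as HTML, you may want to turn them into rows for further processing. The following is a minimal sketch using only the Python standard library. The TableRows class, the html_table_to_rows helper, and the sample string are hypothetical and not part of openparse or real unitable output; the sketch also ignores colspan and rowspan attributes.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample, not real unitable output.
sample = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>4</td></tr></table>"
)
print(html_table_to_rows(sample))
```

For tables with merged cells, a dedicated HTML library is likely a better fit than this sketch.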
Limitations
- This method is very computationally expensive. We recommend using it on a GPU.
- We currently use the table-transformers model to detect table locations. This model is not perfect and may miss some tables or crop them incorrectly. This negatively impacts the performance of unitable. We're actively looking at more robust models.
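Given the compute cost noted in the first limitation, it can help to check whether a GPU is visible before parsing a large batch of documents. The sketch below assumes the ML dependencies are built on PyTorch; pick_device is a hypothetical helper, not an openparse function, and it only reports what is available rather than configuring openparse.

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is missing, so the ML extras are probably not installed.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this reports "cpu", expect unitable parsing to be noticeably slower, especially on documents with many tables.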