Choosing an extraction model
- Updated: 2024/10/03
Document Automation supports several extraction providers to cover a wide range of document processing use cases. To determine which extraction provider to use, you might need to benchmark different providers or combine more than one provider to address a particular use case.
For example, for loan application packets, you might use the Automation Anywhere extraction provider to extract data from W-2 forms and bank statements, and the Google Document AI extraction provider to extract data from invoices and identity documents. In such a scenario, using only one of the extraction providers does not provide complete coverage.
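The loan-packet scenario can be pictured as a small routing step that groups each document in a packet by the provider chosen for it. The following is a minimal sketch only: the `ROUTING` table, the document-kind keys, and the `group_by_provider` function are hypothetical names used for illustration and are not part of Document Automation.

```python
from collections import defaultdict

# Hypothetical routing table that mirrors the loan-packet example.
# Keys and provider labels are illustrative, not product identifiers.
ROUTING = {
    "w2_form": "Automation Anywhere",
    "bank_statement": "Automation Anywhere",
    "invoice": "Google Document AI",
    "identity_document": "Google Document AI",
}


def group_by_provider(packet):
    """Group (file_name, document_kind) pairs by the provider assigned to each kind."""
    groups = defaultdict(list)
    for file_name, kind in packet:
        # Kinds without a routing entry are collected under "unassigned".
        provider = ROUTING.get(kind, "unassigned")
        groups[provider].append(file_name)
    return dict(groups)


packet = [
    ("w2.pdf", "w2_form"),
    ("statement.pdf", "bank_statement"),
    ("passport.png", "identity_document"),
]
print(group_by_provider(packet))
```

Because no single provider appears for every document kind in the table, a packet like this one ends up split across both providers, which is the coverage gap described above.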
One critical input for deciding on an extraction provider is the type of document you want to process: structured, semi-structured, or unstructured. For information about document types, see Document types.
Structured documents
For structured documents that follow a consistent structure and a clear layout, we recommend using the Standard Forms extraction model in Document Automation. This model combines optical character recognition (OCR) with a template-based model to extract key-value pairs and table data from consistently formatted documents such as forms or IDs. See Create custom models in Document Automation using Standard Forms.
Semi-structured documents
Semi-structured documents often require testing and validation of different extraction models and providers to determine the combination that will deliver the required data. Some use cases might require creating more than one learning instance with different combinations of extraction models and providers for extracting the required data from fields and tables. This model uses a combination of OCR capabilities with keyword-based extraction, regular expressions, and validation feedback to extract key-value pairs and table data from a wide range of formats.
Document type | Automation Anywhere provider | Google Document AI provider |
---|---|---|
Invoices | Yes | Yes |
Arrival Notice | Yes | No |
Bill of Lading | Yes | No |
Packing List | Yes | No |
Receipts | No | Yes |
User-defined | Yes | Yes |
Utility Bill | No | Yes |
Waybill | Yes | No |
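When planning learning instances, it can help to treat the table above as a lookup. The following is a minimal sketch that transcribes the table into a dictionary; the `SUPPORT` mapping and the `providers_for` and `needs_benchmark` helpers are hypothetical names for illustration, not a Document Automation API.

```python
# Support matrix transcribed from the table above.
SUPPORT = {
    "Invoices": {"Automation Anywhere", "Google Document AI"},
    "Arrival Notice": {"Automation Anywhere"},
    "Bill of Lading": {"Automation Anywhere"},
    "Packing List": {"Automation Anywhere"},
    "Receipts": {"Google Document AI"},
    "User-defined": {"Automation Anywhere", "Google Document AI"},
    "Utility Bill": {"Google Document AI"},
    "Waybill": {"Automation Anywhere"},
}


def providers_for(document_type):
    """Return the providers listed for a document type, sorted by name."""
    return sorted(SUPPORT.get(document_type, set()))


def needs_benchmark(document_type):
    """True when more than one provider is listed, so a comparison is worthwhile."""
    return len(providers_for(document_type)) > 1


print(providers_for("Receipts"))
print(needs_benchmark("Invoices"))
```

Document types with a single listed provider need no comparison, while types such as Invoices and User-defined are candidates for benchmarking both providers.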
Unstructured documents
For unstructured documents, such as contracts, that lack a standard format, a fixed layout, or labeled data, we recommend using the unstructured document extraction model in Document Automation. Extraction for unstructured documents relies on generative AI models that can understand semantic meaning and analyze complex document formats.