Choosing an extraction model

Document Automation offers multiple extraction providers to cover a wide range of document processing use cases. To determine which extraction provider to use, you might need to benchmark several providers, or combine more than one provider to address a particular use case.

For example, for loan application packets, you might use the Automation Anywhere extraction provider to extract data from W-2 forms and bank statements, and the Google Document AI extraction provider to extract data from invoices and identity documents. In such a scenario, using only one of the extraction providers does not provide complete coverage.

One critical input for deciding on an extraction provider is the type of document you want to process: structured, semi-structured, or unstructured. For information about document types, see Document types.

Structured documents

For structured documents that follow a consistent structure and a clear layout, we recommend the Standard Forms extraction model in Document Automation. This model combines optical character recognition (OCR) with a template-based model to extract key-value pairs and table data from consistently formatted documents such as forms or IDs. See Create custom models in Document Automation using Standard Forms.

Semi-structured documents

Semi-structured documents often require testing and validating different extraction models and providers to determine the combination that delivers the required data. Some use cases might require more than one learning instance, each with a different combination of extraction models and providers, to extract the required data from fields and tables. Extraction models for semi-structured documents combine OCR with keyword-based extraction, regular expressions, and validation feedback to extract key-value pairs and table data from a wide range of formats.

The following table lists the pre-trained extraction models and providers available in Document Automation for processing semi-structured documents. The availability of extraction models depends on the language you select. When an extraction model supports both the Automation Anywhere and Google Document AI extraction providers, you might want to compare the two to see which is better for your use case, or use both in conjunction if that is necessary to extract all relevant data.

Note: If the model you want to use is not in the list of pre-trained extraction models, use the generic model (User-defined).
Document type      Extraction providers
                   Automation Anywhere    Google Document AI
Invoices           Yes                    Yes
Arrival Notice     Yes                    No
Bill of Lading     Yes                    No
Packing List       Yes                    No
Receipts           No                     Yes
User-defined       Yes                    Yes
Utility Bill       No                     Yes
Waybill            Yes                    No

See Using the user-defined document type.
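
The provider table above can be read as a simple lookup when planning which learning instances to create. The following Python sketch is illustrative only: the dictionary is transcribed from the table, and the names PROVIDER_SUPPORT, providers_for, and should_compare_providers are hypothetical helpers, not part of any Document Automation API. It also assumes a single language, whereas actual model availability depends on the language you select.

```python
# Hypothetical planning helper; the data is transcribed from the table above.
# None of these names exist in Document Automation itself.
PROVIDER_SUPPORT = {
    "Invoices": {"Automation Anywhere", "Google Document AI"},
    "Arrival Notice": {"Automation Anywhere"},
    "Bill of Lading": {"Automation Anywhere"},
    "Packing List": {"Automation Anywhere"},
    "Receipts": {"Google Document AI"},
    "User-defined": {"Automation Anywhere", "Google Document AI"},
    "Utility Bill": {"Google Document AI"},
    "Waybill": {"Automation Anywhere"},
}


def providers_for(document_type: str) -> list[str]:
    """Return the providers that list a model for this document type.

    Types without a pre-trained model fall back to the generic
    User-defined model, as the note above recommends.
    """
    key = document_type if document_type in PROVIDER_SUPPORT else "User-defined"
    return sorted(PROVIDER_SUPPORT[key])


def should_compare_providers(document_type: str) -> bool:
    """True when both providers are available and a benchmark is worthwhile."""
    return len(providers_for(document_type)) > 1


if __name__ == "__main__":
    for doc_type in ("Invoices", "Receipts", "Purchase Order"):
        print(doc_type, providers_for(doc_type), should_compare_providers(doc_type))
```

A document type that is absent from the table, such as the "Purchase Order" used here as an example, resolves to the User-defined entry, so both providers are candidates and a side-by-side comparison is reasonable.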

Unstructured documents

For unstructured documents, such as contracts, that lack a standard format or fixed layout, or that contain data without labels, we recommend the unstructured document extraction model in Document Automation. Extraction for unstructured documents relies on generative AI models, which can understand semantic meaning and analyze complex document formats.
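
The recommendations in this topic reduce to a mapping from document category to a starting model. The following Python sketch summarizes that mapping under stated assumptions: DocumentCategory, RECOMMENDED_MODEL, and recommend_model are hypothetical names used for illustration, and the strings are descriptive labels rather than identifiers from the product.

```python
from enum import Enum


class DocumentCategory(Enum):
    """The three document categories described in this topic."""
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Hypothetical summary of the guidance in this topic; labels are descriptive only.
RECOMMENDED_MODEL = {
    DocumentCategory.STRUCTURED: "Standard Forms",
    DocumentCategory.SEMI_STRUCTURED: "Pre-trained or User-defined model",
    DocumentCategory.UNSTRUCTURED: "Unstructured document extraction",
}


def recommend_model(category: DocumentCategory) -> str:
    """Return the recommended starting point for a document category."""
    return RECOMMENDED_MODEL[category]


if __name__ == "__main__":
    for category in DocumentCategory:
        print(f"{category.value}: {recommend_model(category)}")
```

Treat the result as a starting point only. Semi-structured use cases in particular might still require testing several models and providers before you settle on a combination.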

Note: For more flexibility, you can also integrate third-party parsers by using the Configure Parser feature, in addition to the options outlined above. See Integrate third-party parser in learning instance.