Using the unstructured document type

You can use the unstructured document type to extract data from unstructured documents that lack a standard format or fixed layout, or that contain data without labels.

The model combines OCR with natural language processing (NLP) and generative AI technologies to perform semantic analysis and to extract key-value pairs and table data from unstructured documents.

The following are some examples of unstructured documents:

  • Legal documents
  • Correspondence (including emails)
  • Reports

This model lets you select one of the following generative AI providers:

OpenAI
Using this option provides the following capabilities:
  • Handle a wide range of tasks
  • Handle documents in both English and other languages
  • Support multimodal capabilities
  • Fine-tuning capabilities for certain models
Anthropic
Using this option provides the following capabilities:
  • Efficient processing of large, unstructured documents
  • Handle documents in both English and other languages
  • Faster processing of documents with better data extraction accuracy

Generative AI providers offer generalized intelligence, which means that you do not need to train the learning instance or model separately for each document type. Instead, when you configure a learning instance, optimize the query prompts to define which data to extract from documents and how to return it. For example, you can define the following sample prompts to retrieve specific data from contracts and agreements:

  • What is the effective date of the contract?
  • What is the reference number?
  • What is the effective date of the contract? Return the answer in MM/DD/YYYY format.
  • What is the reference number? It should follow this pattern AAA-12345.
  • Are there unpaid taxes as of the effective date of the agreement? Reply yes or no.
    Note: If a query prompt is empty, the extraction result for that field is also empty. You might want to leave a prompt empty as a placeholder when your workflow involves post-processing data, for example, when you retrieve data from a database and use it in the field for comparison.
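
Because the prompts above ask for specific return formats, a post-processing step can check the extracted values against those formats before the data is used. The following Python code is a minimal sketch of such a check and is not part of the product: the field names, the dictionary of extracted values, and the reading of AAA-12345 as three uppercase letters, a hyphen, and five digits are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical field names mapped to query prompts from the list above.
# An empty prompt marks a placeholder field that a later step fills in.
PROMPTS = {
    "effective_date": "What is the effective date of the contract? "
                      "Return the answer in MM/DD/YYYY format.",
    "reference_number": "What is the reference number? "
                        "It should follow this pattern AAA-12345.",
    "unpaid_taxes": "Are there unpaid taxes as of the effective date "
                    "of the agreement? Reply yes or no.",
    "db_reference_number": "",
}

# Assumption: AAA-12345 means three uppercase letters, a hyphen, five digits.
REFERENCE_PATTERN = re.compile(r"[A-Z]{3}-\d{5}")
DATE_SHAPE = re.compile(r"\d{2}/\d{2}/\d{4}")


def is_valid_date(value):
    """Accept only zero-padded MM/DD/YYYY strings that are real dates."""
    if DATE_SHAPE.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, "%m/%d/%Y")
    except ValueError:
        return False
    return True


VALIDATORS = {
    "effective_date": is_valid_date,
    "reference_number": lambda v: REFERENCE_PATTERN.fullmatch(v) is not None,
    "unpaid_taxes": lambda v: v.lower() in {"yes", "no"},
}


def find_problems(extracted):
    """Return the names of fields whose values do not match the requested format."""
    problems = []
    for name, prompt in PROMPTS.items():
        if not prompt:
            continue  # placeholder field: an empty result is expected
        value = extracted.get(name, "").strip()
        if not VALIDATORS[name](value):
            problems.append(name)
    return problems
```

In this sketch, a result set that matches every requested format produces an empty list, and fields with an empty prompt are skipped because their output is expected to be empty. Any field name that the function returns can be routed to manual review or to a revised prompt.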

System-defined form and table fields are not available because unstructured documents do not use a standard format or fixed layout and might contain data without labels. You must define all the form and table fields that require data extraction when you configure a learning instance.

If you want to use private cloud instances of generative AI models on Microsoft Azure, AWS, or GCP, you can connect to models in your private cloud. See Connect your own generative AI services.

Note: The validation feedback option is not available in this model.