Data extraction in Document Automation
Understand how documents are processed in Document Automation.
Improving extraction accuracy through validation
The following graphic provides a visual overview of the process by which learning instances continuously receive feedback from validation:
- An uploaded document passes through the extraction engine.
- If the learning instance successfully extracts the data, the document is added to the straight-through processing (STP) count and the extracted values are downloaded to a file in the Success folder.
- If the learning instance cannot extract the data, the system evaluates whether the document contains an unfamiliar layout.
- If the learning instance does not recognize the document layout (new layout), the document is sent for manual validation where the user "teaches" the learning instance how to extract the data by setting the extraction region.
- The extracted values are downloaded to a file in the Success folder, and the changes are collected in a feedback file, which is sent to the feedback database.
Note:
- Feedback is only collected when the user changes the extraction region. If the user manually inputs text, the system does not collect feedback.
- The feedback file only contains data on the field location to improve extraction accuracy for subsequent documents.
If the learning instance recognizes the cluster, it retrieves previous feedback from the feedback database and uses it to extract data.
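The routing and feedback rules described above can be summarized in a short sketch. This is only an illustration of the decision logic, not Document Automation's implementation: the function names, route labels, and the shape of the feedback record are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical bounding box for an extraction region: (left, top, right, bottom).
Region = Tuple[int, int, int, int]


def route_document(extracted_ok: bool, layout_known: bool) -> str:
    """Pick the path a document takes after the extraction engine runs.

    The route labels are illustrative names, not product terminology.
    """
    if extracted_ok:
        # Counts toward straight-through processing (STP).
        return "stp"
    if not layout_known:
        # New layout: a user sets the extraction region during validation.
        return "manual_validation"
    # Known layout: previous feedback is retrieved and reused.
    return "reuse_feedback"


def collect_feedback(
    region_changes: Dict[str, Region],
    typed_values: Dict[str, str],
) -> Optional[Dict[str, Dict[str, Region]]]:
    """Build a feedback record from one validation session.

    Only changes to the extraction region produce feedback. Text typed
    by the user (typed_values) is deliberately ignored, and the record
    stores field locations only.
    """
    if not region_changes:
        return None
    return {name: {"region": box} for name, box in region_changes.items()}
```

For example, `collect_feedback({}, {"total": "42.00"})` returns `None` in this sketch, because typing a value without changing the extraction region produces no feedback.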
How Document Automation identifies new layouts
Document Automation extraction is based on object detection. During document processing, the extraction engine identifies objects, that is, key-value pairs consisting of a field and its associated value. The engine creates a "fingerprint" of the document, which stores the sequence of the objects and each object's location in the document.
When a document is processed, if the engine recognizes the keys and their locations, the document is classified and extracted based on that existing fingerprint. Otherwise, the engine saves a new fingerprint of the keys and their locations.
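As a rough illustration of the fingerprint idea, the sketch below models a fingerprint as an ordered list of keys with their positions and compares a new document against the stored ones. The data structure, the position tolerance, and the `layout-N` identifiers are assumptions made for this example; the actual matching method used by the engine is not described here.

```python
from typing import Dict, List, Tuple

# Assumed representation: an ordered list of (key, (x, y)) pairs, with
# positions normalized to the 0..1 range of the page.
Fingerprint = List[Tuple[str, Tuple[float, float]]]


def matches(a: Fingerprint, b: Fingerprint, tolerance: float = 0.05) -> bool:
    """Return True if both fingerprints have the same key sequence and
    every key sits within `tolerance` of its counterpart on both axes."""
    if [key for key, _ in a] != [key for key, _ in b]:
        return False
    return all(
        abs(xa - xb) <= tolerance and abs(ya - yb) <= tolerance
        for (_, (xa, ya)), (_, (xb, yb)) in zip(a, b)
    )


def classify(
    doc_fp: Fingerprint,
    known: Dict[str, Fingerprint],
    tolerance: float = 0.05,
) -> Tuple[str, bool]:
    """Return (layout_id, is_new).

    If a stored fingerprint matches, reuse its id. Otherwise save the
    document's fingerprint under a new, hypothetical id.
    """
    for layout_id, stored in known.items():
        if matches(doc_fp, stored, tolerance):
            return layout_id, False
    new_id = f"layout-{len(known) + 1}"
    known[new_id] = doc_fp
    return new_id, True
```

In this sketch, a document whose keys appear in the same order and at nearly the same positions as a stored fingerprint is treated as a known layout, while a different key order or a large shift in position results in a new fingerprint being saved.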