Improve output quality using OCR confidence

Improve the output quality of the IQ Bot platform by comparing the optical character recognition (OCR) confidence of each system-identified region (SIR) to a predefined threshold.

Confidence-based validation is useful for Text, Date, and Number fields because it routes a document with questionable values to a human for review, even when the fields satisfy the set validation criteria.

Enable OCR confidence-based validation

Note: This option is applicable only if you selected Tesseract OCR when creating the learning instance.

This feature is disabled by default. To enable it, open the Settings.txt configuration file in <IQ Bot Installation Folder>\Configurations\ and set the ConfidenceThreshold property to the desired threshold value. For example, to set the character-level confidence threshold to 99, enter ConfidenceThreshold=99. The default value is 0, which means that the feature is disabled.

Note: The confidence threshold value is uniformly applicable across all the learning instances.
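
The following Python sketch illustrates how the setting described above can be interpreted. It is not part of IQ Bot: the function names are hypothetical, and it assumes that Settings.txt holds simple key=value lines, which the example ConfidenceThreshold=99 suggests but the product documentation does not spell out.

```python
def read_confidence_threshold(settings_text: str) -> int:
    """Return the ConfidenceThreshold value found in key=value text.

    Assumption: one key=value pair per line. Returns 0 when the
    property is absent, matching the documented default (disabled).
    """
    for raw_line in settings_text.splitlines():
        line = raw_line.strip()
        if "=" not in line:
            continue  # skip blank lines or lines that are not key=value
        key, _, value = line.partition("=")
        if key.strip() == "ConfidenceThreshold":
            return int(value.strip())
    return 0


def is_confidence_validation_enabled(threshold: int) -> bool:
    """A threshold of 0 means the feature is disabled."""
    return threshold > 0


# Hypothetical file contents, used only for illustration.
example_settings = "ConfidenceThreshold=99\n"
threshold = read_confidence_threshold(example_settings)
print(threshold, is_confidence_validation_enabled(threshold))
```

Because the threshold applies uniformly to all learning instances, a single value read this way would govern every document that is processed.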

How OCR confidence-based validation works

If the character-level confidence of a field's SIR is lower than the set confidence threshold, validation fails for that field, and the document fails as a result.

Note: If a field value fails a validation rule other than OCR confidence (for example, Invalid Number Format), you see the tooltip for that rule, and not the tooltip for Low confidence.

When you train a document, a confidence-based validation failure for a field appears in an orange box during preview only if no other validation errors exist for that field. Other validation errors take precedence over OCR character-level confidence validation.
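
The validation and precedence rules in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not IQ Bot's implementation: the function names are hypothetical, and the sketch assumes that a field's character-level confidence is the lowest confidence among its characters, because the documentation does not state how the per-character values are combined.

```python
from typing import List, Optional


def validate_field(
    char_confidences: List[float],
    threshold: float,
    rule_error: Optional[str] = None,
) -> Optional[str]:
    """Return the validation message for a field, or None if it passes.

    rule_error is the message from any other validation rule
    (for example, "Invalid Number Format"), if one failed.
    """
    # Other validation errors take precedence over low confidence.
    if rule_error is not None:
        return rule_error
    # A threshold of 0 means confidence-based validation is disabled.
    if threshold <= 0 or not char_confidences:
        return None
    # Assumption: the field is judged by its least confident character.
    if min(char_confidences) < threshold:
        return "Low confidence"
    return None


def document_fails(field_messages: List[Optional[str]]) -> bool:
    """A document fails if any of its fields fails validation."""
    return any(message is not None for message in field_messages)


# Hypothetical confidence values, used only for illustration.
messages = [
    validate_field([99.5, 97.0], 99),
    validate_field([99.5, 97.0], 99, "Invalid Number Format"),
    validate_field([99.5, 99.2], 99),
]
print(messages, document_fails(messages))
```

In the sketch, the second field reports only the number format error even though its confidence is also below the threshold, which mirrors the precedence described above.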

Troubleshoot: If the OCR engine is not able to identify SIRs for Chinese language PDF documents, troubleshoot the issue: