Improve table data extraction

Use the advanced training settings to train your documents and provide additional inputs to the Document Automation extraction engine to improve table data extraction.

After extracting the document, you can use the Advanced training settings option on the validation page to set the following values:
  • Primary column: Set the primary column for row identification based on your requirements.
  • End of table indicator: Add an end of table indicator value so that the system extracts table data up to, but not including, that value.
  • Header labels: Adjust or re-map the table fields as required.
Note: This feature applies only to providers for which the Improve accuracy using validation option is available.

Prerequisites

  • The Advanced training settings option is available only if the Improve accuracy using validation option is enabled.
  • Ensure that you have the Train groups permission to provide information about header labels, the end of table indicator, and the primary column used for row detection.
  • There can be only one primary column.
  • The end of table indicator is a text system-identified region (SIR).

Procedure

  1. Process a document and navigate to the validation page.
  2. Click Advanced training settings.

    Advanced training settings option in validator page
  3. Train your document to set the following values:
    1. Set the user-defined primary column for row identification.

      Setting primary column using advanced training settings

      After you specify this value for the first time, it is filled in automatically the next time you process this document or documents of a similar type.

      To clear the automatically updated value, click the drop-down menu and select the empty value.

    2. Specify the end of table indicator value.

      Specifying end of table indicator for extracting data excluding the EoT text

      After you specify this value for the first time, it is filled in automatically the next time you process this document or documents of a similar type, even when the indicator appears at a different location on the document.

      If a document does not contain this value, the field is still updated automatically. However, because the value is missing, there is no corresponding system-identified region (SIR) on the document.

      To clear the automatically updated value, click the close button next to the value in the end of table indicator field or on the selection box of the value on the document.

    3. Click the required column and specify the required header name.

      Changing header value of the columns
  4. Click Submit and re-process the document.
    Note: You must click Submit to save these settings so that they take effect when the document is reprocessed.
    Based on the specified advanced training settings, the document is reprocessed and is either sent to the validator again if any fields still require validation, or its data is extracted to the Success folder as a CSV file.
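After reprocessing, it can be useful to confirm that the CSV output reflects the new settings. The following is a minimal sketch using only the Python standard library. The file name, the column names, and the helper load_table_rows are hypothetical, and the layout of the actual CSV written to the Success folder may differ.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def load_table_rows(csv_path: Path) -> List[Dict[str, str]]:
    """Read a CSV file into a list of {header: cell} dictionaries."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for a file from the Success folder (hypothetical name and columns).
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "invoice_table.csv"
    sample.write_text(
        "Item number,Unit Price\n1001,10.00\n1002,25.00\n", encoding="utf-8"
    )
    rows = load_table_rows(sample)

# Inspect the headers and row count to check the effect of the settings.
headers = list(rows[0].keys())
print(headers, len(rows))
```

Pointing load_table_rows at a real output file lets you check the header names and the number of rows against what you expected from the training settings.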

Primary column

For example, suppose that after extraction, the multi-line table data from the Item number column appears in a single row, but you want it in separate rows. In such cases, you can set Item number as the primary column to improve table extraction. For more details, see Example of setting primary column using advanced training settings.
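The idea of a primary column can be illustrated with a small sketch. This is not the Document Automation engine's algorithm, only an assumed simplification: each non-empty value in the primary column starts a new row, and lines without a primary value are treated as continuations of the previous row. The function name and sample data are hypothetical.

```python
from typing import Dict, List


def group_rows_by_primary_column(
    lines: List[Dict[str, str]], primary: str
) -> List[Dict[str, str]]:
    """Build logical rows: a non-empty primary cell starts a new row."""
    rows: List[Dict[str, str]] = []
    for line in lines:
        if line.get(primary, "").strip() or not rows:
            # A value in the primary column marks the start of a new row.
            rows.append(dict(line))
        else:
            # No primary value: append the text to the current row's cells.
            current = rows[-1]
            for column, text in line.items():
                if text.strip():
                    merged = current.get(column, "") + " " + text.strip()
                    current[column] = merged.strip()
    return rows


# Hypothetical lines of a table as they appear on the page.
lines = [
    {"Item number": "1001", "Description": "Blue widget", "Qty": "2"},
    {"Item number": "", "Description": "with mounting kit", "Qty": ""},
    {"Item number": "1002", "Description": "Red widget", "Qty": "5"},
]
rows = group_rows_by_primary_column(lines, "Item number")
for row in rows:
    print(row)
```

With Item number as the primary column, the three lines become two rows, one per item number, and the wrapped description text stays with the item it belongs to.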

End of table indicator

For example, suppose that processing a document extracts the entire table, whereas you want row data only up to Total payable. In such cases, you can specify the End of table indicator value so that table data up to that value (excluding the End of table indicator value itself) is extracted and no further row data is extracted.
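The effect of an end of table indicator can be sketched as a simple cutoff. The example below is an illustration under stated assumptions, not the product's implementation: rows are kept until a cell contains the indicator text, the match is assumed to be case-insensitive, and the row holding the indicator is excluded. The function name and sample rows are hypothetical.

```python
from typing import Dict, List


def rows_until_indicator(
    rows: List[Dict[str, str]], indicator: str
) -> List[Dict[str, str]]:
    """Keep rows up to, but not including, the first row with the indicator."""
    kept: List[Dict[str, str]] = []
    needle = indicator.strip().lower()
    for row in rows:
        # Stop at the first row in which any cell contains the indicator text.
        if any(needle in cell.lower() for cell in row.values()):
            break
        kept.append(row)
    return kept


# Hypothetical extraction result that runs past the end of the table.
extracted = [
    {"Description": "Blue widget", "Amount": "20.00"},
    {"Description": "Red widget", "Amount": "50.00"},
    {"Description": "Total payable", "Amount": "70.00"},
    {"Description": "Bank details", "Amount": ""},
]
table = rows_until_indicator(extracted, "Total payable")
for row in table:
    print(row)
```

Only the two item rows remain. The Total payable row and everything after it are dropped.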

Header label

When there is a label mismatch in table data, for example, the extracted header label is Unit Price but you want the header label to be Price, you can change the header label.

Another use case is to re-map all values of Unit Price, or to change the header label along with the column data. You can use auto-fill to expedite this re-mapping. For example, after extraction, the Price column from the learning instance is extracted as Extended Price but you want the header label to be Unit Price along with its column data. In such cases, change the Extended Price header label to Unit Price, and then select and re-map all the cell values from the Unit Price column.


Changing header label to get the required header along with column data
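Changing a header label while keeping the column data can be pictured as renaming a key in each extracted row. The sketch below is illustrative only and does not use any Document Automation API. The function rename_header and the sample rows are hypothetical.

```python
from typing import Dict, List


def rename_header(
    rows: List[Dict[str, str]], old: str, new: str
) -> List[Dict[str, str]]:
    """Return copies of the rows with header `old` relabeled as `new`."""
    renamed: List[Dict[str, str]] = []
    for row in rows:
        if old in row and new in row:
            # Refuse to overwrite an existing column with the same label.
            raise ValueError(f"Column {new!r} already exists")
        renamed.append(
            {(new if column == old else column): value
             for column, value in row.items()}
        )
    return renamed


# Hypothetical rows in which the price column was labeled Extended Price.
rows = [
    {"Item number": "1001", "Extended Price": "10.00"},
    {"Item number": "1002", "Extended Price": "25.00"},
]
relabeled = rename_header(rows, "Extended Price", "Unit Price")
for row in relabeled:
    print(row)
```

The cell values are unchanged and only the label differs, which matches the simple relabeling case. Re-mapping individual cell values on the validation page is a separate manual step that this sketch does not model.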
The following video shows an example of setting the Item number as the primary column and extracting the data in a separate row instead of a single cell.