Using Train Classifier action

Use the Train Classifier action to create a model file. The Classify action uses this model file to sort input documents into the required categories.

Prerequisites

Before building the bot, collect example documents and categorize them into folders. Ensure the set of example documents meets the following requirements:

  • Has at least two categories.
  • Has a minimum of 15 pages per category (20 pages recommended).
  • Has every multi-page input PDF document split into single-page PDF documents. See Using the Split document action.

    For example, if you have a PDF document that has three pages, split it into three single-page PDF documents.

If these minimum requirements are not met, an error message is displayed during bot runtime.
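
The requirements above can be checked before the bot runs. The following is a minimal sketch, not part of the product: the function name check_training_folder is hypothetical, and it assumes that every example document is a single-page PDF file, so that the number of PDF files in a category folder equals the number of pages. It only approximates the checks that the action itself performs.

```python
from pathlib import Path

MIN_CATEGORIES = 2            # "at least two categories"
MIN_PAGES_PER_CATEGORY = 15   # minimum; 20 pages are recommended


def check_training_folder(root):
    """Return a list of problems found; an empty list means the checks passed."""
    root = Path(root)
    problems = []

    # Each immediate subfolder is treated as one category.
    categories = sorted(p for p in root.iterdir() if p.is_dir())
    if len(categories) < MIN_CATEGORIES:
        problems.append(
            f"found {len(categories)} category folder(s), "
            f"need at least {MIN_CATEGORIES}"
        )

    for category in categories:
        # Assumption: every file is a single-page PDF, so files == pages.
        pages = [
            f for f in category.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        ]
        if len(pages) < MIN_PAGES_PER_CATEGORY:
            problems.append(
                f"{category.name}: {len(pages)} page(s), "
                f"need at least {MIN_PAGES_PER_CATEGORY}"
            )
    return problems
```

Calling check_training_folder with the path of the folder that holds the category subfolders returns an empty list when both minimums are met, and one message per problem otherwise.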

Each folder has a selection of documents that are a sample of the documents that the associated learning instance will process. The Train Classifier action will read through the files in the folders, and build a model based on the documents stored inside each folder.
Note: Because ABBYY FineReader Engine OCR has been upgraded from version 12.2 to version 12.4, older .icmf files cannot be used to retrain models in Automation 360 v.24 of the Document Classifier package. If you want to add more categories or more files to your existing categories, you must create a new model.

Procedure

  1. In the Actions palette, double-click or drag the Train Classifier action from the Document Classifier package.
  2. Click Train to continue creating a new model file.
  3. Optional: If you have an existing model file, click Re-Train.
    1. Use the Training folder path field to select an existing folder path from the Desktop folder tab.
      Alternatively, click the Variable tab to manually enter an existing training folder path.
    2. Use the Existing zip path field to select the file path of the .zip file from the Control Room file or Desktop file tab.
      Alternatively, click the Variable tab to manually enter the path for the .zip file.
      Note: When you train documents, a .zip file is created, which contains .icmf, .data, and .properties files. Ensure that you upload the entire .zip file when you retrain an existing model file.
  4. Select the input folder path from Desktop folder or Variable.

    The input folder must contain subfolders whose names correspond to the categories of the documents that you want to train the classifier on. For example, if you have sales-related documents, the input folder must have subfolders such as Invoice and Purchase Order.

  5. Optional: If you select Desktop file, click Browse to change the default filepath.
  6. Enter a name for the model file in the Model name field.
  7. Use the Model output path field to select the directory for the output model file.
  8. Optional: Configure the following ADVANCED SETTINGS:
    1. Training Optimization: Use the drop-down menu to select the type of training optimization.
      • Precision: Select this option when you want the model's classifications to be precise, even if it misses a few documents.
      • Recall: Select this option when you want the model to find all the relevant cases within a dataset.
      • F1 score: Selected by default and recommended, because it combines the training optimization of both Precision and Recall.

    2. Classification Type: Use the drop-down menu to select the features you want to include such as text, image, or both.

      Text and image is selected by default. If you select Text or Text and image, a list of supported languages is displayed in the Recognition Language drop-down menu.

    3. OCR Settings: The Extract all text blocks and Extract text from images options are enabled by default.

      With these options enabled, OCR takes more time to extract the content, but relatively lower-quality documents are also handled based on the OCR output.

  9. Click Save and Run.
    When you retrain an existing model, the previously trained data is fetched and combined with the new data generated from the text or layout features of the input documents. The machine learning model is then trained from scratch. This method saves the time needed to regenerate text or layout data for documents that were already trained. However, training the machine learning model is the computationally expensive part, so retraining is still expected to be time-consuming. If this becomes a constraint, we recommend that you create additional model files and use them for additional training and classification.
    The model is created as a .icmf file in the directory specified in the Model output path field.
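
The note in step 3 states that the training output .zip contains .icmf, .data, and .properties files and that the entire archive is needed for retraining. The following is a minimal sketch of a check you could run on an archive before selecting it in the Existing zip path field. The function name missing_model_parts is hypothetical, and the check looks only at file extensions, not at the file contents or the internal layout of the archive.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions that the note in step 3 says the training .zip contains.
REQUIRED_SUFFIXES = {".icmf", ".data", ".properties"}


def missing_model_parts(zip_path):
    """Return the required file extensions that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Member names inside a zip always use forward slashes.
        found = {
            PurePosixPath(name).suffix.lower()
            for name in archive.namelist()
        }
    return sorted(REQUIRED_SUFFIXES - found)
```

An empty list means that all three file types are present. Any extension in the returned list indicates an incomplete archive.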

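The Training Optimization options in step 8 refer to standard classification metrics. The following sketch shows the textbook definitions for a single category, to illustrate why F1 score balances the other two. It is an illustration only: this page does not describe how the Document Classifier package computes or applies these metrics internally, and the function name is hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 for one category from raw counts."""
    predicted = true_positives + false_positives   # documents assigned to the category
    actual = true_positives + false_negatives      # documents that belong to the category

    # Precision: of the documents assigned to the category, how many were right.
    precision = true_positives / predicted if predicted else 0.0
    # Recall: of the documents that belong to the category, how many were found.
    recall = true_positives / actual if actual else 0.0

    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, precision_recall_f1(8, 2, 8) gives a precision of 0.8, a recall of 0.5, and an F1 score of about 0.615. The model in this example is usually right when it assigns the category but misses half of the documents that belong to it, and the F1 score reflects both facts.
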
Next steps

After creating the model, build a bot to classify input documents. See Using Classify action.