Set up rules for classifying documents or pages

This topic describes how to set up rules for classifying documents or pages.

Understanding rules and their usage

A rule is used to determine the category to which a document should be assigned. In the following example, the rule specifies the phrase Annexure. When the rule is applied to the document text and that phrase is found, the category associated with the rule is assigned a high score, such as 80, indicating a strong match.

[
{
"DocumentTypeID": 0,
"Location": 0,
"Distance": 1,
"Score": 80,
"KBGuid": "00000000-0000-0000-0000-000000000000",
"IsEnabled": true,
"ExpectExactSequence": false,
"TextRulePhrases": [
{
"Text": "Annexure",
"IsNegativePhrase": false,
"PhraseType": 1
}
]
}
]

Rules are useful when additional guidance is needed to improve the accuracy of a classification model in determining the most relevant document category. While it is technically possible to do all classification using rules, this is not a best practice: managing the rule configuration becomes a significant overhead over time, especially when dealing with a large number of categories.

Example of a rule file

A rule file is a JSON file that follows the schema shown below. A single <filename>.json file can contain multiple rules. For a document to be considered during the training process for classification, it must be placed in a designated training folder category, for example, C:\Invoice\Vendor1.

[
{
"DocumentTypeID": 0,
"Location": 1,
"Distance": 3,
"Score": 90,
"KBGuid": "00000000-0000-0000-0000-000000000000",
"IsEnabled": true,
"ExpectExactSequence": true,
"TextRulePhrases": [
{
"Text": "Annexure",
"IsNegativePhrase": false,
"PhraseType": 1
},
{
"Text": "Terms & Conditions",
"IsNegativePhrase": false,
"PhraseType": 1
},
{
"Text": "Payment Terms",
"IsNegativePhrase": false,
"PhraseType": 1
}
]
},
{
"DocumentTypeID": 2,
"Location": 2,
"Distance": 1,
"Score": 95,
"KBGuid": "00000000-0000-0000-0000-000000000000",
"IsEnabled": true,
"ExpectExactSequence": false,
"TextRulePhrases": [
{
"Text": "Addendum",
"IsNegativePhrase": true,
"PhraseType": 5
}
]
}
]
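As a rough illustration, a rule file can be sanity-checked before use with a short script. The following Python sketch is not part of the product; the function name validate_rules and the set of checks are assumptions based only on the value ranges listed in this topic (Location and Distance from 0 to 3, Score from -100 to 100, and the documented PhraseType values 1, 5, and 6).

```python
import json

# Allowed values as documented in this topic (assumption: no other values exist).
ALLOWED_LOCATIONS = {0, 1, 2, 3}
ALLOWED_DISTANCES = {0, 1, 2, 3}
ALLOWED_PHRASE_TYPES = {1, 5, 6}


def _is_int_in(value, allowed):
    """True if value is a real integer (not a bool) contained in allowed."""
    return isinstance(value, int) and not isinstance(value, bool) and value in allowed


def validate_rules(rules):
    """Hypothetical helper: return a list of problems found in a parsed rule file."""
    if not isinstance(rules, list):
        return ["top-level value must be a JSON array of rules"]
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: must be a JSON object")
            continue
        if not _is_int_in(rule.get("Location"), ALLOWED_LOCATIONS):
            problems.append(f"rule {i}: Location must be 0, 1, 2, or 3")
        if not _is_int_in(rule.get("Distance"), ALLOWED_DISTANCES):
            problems.append(f"rule {i}: Distance must be 0, 1, 2, or 3")
        score = rule.get("Score")
        if not _is_int_in(score, range(-100, 101)):
            problems.append(f"rule {i}: Score must be between -100 and 100")
        if not isinstance(rule.get("IsEnabled"), bool):
            problems.append(f"rule {i}: IsEnabled must be true or false")
        phrases = rule.get("TextRulePhrases")
        if not isinstance(phrases, list) or not phrases:
            problems.append(f"rule {i}: TextRulePhrases needs at least one phrase")
            continue
        for j, phrase in enumerate(phrases):
            if not isinstance(phrase, dict) or not phrase.get("Text"):
                problems.append(f"rule {i}, phrase {j}: Text is missing or empty")
            elif not _is_int_in(phrase.get("PhraseType"), ALLOWED_PHRASE_TYPES):
                problems.append(f"rule {i}, phrase {j}: PhraseType must be 1, 5, or 6")
    return problems


SAMPLE = """
[
  {
    "DocumentTypeID": 0, "Location": 2, "Distance": 1, "Score": 95,
    "KBGuid": "00000000-0000-0000-0000-000000000000",
    "IsEnabled": true, "ExpectExactSequence": false,
    "TextRulePhrases": [
      {"Text": "Addendum", "IsNegativePhrase": true, "PhraseType": 5}
    ]
  }
]
"""

if __name__ == "__main__":
    for problem in validate_rules(json.loads(SAMPLE)) or ["no problems found"]:
        print(problem)
```

A malformed file, such as one with a missing closing bracket, fails earlier in json.loads, so the structural check and the value check complement each other.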

Configurable properties of a rule file

Configuration Description
DocumentTypeID Currently, this field is not supported. For any rule being set up, keep the value as 0.
Location
This configuration specifies the location in the document text to which the rule applies. The values can be 0, 1, 2, or 3.
  • "Location": 0 Any Location: There are no restrictions on where the phrase is found; it can be anywhere in the document.

  • "Location": 1 First Page: The phrase must be found on the first page of the document.

  • "Location": 2 Inside Caption: The phrase must be found on inside caption text of the document.
  • "Location": 3 Last Page: The phrase must be found on the last page of the document. If only a few pages are provided to the text rule classifier, the last page is the one that will be passed to the classifier..
Distance
This configuration specifies the allowed distance between phrases when the lookup is done on the document text. The rule matches only if the phrases are within the specified distance. The values can be 0, 1, 2, or 3.
  • "Distance": 0 Same Text Line: All phrases must be in the same text line.
  • "Distance": 1 Next Text Line: All phrases must be in the same text line or have a maximum of one line-break in between each other.
  • "Distance": 2 Same Paragraph: All phrases must be within the same paragraph text.
  • "Distance": 3 Same Page: All phrases must be on the same page.
Score When a rule matches, the score is assigned to the category (or training folder) associated with that rule. The score value can range from -100 to 100.
KBGuid Currently, this field is not supported. For any rule being set up, keep the value as 00000000-0000-0000-0000-000000000000.
IsEnabled Enables or disables the rule when set to true or false, respectively.
ExpectExactSequence
When a rule contains multiple phrases, this configuration specifies whether the phrases must appear in the listed order. For example, if set to true as in the example rule file, "Text": "Annexure", "Text": "Terms & Conditions", and "Text": "Payment Terms" must be present in the document text in this order for the rule to match. Other text can exist between these phrases, but the phrases must follow one another in the specified order.
Note: Unless it is very clear that the expected sequence will follow a specific pattern, it is recommended to keep this configuration as false.
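The ordering requirement of ExpectExactSequence can be pictured with a small Python sketch. This is an illustration only, not the classifier's implementation: the function name phrases_in_order is hypothetical, and the sketch uses a plain case-insensitive substring search, ignoring the Distance, Location, and PhraseType settings.

```python
def phrases_in_order(text, phrases):
    """Return True if every phrase occurs in text, each one after the previous.

    Other text may appear between the phrases; only their order matters.
    """
    haystack = text.lower()
    position = 0  # search for the next phrase only after the previous match
    for phrase in phrases:
        needle = phrase.lower()
        found = haystack.find(needle, position)
        if found == -1:
            return False
        position = found + len(needle)
    return True


if __name__ == "__main__":
    phrases = ["Annexure", "Terms & Conditions", "Payment Terms"]
    ordered = "Annexure A. See the Terms & Conditions, then the Payment Terms."
    shuffled = "Payment Terms come first, then Annexure and Terms & Conditions."
    print(phrases_in_order(ordered, phrases))
    print(phrases_in_order(shuffled, phrases))
```

In the second text all three phrases are present, but not in the listed order, so a rule with ExpectExactSequence set to true would not be expected to match.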

TextRulePhrases

TextRulePhrases contains all the phrase text values that need to be looked up against the document text. It can have one or more phrase text values.

Text Specifies the phrase text value that needs to be looked up against the document text.
IsNegativePhrase Specifies whether the lookup is a negative-phrase lookup. When set to true, as in the example, "Text": "Addendum" must not be present in the document text for the rule to match.
PhraseType Specifies the type of match that is used when the phrase text value is looked up against the document text.

"PhraseType": 1 Fuzzy Matching allows you to use some tolerance in word matching. A phrase is matched if the Levenshtein distance is >= 80%. For a 5 character word this means 1 character can be different.

"PhraseType": 5 Exact Matching requires an exact match of the phrase but still ignores character casing and also includes some filtering of punctuations.

"PhraseType": 6 RegularExpression allows you to define a regular expression that is then searched in the original text. No text pre-processing or filtering of punctuation characters is done for this type of matching.