Using the Extract text action in the PDF package

Extract text from a PDF file and save it as a text file by using the Extract text action.

Important:
  • If the correct fonts are not embedded in the PDF file, the Extract text action does not extract the text correctly.
  • If text that reads as a single line is split across two lines in the PDF file, the extracted text might also appear on two lines.
Note: When you extract text from a PDF that contains 20 form fields, processing time might be 30 to 40% longer than for PDFs without form fields.

Procedure

To extract text from a PDF file, perform the following steps:

  1. In the Actions palette, double-click or drag the Extract text action from the PDF package.
  2. In the PDF path, select one of the following options to specify the location of the PDF:
    • Control Room file: Enables you to select a PDF file that is available in a folder in the Control Room.
    • Desktop profile: Enables you to select a PDF file that is available on your device.
    • Variable: Enables you to specify the file variable that contains the location of the PDF file.
  3. Optional: If the PDF file is encrypted, enter the password in the User password or Owner password field.
    • User password: The password that is required to open the file.
    • Owner password: The password that allows users to perform specific operations on the encrypted PDF file.
  4. In the Text type field, select one of the following options:
    • Plain text: Extract the text and copy it to a text file.

      This works similarly to copying and pasting text from a PDF file to a text file.

    • Structured text: Preserve the original formatting of the text extracted from the PDF file.
      You can select the Reduce Data Loss option to extract the complete text while reducing the number of characters that are overlapped by other characters.
      Note: When you select this option, the extracted text might contain extra space characters. To remove them, use actions such as Replace or Trim from the String package.
  5. In the Page range field, select one of the following options:
    • All pages: Extracts text from all the pages in the PDF file.
    • Pages: Enables you to enter the page numbers of the pages from which you want to extract text.
  6. In the Export data to text file field, specify a name and location for the text file.
    Note: You must include the .txt extension in the name of the text file. For example, if the file name is June_Quarter_report, enter June_Quarter_report.txt.
  7. Select the Overwrite files with the same name check box to overwrite existing files with the same name.
    Note: If this option is not selected and a file with the same name already exists at the specified location, the bot fails.
  8. Optional: From the Assign PDF properties to a dictionary variable list, select a dictionary variable to hold the file properties.
  9. Click Save.
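
The note in step 4 says that structured extraction can leave extra space characters, and suggests the Replace or Trim actions from the String package. As a rough illustration of what such a Replace-then-Trim pass does to the exported text, here is a minimal Python sketch. It runs outside the Control Room and does not use any Automation Anywhere API; the function name clean_extracted_text and the sample text are hypothetical, chosen only for this example.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Collapse runs of spaces/tabs and trim each line, keeping line breaks.

    Hypothetical helper for illustration; not part of the PDF or String package.
    """
    cleaned_lines = []
    for line in text.splitlines():
        # "Replace" step: collapse any run of spaces or tabs into one space.
        collapsed = re.sub(r"[ \t]+", " ", line)
        # "Trim" step: drop leading and trailing whitespace on the line.
        cleaned_lines.append(collapsed.strip())
    return "\n".join(cleaned_lines)


# Made-up sample resembling structured-text output with padding.
sample = "Invoice   No:    1042  \n  Total:\t\t 318.50"
print(clean_extracted_text(sample))
```

Working line by line keeps the line structure of the extracted text intact and changes only the spacing within each line. Collapsing every run of spaces also removes column alignment, so this kind of cleanup is only suitable when the original spacing carries no meaning.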