OCR - Get text
Declaration
<AMOCR OCRENGINE="text (options)" PAGESEGMODE="text (options)" IMAGE="text" TOP="number" LEFT="number" WIDTH="number" HEIGHT="number" RESULTVARIABLE="text" EXACTCOPY="YES/NO" ALLPAGE="YES/NO" PAGE="number" ENGLISH="YES/NO" TURKISH="YES/NO" SPANISH="YES/NO" PORTUGUESE="YES/NO" FRENCH="YES/NO" RUSSIAN="YES/NO" DUTCH="YES/NO" GERMAN="YES/NO" ITALIAN="YES/NO" ICR="YES/NO" INVERT="YES/NO" />
Description
Retrieves text from an image or PDF file and populates a variable with the results.
Practical Usage
Commonly used as a way to gather text from PDF files or supported image file formats (JPG, PNG, TIFF, GIF, and BMP).
Parameters
General
| Property | Type | Required | Default | Markup | Description |
|---|---|---|---|---|---|
| OCR Engine | Text (options) | Yes | Tesseract | OCRENGINE="tesseract" | Specifies the OCR engine to use to retrieve text from an image or PDF file. The available options include Tesseract (default) and Legacy. |
| Page Segmentation Mode | Text (options) | Yes, if the OCR Engine parameter is set to Tesseract | Single Block | | Specifies the page segmentation mode used to scan the image or PDF file. For a more accurate scan, select the option that best matches the position of the text to retrieve. This parameter is only available, and required, if the OCR Engine parameter is set to Tesseract. The available options are: |
| Image | Text | Yes | (Empty) | IMAGE="C:\temp\Image.jpg" | Specifies the path and file name of the image or PDF file to use with this activity. Supported image formats are JPG, PNG, TIFF, GIF, and BMP. NOTE: Although a variety of formats are supported, image data with lossless compression, such as TIFF, is recommended. |
| Entire image/page | --- | --- | --- | --- | If selected (default), this activity searches the entire image/page for text. NOTE: This parameter does not contain markup and is only displayed in visual mode for task construction and configuration purposes. |
| Specified region (improves accuracy) | --- | --- | --- | --- | If selected, this activity only searches a specified region of the image or PDF file for text. Click Pick Region to open a dialog box that allows you to select an image region. Depending on the OCR Engine parameter's current setting, see Pick Region Dialog Box (Tesseract OCR Engine) or Pick Region Dialog Box (Legacy OCR Engine) for more details. |
| Top | Number | Yes, if Specified region is selected | (Empty) | TOP="223" | The topmost pixel coordinate of the specified region in the image or PDF file. This parameter is only available if the Specified region parameter is selected. |
| Left | Number | Yes, if Specified region is selected | (Empty) | LEFT="115" | The leftmost pixel coordinate of the specified region in the image or PDF file. This parameter is only available if the Specified region parameter is selected. |
| Width | Number | Yes, if Specified region is selected | (Empty) | WIDTH="647" | The total pixel width of the specified region in the image or PDF file. This parameter is only available if the Specified region parameter is selected. |
| Height | Number | Yes, if Specified region is selected | (Empty) | HEIGHT="647" | The total pixel height of the specified region in the image or PDF file. This parameter is only available if the Specified region parameter is selected. |
| Populate variable with OCR result text | Text | Yes | (Empty) | RESULTVARIABLE="Text" | The variable to populate with the retrieved text. |
| Exact copy (do not format text) | Yes/No | No | No | EXACTCOPY="YES" | If selected, no formatting occurs and an exact copy of the text is read. This parameter is disabled by default. |
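
As a rough sketch of how the General parameters might combine in markup, the step below restricts the scan to a region using the coordinate values shown in the Markup column. It is an illustration rather than a tested task: `Results` is a hypothetical variable name, the `PAGESEGMODE` attribute is left out because its option values are not listed in this topic, and the XML comment is an annotation only and should be removed before pasting into the Steps Panel.

```xml
<!-- Illustrative sketch: region values are taken from the Markup column above; "Results" is a hypothetical variable name. -->
<AMVARIABLE NAME="Results" />
<AMOCR OCRENGINE="tesseract" IMAGE="C:\temp\Image.jpg" TOP="223" LEFT="115" WIDTH="647" HEIGHT="647" RESULTVARIABLE="Results" EXACTCOPY="YES" />
```

Here TOP and LEFT position the region's upper-left corner and WIDTH and HEIGHT set its size in pixels, while EXACTCOPY="YES" asks for the text to be returned without formatting.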
Advanced
| Property | Type | Required | Default | Markup | Description |
|---|---|---|---|---|---|
| Page range: All | Yes/No | No | Yes | ALLPAGE="YES" | If selected (default), text is retrieved from all pages in a range, and the Pages parameter is disabled. NOTE: Only GIF images support multiple pages. |
| Page range: Page(s) | Number | No | No | | If selected, text is retrieved from specific pages in a range, and the All parameter is disabled. This parameter is disabled by default. Supports specification of a single page, specific pages, or a sequence of pages in a range (see the Markup column for examples). NOTE: Only GIF images support multiple pages. |
| Languages | Yes/No | No | English | SPANISH="YES" | Specifies which languages are found in the image file selected for the Image parameter. The declaration provides attributes for English, Turkish, Spanish, Portuguese, French, Russian, Dutch, German, and Italian. NOTE: This parameter resets if the OCR Engine parameter is switched to a different engine. |
| Use ICR (digits only) | Yes/No | No | No | ICR="YES" | If selected, ICR (Intelligent Character Recognition), a more advanced handwriting recognition system, is used to recognize numbers or digits. This parameter is disabled by default. This parameter is only available if the OCR Engine parameter is set to Legacy. |
| Invert image colors | Yes/No | No | No | INVERT="YES" | If selected, image colors are transformed from light to dark and dark to light. If this activity has trouble recognizing text, inverting can add more contrast to the text and lead to more accurate results. This parameter is only available if the OCR Engine parameter is set to Legacy. |
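
The Advanced parameters above could be combined along the following lines to read a single page in two languages. This is a sketch under stated assumptions, not verified markup: pairing `ALLPAGE="NO"` with `PAGE="2"` is inferred from the declaration, the file name `multipage_scan.gif` and the variable name `Results` are hypothetical, and the XML comment is an annotation that should be removed before pasting into the Steps Panel.

```xml
<!-- Assumption: ALLPAGE="NO" together with PAGE selects one page; the file and variable names are hypothetical. -->
<AMVARIABLE NAME="Results" />
<AMOCR OCRENGINE="tesseract" IMAGE="C:\temp\multipage_scan.gif" ALLPAGE="NO" PAGE="2" ENGLISH="YES" FRENCH="YES" RESULTVARIABLE="Results" />
```

The ICR and INVERT attributes are not shown because they apply only to the Legacy engine, whose markup value does not appear in this topic.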
Example
NOTE:
- The sample AML code below can be copied and pasted directly into the Steps Panel of the Task Builder.
- Parameters containing user credentials, files, file paths, and/or other information specific to the task must be customized before the sample code can run successfully.
Description
This sample task retrieves English and Spanish text from an image file, populates an existing variable with the results, and then displays the text in a dialog box.
<AMVARIABLE NAME="Results" />
<AMOCR ENGLISH="YES" SPANISH="YES" IMAGE="C:\temp\document_scan.png" OCRENGINE="tesseract" RESULTVARIABLE="Results" />
<AMSHOWDIALOG>%Results%</AMSHOWDIALOG>