Jeremy Dyer created NIFI-1815:
---------------------------------

             Summary: Tesseract OCR Processor
                 Key: NIFI-1815
                 URL: https://issues.apache.org/jira/browse/NIFI-1815
             Project: Apache NiFi
          Issue Type: Improvement
            Reporter: Jeremy Dyer
            Assignee: Jeremy Dyer


This ticket is a follow-up to NIFI-1718, minus the use of the Tika library.

Expose OCR capabilities through a new processor that uses the Tesseract
library. Using this processor requires that Tesseract be installed on the
NiFi host. Because the processor has a system dependency, care must be taken
to ensure that the overall NiFi cluster continues to function properly when
Tesseract is absent, even though the OCR processor itself will be unable to
perform its duties. If the system dependency is not detected, the processor
should display a validation warning rather than failing or preventing the
NiFi instance from booting properly.
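
A minimal sketch of the detection step described above, in plain Java with no
NiFi classes. The class name TesseractCheck, the method validate, and the
warning text are hypothetical and only illustrate the idea: look for a
tesseract executable and return a warning message instead of throwing, so that
a missing install never blocks startup.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of NiFi or Tesseract.
final class TesseractCheck {
    // Executable names to look for (Unix-like systems and Windows).
    private static final String[] NAMES = {"tesseract", "tesseract.exe"};

    /**
     * Returns a warning message when no tesseract executable is found,
     * or an empty Optional when one is found. Never throws for a missing install.
     *
     * @param tesseractPath configured installation folder; null or empty means "use systemPath"
     * @param systemPath    value of the PATH environment variable (may be null)
     */
    static Optional<String> validate(String tesseractPath, String systemPath) {
        String[] dirs = (tesseractPath != null && !tesseractPath.isEmpty())
                ? new String[] {tesseractPath}
                : (systemPath == null ? "" : systemPath).split(File.pathSeparator);
        for (String dir : dirs) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            for (String name : NAMES) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.empty(); // found: nothing to report
                    }
                } catch (InvalidPathException e) {
                    // Malformed PATH entry: ignore it and keep searching.
                }
            }
        }
        return Optional.of("Tesseract executable not found; OCR will be unavailable on this node.");
    }
}
```

A processor could call TesseractCheck.validate(configuredPath,
System.getenv("PATH")) during validation and surface any returned message as a
warning while the rest of the flow keeps running.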

Properties exposed to configure Tesseract:
tesseractPath - Path to tesseract installation folder, if not on system path.
language - Language ID (e.g. "eng"); language dictionary to be used.
pageSegMode - Tesseract page segmentation mode, defaults to 1.
minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
Integer.MAX_VALUE.
timeout - Maximum time (in seconds) to wait for the OCR process termination; 
defaults to 120.
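
One way the properties above could map onto a call to the tesseract command
line is sketched below. This is an illustration under stated assumptions, not
the processor's implementation: the class TesseractSettings and its methods
are hypothetical, the "eng" default for language is assumed (the ticket gives
it only as an example), file sizes are assumed to be in bytes, and the --psm
flag is the spelling used by Tesseract 4 and later (3.x releases use -psm).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for the properties listed in the ticket.
final class TesseractSettings {
    String tesseractPath = "";                  // empty: rely on the system PATH
    String language = "eng";                    // assumed default
    int pageSegMode = 1;                        // default from the ticket
    long minFileSizeToOcr = 0;                  // default from the ticket (bytes assumed)
    long maxFileSizeToOcr = Integer.MAX_VALUE;  // default from the ticket (bytes assumed)
    long timeoutSeconds = 120;                  // default from the ticket

    /** True when the file size falls inside the configured [min, max] range. */
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }

    /** Builds the argument list: tesseract <image> stdout -l <lang> --psm <mode>. */
    List<String> buildCommand(String imagePath) {
        String exe = tesseractPath.isEmpty()
                ? "tesseract"
                : Paths.get(tesseractPath, "tesseract").toString();
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        cmd.add(imagePath);
        cmd.add("stdout");                      // write recognized text to standard output
        cmd.add("-l");
        cmd.add(language);
        cmd.add("--psm");                       // "-psm" on Tesseract 3.x
        cmd.add(String.valueOf(pageSegMode));
        return cmd;
    }

    /** Runs tesseract and returns the recognized text, enforcing the timeout. */
    String run(String imagePath) throws IOException, InterruptedException {
        // Send stdout to a temp file so a full pipe buffer cannot stall the child.
        Path out = Files.createTempFile("ocr", ".txt");
        try {
            Process p = new ProcessBuilder(buildCommand(imagePath))
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();            // stop the OCR process on timeout
                throw new IOException("tesseract timed out after " + timeoutSeconds + " s");
            }
            if (p.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + p.exitValue());
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

With the defaults, new TesseractSettings().buildCommand("scan.png") yields the
arguments tesseract, scan.png, stdout, -l, eng, --psm, 1. Checking shouldOcr
before calling run keeps out-of-range files from being sent to OCR at all.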




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
