On startup of tika (tika-server-standard-2.4.2-20220723.145242-114.jar),
modified to enable debug logs, with this config (TesseractOCRParser entry
commented out):
...
<properties>
<parsers>
<!--
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
</parser>
-->
...
systemctl restart tika
journalctl -f -u tika | grep -i tesseract
Jul 23 14:07:18 mx-test tika[40896]: TRACE StatusLogger
Log4jLoggerFactory.getContext() found anchor class
org.apache.tika.parser.ocr.TesseractOCRParser
Jul 23 14:07:18 mx-test tika[40896]: TRACE StatusLogger
Log4jLoggerFactory.getContext() found anchor class
org.apache.tika.parser.ocr.TesseractOCRConfig
Jul 23 14:07:18 mx-test tika[40896]: DEBUG [main] 14:07:18,257
org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:07:18 mx-test tika[40896]: DEBUG [main] 14:07:18,259
org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]):
true
Jul 23 14:07:19 mx-test tika[40896]: DEBUG [main] 14:07:19,979
org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:07:19 mx-test tika[40896]: DEBUG [main] 14:07:19,980
org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]):
true
Jul 23 14:07:20 mx-test tika[40896]: DEBUG [main] 14:07:20,140
org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:07:20 mx-test tika[40896]: DEBUG [main] 14:07:20,141
org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]):
true
On receipt of an email with an image attachment, tesseract IS invoked:
Jul 23 14:18:58 mx-test tika[41388]: INFO [qtp444127949-31]
14:18:58,527 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract is
installed and is being invoked. This can add greatly to processing time. If
you do not want tesseract to be applied to your files see:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
Jul 23 14:18:58 mx-test tika[41388]: DEBUG [qtp444127949-31]
14:18:58,530 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract command:
tesseract /tmp/apache-tika-5913626684069109310.tmp
/tmp/apache-tika-17854463223950821902.tmp --psm 1 -l eng -c page_separator= -c
preserve_interword_spaces=0 txt
Jul 23 14:19:04 mx-test tika[41388]: DEBUG [Thread-23] 14:19:04,973
org.apache.tika.parser.ocr.TesseractOCRParser
Jul 23 14:19:04 mx-test tika[41388]: DEBUG [Thread-24] 14:19:04,973
org.apache.tika.parser.ocr.TesseractOCRParser Estimating resolution as 304
and the parsed image result _is_ passed back to dovecot, where it's correctly
indexed and the embedded terms are searchable.
OTOH, with the parser entry uncommented in the config:
...
<properties>
<parsers>
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
</parser>
...
on startup, the same tesseract checks are logged:
Jul 23 14:15:32 mx-test tika[41205]: TRACE StatusLogger
Log4jLoggerFactory.getContext() found anchor class
org.apache.tika.parser.ocr.TesseractOCRParser
Jul 23 14:15:32 mx-test tika[41205]: TRACE StatusLogger
Log4jLoggerFactory.getContext() found anchor class
org.apache.tika.parser.ocr.TesseractOCRConfig
Jul 23 14:15:32 mx-test tika[41205]: DEBUG [main] 14:15:32,685
org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:15:32 mx-test tika[41205]: DEBUG [main] 14:15:32,686
org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]):
true
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,472
org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,473
org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]):
true
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,631
org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,632
org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]):
true
But on receipt of an email with an image attachment, the same grep shows
nothing tesseract-related:
(empty)
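To tell the two cases apart without eyeballing the journal, here is a rough sketch that checks for the "Tesseract command:" DEBUG line seen in the first case. The function name ocr_invoked is made up for illustration, and it assumes debug logging is enabled as above so that line is actually emitted.

```shell
#!/bin/sh
# Hypothetical helper (not part of tika): reads journal lines on stdin.
# Exits 0 if tesseract was actually run on a document, i.e. a
# "Tesseract command:" DEBUG line is present.
# Exits 1 if only the startup hasTesseract probes (or nothing) appear.
ocr_invoked() {
    grep -q 'Tesseract command:'
}

# Example use after sending a test mail (time window is an assumption):
#   journalctl -u tika --since "10 minutes ago" | ocr_invoked \
#       && echo "OCR ran" || echo "no OCR"
```

Matching on the command line rather than on "tesseract" in general avoids false positives from the startup probe lines, which appear under both configs.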