on startup of tika-server (modified to enable debug logs),

        tika tika-server-standard-2.4.2-20220723.145242-114.jar

with the TesseractOCRParser entry commented out in the config,

        ...
        <properties>
          <parsers>
        <!--
            <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            </parser>
        -->
          ...

        systemctl restart tika
        journalctl -f -u tika | grep -i tesseract

        Jul 23 14:07:18 mx-test tika[40896]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRParser
        Jul 23 14:07:18 mx-test tika[40896]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRConfig
        Jul 23 14:07:18 mx-test tika[40896]: DEBUG [main] 14:07:18,257 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
        Jul 23 14:07:18 mx-test tika[40896]: DEBUG [main] 14:07:18,259 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
        Jul 23 14:07:19 mx-test tika[40896]: DEBUG [main] 14:07:19,979 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
        Jul 23 14:07:19 mx-test tika[40896]: DEBUG [main] 14:07:19,980 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
        Jul 23 14:07:20 mx-test tika[40896]: DEBUG [main] 14:07:20,140 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
        Jul 23 14:07:20 mx-test tika[40896]: DEBUG [main] 14:07:20,141 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true

on receipt of an email with an image attachment, tesseract IS invoked

        Jul 23 14:18:58 mx-test tika[41388]: INFO  [qtp444127949-31] 14:18:58,527 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract is installed and is being invoked. This can add greatly to processing time.  If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
        Jul 23 14:18:58 mx-test tika[41388]: DEBUG [qtp444127949-31] 14:18:58,530 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract command: tesseract /tmp/apache-tika-5913626684069109310.tmp /tmp/apache-tika-17854463223950821902.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
        Jul 23 14:19:04 mx-test tika[41388]: DEBUG [Thread-23] 14:19:04,973 org.apache.tika.parser.ocr.TesseractOCRParser
        Jul 23 14:19:04 mx-test tika[41388]: DEBUG [Thread-24] 14:19:04,973 org.apache.tika.parser.ocr.TesseractOCRParser Estimating resolution as 304

, and the parsed image result _is_ passed back to dovecot, where it's correctly 
indexed, and embedded terms are searchable

otoh, with the TesseractOCRParser entry enabled in the config

        ...
        <properties>
          <parsers>
            <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            </parser>
          ...

on startup, the log output is the same

        Jul 23 14:15:32 mx-test tika[41205]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRParser
        Jul 23 14:15:32 mx-test tika[41205]: TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.parser.ocr.TesseractOCRConfig
        Jul 23 14:15:32 mx-test tika[41205]: DEBUG [main] 14:15:32,685 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
        Jul 23 14:15:32 mx-test tika[41205]: DEBUG [main] 14:15:32,686 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
        Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,472 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
        Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,473 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true
        Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,631 org.apache.tika.parser.external.ExternalParser exit value for tesseract: 1
        Jul 23 14:15:34 mx-test tika[41205]: DEBUG [main] 14:15:34,632 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path: [tesseract]): true

but, on receipt of an email with an image attachment, nothing tesseract-related is logged

        (empty)
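
fwiw, one way to tell the two cases apart in the journal is to count only the "Tesseract command:" lines, since the startup probes log "hasTesseract" in both configs. A minimal sketch only: count_ocr_invocations is a made-up helper name, not part of tika, and the unit name "tika" is taken from the commands above.

```shell
# Hypothetical helper (not part of tika): reads log text on stdin and
# prints how many lines record an actual tesseract run.
count_ocr_invocations() {
    # Startup probes log "hasTesseract"; real OCR runs log "Tesseract command:".
    # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
    grep -c 'Tesseract command:' || true
}

# Usage, assuming the systemd unit is named "tika":
#   journalctl -u tika --since "10 min ago" | count_ocr_invocations
```

A count of 0 after sending a test message would correspond to the (empty) case above, while any non-zero count means tesseract was actually run.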
