adding explicit OCR parser config to tika-server-config-custom.xml disables working OCR image processing?

PGNet Dev Sat, 23 Jul 2022 09:59:55 -0700

I'm running tika 2.4.2/snap + tesseract5 for OCR. ImageMagick7 is installed
for image processing.

It's serving as the backend to a dovecot/fts-tika setup.

If I exec tika with this custom config

        cat /etc/tika/tika-server-config-custom.xml
                <?xml version="1.0" encoding="UTF-8"?>
                <properties>
                  <server>
                    <params>
                      <logLevel>debug</logLevel>
                      <javaPath>/usr/bin/java</javaPath>
                      <noFork>false</noFork>
                      <forkedJvmArgs>
                        <arg>-Xms1g</arg>
                        <arg>-Xmx1g</arg>
                        <arg>-Dpdfbox.fontcache=/var/tika</arg>
                      </forkedJvmArgs>
                      <digest>sha256</digest>
                      <enableUnsecureFeatures>false</enableUnsecureFeatures>
                      <id></id>
                      <maxFiles>100000</maxFiles>
                      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
                      <maxRestarts>-1</maxRestarts>
                      <minimumTimeoutMillis>30000</minimumTimeoutMillis>
                      <returnStackTrace>false</returnStackTrace>
                      <taskPulseMillis>10000</taskPulseMillis>
                      <taskTimeoutMillis>300000</taskTimeoutMillis>
                      <endpoints>
                        <endpoint>tika</endpoint>
                        <endpoint>status</endpoint>
                        <endpoint>rmeta</endpoint>
                      </endpoints>
                    </params>
                  </server>
                </properties>

and pass a jpg as an email attachment, all's good.

I see tesseract invoked, and after receipt & indexing by dovecot, I can exec a
body search on OCR'd text from the image, and it's found as expected.

But if I just add a specific parser config to the above,

        cat /etc/tika/tika-server-config-custom.xml
                <?xml version="1.0" encoding="UTF-8"?>
                <properties>
+                 <parsers>
+                   <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
+                     <params>
+                       <param name="applyRotation" type="bool">true</param>
+                       <param name="enableImagePreprocessing" type="bool">true</param>
+                       <param name="maxFileSizeToOcr" type="long">2147483647</param>
+                       <param name="minFileSizeToOcr" type="long">0</param>
+                       <param name="preserveInterwordSpacing" type="bool">true</param>
+                       <param name="timeoutSeconds" type="int">180</param>
+                     </params>
+                   </parser>
+                 </parsers>
                  <server>
                    <params>
                ...

relaunch tika, and resend the attachment, I see _no_ errors, and the
attachment/email _is_ delivered. But I never see tesseract invoked in top,
and a search on the image text after delivery returns empty.
It's not in the index.

what in that additional parser config is causing the problem?
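
One thing that may be worth ruling out, stated as an assumption and not a
confirmed cause: an explicit <parsers> block might replace tika's default
parser set instead of extending it, in which case only the listed parser
would be loaded. A minimal sketch of a variant that keeps
org.apache.tika.parser.DefaultParser and excludes TesseractOCRParser from it,
so that the explicit entry supplies the params. It is untested in this setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: keep the default parser set, minus Tesseract,
         so that other content types are still handled -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Tesseract configured explicitly, same params as in the config above -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="applyRotation" type="bool">true</param>
        <param name="enableImagePreprocessing" type="bool">true</param>
        <param name="maxFileSizeToOcr" type="long">2147483647</param>
        <param name="minFileSizeToOcr" type="long">0</param>
        <param name="preserveInterwordSpacing" type="bool">true</param>
        <param name="timeoutSeconds" type="int">180</param>
      </params>
    </parser>
  </parsers>
  <!-- the <server> section stays exactly as in the working config -->
</properties>
```

If OCR comes back with this variant, that would point at the parser list
and not at any individual param value.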
