I'm running Tika 2.4.2/snap + Tesseract 5 for OCR. ImageMagick 7 is installed
for image processing.
It's serving as the backend to a dovecot/fts-tika setup.
If I exec Tika with a custom config
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <server>
    <params>
      <logLevel>debug</logLevel>
      <javaPath>/usr/bin/java</javaPath>
      <noFork>false</noFork>
      <forkedJvmArgs>
        <arg>-Xms1g</arg>
        <arg>-Xmx1g</arg>
        <arg>-Dpdfbox.fontcache=/var/tika</arg>
      </forkedJvmArgs>
      <digest>sha256</digest>
      <enableUnsecureFeatures>false</enableUnsecureFeatures>
      <id></id>
      <maxFiles>100000</maxFiles>
      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
      <maxRestarts>-1</maxRestarts>
      <minimumTimeoutMillis>30000</minimumTimeoutMillis>
      <returnStackTrace>false</returnStackTrace>
      <taskPulseMillis>10000</taskPulseMillis>
      <taskTimeoutMillis>300000</taskTimeoutMillis>
      <endpoints>
        <endpoint>tika</endpoint>
        <endpoint>status</endpoint>
        <endpoint>rmeta</endpoint>
      </endpoints>
    </params>
  </server>
</properties>
and pass a jpg as an email attachment, all's good.
I see tesseract invoked, and after receipt & indexing by dovecot, I can exec a
body search on the OCR'd text from the image, and it's found as expected.
But if I just add a specific parser config to the above,
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
+   <parsers>
+     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
+       <params>
+         <param name="applyRotation" type="bool">true</param>
+         <param name="enableImagePreprocessing" type="bool">true</param>
+         <param name="maxFileSizeToOcr" type="long">2147483647</param>
+         <param name="minFileSizeToOcr" type="long">0</param>
+         <param name="preserveInterwordSpacing" type="bool">true</param>
+         <param name="timeoutSeconds" type="int">180</param>
+       </params>
+     </parser>
+   </parsers>
    <server>
      <params>
        ...
relaunch Tika, and resend the attachment, I see _no_ errors, and the
attachment/email _is_ delivered.
But I never see tesseract invoked in top, and a search on the image text after
delivery returns empty.
It's not in the index.
What in that additional parser config is causing the problem?
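
One variant that might be worth testing, as a sketch only: a <parsers> block in a Tika config can replace the default parser set rather than extend it, so the version below keeps DefaultParser, excludes the Tesseract parser from it, and then declares the Tesseract parser separately with the same params. Whether this is the actual cause here is an assumption and is not confirmed by anything above. The <server> section is left out of the sketch and would stay exactly as in the working config.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parser set, minus Tesseract, so other types still parse -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Declare Tesseract explicitly, with the same params as in the failing config -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="applyRotation" type="bool">true</param>
        <param name="enableImagePreprocessing" type="bool">true</param>
        <param name="maxFileSizeToOcr" type="long">2147483647</param>
        <param name="minFileSizeToOcr" type="long">0</param>
        <param name="preserveInterwordSpacing" type="bool">true</param>
        <param name="timeoutSeconds" type="int">180</param>
      </params>
    </parser>
  </parsers>
  <!-- the server section goes here, unchanged from the working config -->
</properties>
```

If that makes no difference, removing the params one at a time (starting with applyRotation and enableImagePreprocessing) would at least show which setting changes the behavior.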