removing dovecot from the equation, reduced this to just tika, reproducible here

running

        ls -al /srv/tika/tika-server.jar
                lrwxrwxrwx 1 root root 50 Jul 26 05:42 /srv/tika/tika-server.jar 
-> tika-server-standard-2.4.2-20220725.215245-121.jar

        systemctl status tika -ln0
                ● tika.service - Apache Tika server
                     Loaded: loaded (/etc/systemd/system/tika.service; enabled; 
vendor preset: disabled)
                     Active: active (running) since Tue 2022-07-26 05:43:01 
EDT; 29min ago
                   Main PID: 10829 (java)
                      Tasks: 53 (limit: 8812)
                     Memory: 215.9M
                        CPU: 37.667s
                     CGroup: /system.slice/tika.service
                             ├─ 10829 /usr/bin/java 
-Dpdfbox.fontcache=/var/tika -XX:ParallelGCThreads=1 -XX:CICompilerCount=2 
-XX:-CICompilerCountPerCPU -jar /srv/tika/tika-server.jar -c 
/etc/tika/tika-server-config-custom.xml --host 127.0.0.1 --port 9998
                             └─ 10863 /usr/bin/java -Xms1g -Xmx1g 
-Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp 
/srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess 
-h 127.0.0.1 -p 9998 -i "" -c /etc/tika/tika-server-config-custom.xml 
-forkedStatusFile /tmp/apache-tika-server-forked-tmp-12945021525641519393 -numRestarts 0

on

        lsb_release -rd
                Description:    Fedora release 36 (Thirty Six)
                Release:        36

with

        tesseract --version
                tesseract 5.0.1
                 leptonica-1.82.0
                  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 
1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.3
                 Found OpenMP 201511

        stream --version
                Version: ImageMagick 7.1.0-44 Q16-HDRI x86_64 20294 
https://imagemagick.org
                Copyright: (C) 1999 ImageMagick Studio LLC
                License: https://imagemagick.org/script/license.php
                Features: Cipher DPC HDRI Modules OpenMP(4.5)
                Delegates (built-in): bzlib cairo djvu fontconfig freetype 
gslib gvc heic jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png ps 
raqm raw rsvg tiff webp wmf x xml zip zlib
                Compiler: gcc (12.1)

        java -version
                Picked up JAVA_TOOL_OPTIONS: -Xmx512M
                openjdk version "18.0.1.1" 2022-04-22
                OpenJDK Runtime Environment 22.3 (build 18.0.1.1+2)
                OpenJDK 64-Bit Server VM 22.3 (build 18.0.1.1+2, mixed mode, 
sharing)

& custom config

        cat /etc/tika/tika-server-config-custom.xml
                <?xml version="1.0" encoding="UTF-8"?>
                <properties>
                  <parsers>
                  </parsers>
                  <server>
                    <params>
                      <logLevel>debug</logLevel>
                      <javaPath>/usr/bin/java</javaPath>
                      <noFork>false</noFork>
                      <forkedJvmArgs>
                        <arg>-Xms1g</arg>
                        <arg>-Xmx1g</arg>
                        <arg>-Dpdfbox.fontcache=/var/tika</arg>
                        <arg>-Dlog4j2.debug</arg>
                      </forkedJvmArgs>
                      <digest>sha256</digest>
                      <enableUnsecureFeatures>false</enableUnsecureFeatures>
                      <id></id>
                      <maxFiles>100000</maxFiles>
                      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
                      <maxRestarts>-1</maxRestarts>
                      <minimumTimeoutMillis>30000</minimumTimeoutMillis>
                      <returnStackTrace>false</returnStackTrace>
                      <taskPulseMillis>10000</taskPulseMillis>
                      <taskTimeoutMillis>300000</taskTimeoutMillis>
                      <endpoints>
                        <endpoint>tika</endpoint>
                        <endpoint>status</endpoint>
                        <endpoint>rmeta</endpoint>
                      </endpoints>
                    </params>
                  </server>
                </properties>

on exec, passing a test pdf,

        curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta

complete metadata's returned

                <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 
Test.SNAPSHOT">
                  <rdf:RDF 
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";>
                    <rdf:Description rdf:about=""
                        xmlns:pdf="http://ns.adobe.com/pdf/1.3/";
                        xmlns:xmp="http://ns.adobe.com/xap/1.0/";
                        xmlns:dc="http://purl.org/dc/elements/1.1/";
                        xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/";
                        xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/";
                      pdf:PDFVersion="1.7"
                      pdf:hasXFA="false"
                      pdf:hasCollection="false"
                      pdf:encrypted="false"
                      pdf:hasMarkedContent="false"
                      pdf:producer="Adobe PDF Library 15.0"
                      pdf:hasXMP="true"
                      xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
                      xmp:CreateDate="2020-10-14T17:08:10Z"
                      xmp:ModifyDate="2020-10-14T17:08:10Z"
                      xmp:MetadataDate="2020-10-14T17:08:10Z"
                      dc:format="application/pdf; version=1.7"
                      dc:language="en-US"
                      
xmpMM:DocumentID="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"
                      xmpTPg:NPages="1">
                      <pdf:unmappedUnicodeCharsPerPage>
                        <rdf:Seq>
                          <rdf:li>0</rdf:li>
                        </rdf:Seq>
                      </pdf:unmappedUnicodeCharsPerPage>
                      <pdf:charsPerPage>
                        <rdf:Seq>
                          <rdf:li>794</rdf:li>
                        </rdf:Seq>
                      </pdf:charsPerPage>
                      <pdf:annotationTypes>
                        <rdf:Bag>
                          <rdf:li>95e8dd6e9b4c5a3d-3d44cd989a3a348c</rdf:li>
                          <rdf:li>95e8dd6f9b4c5a3e-3d44cd979a3a348b</rdf:li>
                          <rdf:li>95e8dd709b4c5a3f-3d44cd969a3a348a</rdf:li>
                          <rdf:li>95e8dd719b4c5a40-3d44cd959a3a3489</rdf:li>
                          <rdf:li>95e8dd729b4c5a41-3d44cd949a3a3488</rdf:li>
                        </rdf:Bag>
                      </pdf:annotationTypes>
                      <pdf:annotationSubtypes>
                        <rdf:Bag>
                          <rdf:li>Link</rdf:li>
                        </rdf:Bag>
                      </pdf:annotationSubtypes>
                    </rdf:Description>
                  </rdf:RDF>
                </x:xmpmeta>

if i add TesseractOCRParser class config to the above, for simple param override

        cat /etc/tika/tika-server-config-custom.xml
                <?xml version="1.0" encoding="UTF-8"?>
                <properties>
                  <parsers>
+                       <parser 
class="org.apache.tika.parser.ocr.TesseractOCRParser">
+                         <params>
+                           <param name="timeoutSeconds" type="int">180</param>
+                         </params>
+                       </parser>
                  </parsers>
                ...

exec

        systemctl restart tika
        curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta

returns incomplete/truncated data

                <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 
Test.SNAPSHOT">
                  <rdf:RDF 
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";>
                    <rdf:Description rdf:about=""/>
                  </rdf:RDF>
                </x:xmpmeta>

Reply via email to