I'd stared repeatedly at

        <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>

in the docs. it seemed reasonable that, since TesseractOCRParser *is* the default
parser, excluding it made no sense.

guess not!

with config,

        <?xml version="1.0" encoding="UTF-8"?>
        <properties>
          <parsers>
+           <parser class="org.apache.tika.parser.DefaultParser">
+             <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
+           </parser>
            <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
              <params>
                <param name="skipOcr" type="bool">false</param>
                <param name="tessdataPath" type="string">/usr/share/tesseract/tessdata</param>
                <param name="tesseractPath" type="string">/usr/bin</param>

                <param name="maxFileSizeToOcr" type="long">2147483647</param>
                <param name="minFileSizeToOcr" type="long">0</param>

                <param name="applyRotation" type="bool">true</param>
                <param name="enableImagePreprocessing" type="bool">true</param>
                <param name="preserveInterwordSpacing" type="bool">true</param>
                <param name="timeoutSeconds" type="int">180</param>
              </params>
            </parser>
          </parsers>
          <server>
curl correctly returns

        curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
                <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                    <rdf:Description rdf:about=""
                        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
                        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
                        xmlns:dc="http://purl.org/dc/elements/1.1/"
                        xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
                        xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
                      pdf:PDFVersion="1.7"
                      pdf:hasXFA="false"
                      pdf:hasCollection="false"
                      pdf:encrypted="false"
                      pdf:hasMarkedContent="false"
                      pdf:producer="Adobe PDF Library 15.0"
                      pdf:hasXMP="true"
                      xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
                      xmp:CreateDate="2020-10-14T17:08:10Z"
                      xmp:ModifyDate="2020-10-14T17:08:10Z"
                      xmp:MetadataDate="2020-10-14T17:08:10Z"
                      dc:format="application/pdf; version=1.7"
                      dc:language="en-US"
                      xmpMM:DocumentID="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"
                      xmpTPg:NPages="1">
                      <pdf:unmappedUnicodeCharsPerPage>
                        <rdf:Seq>
                          <rdf:li>0</rdf:li>
                        </rdf:Seq>
                      </pdf:unmappedUnicodeCharsPerPage>
                      <pdf:charsPerPage>
                        <rdf:Seq>
                          <rdf:li>794</rdf:li>
                        </rdf:Seq>
                      </pdf:charsPerPage>
                      <pdf:annotationTypes>
                        <rdf:Bag>
                          <rdf:li>95e8dd6e9b4c5a3d-3d44cd989a3a348c</rdf:li>
                          <rdf:li>95e8dd6f9b4c5a3e-3d44cd979a3a348b</rdf:li>
                          <rdf:li>95e8dd709b4c5a3f-3d44cd969a3a348a</rdf:li>
                          <rdf:li>95e8dd719b4c5a40-3d44cd959a3a3489</rdf:li>
                          <rdf:li>95e8dd729b4c5a41-3d44cd949a3a3488</rdf:li>
                        </rdf:Bag>
                      </pdf:annotationTypes>
                      <pdf:annotationSubtypes>
                        <rdf:Bag>
                          <rdf:li>Link</rdf:li>
                        </rdf:Bag>
                      </pdf:annotationSubtypes>
                    </rdf:Description>
                  </rdf:RDF>
                </x:xmpmeta>

and

        journalctl -f -u tika

                Jul 26 10:50:05 mx-test.example.net tika[14096]: INFO  [qtp641030345-33] 10:50:05,573 org.apache.tika.server.core.resource.MetadataResource /meta (autodetecting type)

finally, with the same config, on receipt of an email and its submission to the
tika backend via dovecot,

        journalctl -f -u tika

                Jul 26 11:16:04 mx-test.example.net tika[14096]: INFO  [qtp641030345-31] 11:16:04,013 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)


dovecot logs

        ==> /var/log/dovecot/dovecot-debug.log <==
        2022-07-26 11:16:03 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve: Xapian library version: 1.4.19
        2022-07-26 11:16:03 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Opened DB (RO) messages=0 version=1 shards=1
        2022-07-26 11:16:03 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Last UID uid=0
        2022-07-26 11:16:03 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Last UID uid=0
        2022-07-26 11:16:03 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Opened DB (RW; current.1658584490654708) messages=0 version=1
        2022-07-26 11:16:03 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Indexing uid=93698

the pause here is the submission to tika, and its return ...

        2022-07-26 11:16:04 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve(INBOX): Committed 1 changes to DB (RW; current.1658584490654708) in 0.074 secs
        2022-07-26 11:16:04 indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>: Debug: fts-flatcurve: Update transaction completed in 0.386 secs

... followed by the successfully updated index

yay. o/
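
for anyone wanting to sanity-check a config for the same pattern (TesseractOCRParser excluded from DefaultParser, then configured as its own parser), here is a minimal sketch in Python. it is not part of tika; the function name `check_ocr_config` and the trimmed `SAMPLE` config are made up for illustration, and it only inspects the XML structure, nothing about how tika itself loads it.

```python
import xml.etree.ElementTree as ET

OCR = "org.apache.tika.parser.ocr.TesseractOCRParser"
DEFAULT = "org.apache.tika.parser.DefaultParser"


def check_ocr_config(xml_text):
    """Return (excluded_from_default, configured_separately).

    Hypothetical helper: it only looks at <parser> elements that are
    direct children of <parsers>.
    """
    root = ET.fromstring(xml_text)
    parsers = root.find("parsers")
    if parsers is None:
        return (False, False)

    excluded = False
    separate = False
    for parser in parsers.findall("parser"):
        cls = parser.get("class")
        if cls == DEFAULT:
            # DefaultParser must carry a <parser-exclude> for the OCR parser
            excluded = any(
                e.get("class") == OCR for e in parser.findall("parser-exclude")
            )
        elif cls == OCR:
            # the OCR parser is declared on its own, with its own <params>
            separate = True
    return (excluded, separate)


# trimmed-down illustration of the layout, not a complete tika config
SAMPLE = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="skipOcr" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>"""

if __name__ == "__main__":
    print(check_ocr_config(SAMPLE))
```

both values should be true for a config laid out like the one in this message; if the first is false, the exclude on DefaultParser is the piece that is missing.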
