I'd stared repeatedly at
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
in the docs. it seemed reasonable that, since TesseractOCRParser *is* the default
parser, excluding it made no sense.
guess not!
with this config,
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
+ <parser class="org.apache.tika.parser.DefaultParser">
+ <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
+ </parser>
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
<params>
<param name="skipOcr" type="bool">false</param>
<param name="tessdataPath" type="string">/usr/share/tesseract/tessdata</param>
<param name="tesseractPath" type="string">/usr/bin</param>
<param name="maxFileSizeToOcr" type="long">2147483647</param>
<param name="minFileSizeToOcr" type="long">0</param>
<param name="applyRotation" type="bool">true</param>
<param name="enableImagePreprocessing" type="bool">true</param>
<param name="preserveInterwordSpacing" type="bool">true</param>
<param name="timeoutSeconds" type="int">180</param>
</params>
</parser>
</parsers>
<server>
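for anyone wanting to sanity-check their own config against the pattern above, here is a rough sketch in python. it only checks the structure shown here (DefaultParser declared, and a parser-exclude for every parser that is also configured on its own); the function name `config_problems` and the sample string are made up for illustration, and it says nothing about how tika itself resolves parsers.

```python
import xml.etree.ElementTree as ET

DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser"

# hypothetical sample, trimmed from the config above
SAMPLE = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
  </parsers>
</properties>"""


def config_problems(config_xml: str) -> list[str]:
    """List deviations from the 'exclude from DefaultParser, then configure separately' pattern."""
    root = ET.fromstring(config_xml)
    parsers = root.find("parsers")
    if parsers is None:
        return ["no <parsers> section found"]

    has_default = False
    excluded: set[str] = set()
    explicit: list[str] = []
    for parser in parsers.findall("parser"):
        cls = parser.get("class", "")
        if cls == DEFAULT_PARSER:
            has_default = True
            # collect every class excluded from DefaultParser
            excluded.update(e.get("class", "") for e in parser.findall("parser-exclude"))
        else:
            explicit.append(cls)

    problems = []
    if not has_default:
        problems.append("DefaultParser is not declared")
    for cls in explicit:
        if cls not in excluded:
            problems.append(f"{cls} is configured separately but not excluded from DefaultParser")
    return problems


if __name__ == "__main__":
    print(config_problems(SAMPLE))  # prints [] for this sample
```

run against the config before the "+" lines were added, it would report both a missing DefaultParser and a missing exclusion.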
curl correctly returns
curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/meta
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
pdf:PDFVersion="1.7"
pdf:hasXFA="false"
pdf:hasCollection="false"
pdf:encrypted="false"
pdf:hasMarkedContent="false"
pdf:producer="Adobe PDF Library 15.0"
pdf:hasXMP="true"
xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
xmp:CreateDate="2020-10-14T17:08:10Z"
xmp:ModifyDate="2020-10-14T17:08:10Z"
xmp:MetadataDate="2020-10-14T17:08:10Z"
dc:format="application/pdf; version=1.7"
dc:language="en-US"
xmpMM:DocumentID="xmp.id:7a865d84-8dbf-4015-96b7-fdae89a9603b"
xmpTPg:NPages="1">
<pdf:unmappedUnicodeCharsPerPage>
<rdf:Seq>
<rdf:li>0</rdf:li>
</rdf:Seq>
</pdf:unmappedUnicodeCharsPerPage>
<pdf:charsPerPage>
<rdf:Seq>
<rdf:li>794</rdf:li>
</rdf:Seq>
</pdf:charsPerPage>
<pdf:annotationTypes>
<rdf:Bag>
<rdf:li>95e8dd6e9b4c5a3d-3d44cd989a3a348c</rdf:li>
<rdf:li>95e8dd6f9b4c5a3e-3d44cd979a3a348b</rdf:li>
<rdf:li>95e8dd709b4c5a3f-3d44cd969a3a348a</rdf:li>
<rdf:li>95e8dd719b4c5a40-3d44cd959a3a3489</rdf:li>
<rdf:li>95e8dd729b4c5a41-3d44cd949a3a3488</rdf:li>
</rdf:Bag>
</pdf:annotationTypes>
<pdf:annotationSubtypes>
<rdf:Bag>
<rdf:li>Link</rdf:li>
</rdf:Bag>
</pdf:annotationSubtypes>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
and
journalctl -f -u tika
Jul 26 10:50:05 mx-test.example.net tika[14096]: INFO
[qtp641030345-33] 10:50:05,573
org.apache.tika.server.core.resource.MetadataResource /meta (autodetecting type)
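the curl check above can also be scripted. a minimal sketch with python's standard library, assuming the same server address as in the curl command; `curl -T` uploads with HTTP PUT, so the request does the same. the helper names are hypothetical, and the explicit `Content-Type: application/pdf` header is my assumption (curl sent none, hence "autodetecting type" in the log), so expect the log line to differ.

```python
import urllib.request

TIKA_URL = "http://127.0.0.1:9998"  # same address as in the curl command


def build_meta_request(pdf_bytes: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request to the /meta endpoint without sending it."""
    return urllib.request.Request(
        f"{base_url}/meta",
        data=pdf_bytes,
        method="PUT",
        # assumption: declare the type explicitly instead of relying on autodetection
        headers={"Content-Type": "application/pdf"},
    )


def fetch_meta(path: str, base_url: str = TIKA_URL, timeout: float = 180.0) -> str:
    """Upload the file at `path` and return the server's response body as text."""
    with open(path, "rb") as fh:
        request = build_meta_request(fh.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

calling `fetch_meta` with the path to a local PDF needs the tika server to be running; `build_meta_request` on its own does no network I/O.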
finally, with the same config, on receipt of an email, dovecot submits to the tika
backend:
journalctl -f -u tika
Jul 26 11:16:04 mx-test.example.net tika[14096]: INFO
[qtp641030345-31] 11:16:04,013
org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
dovecot logs
==> /var/log/dovecot/dovecot-debug.log <==
2022-07-26 11:16:03
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve: Xapian library version: 1.4.19
2022-07-26 11:16:03
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve(INBOX): Opened DB (RO) messages=0 version=1 shards=1
2022-07-26 11:16:03
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve(INBOX): Last UID uid=0
2022-07-26 11:16:03
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve(INBOX): Last UID uid=0
2022-07-26 11:16:03
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve(INBOX): Opened DB (RW; current.1658584490654708) messages=0
version=1
2022-07-26 11:16:03
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve(INBOX): Indexing uid=93698
the pause here is the submission to tika, and its return ...
2022-07-26 11:16:04
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve(INBOX): Committed 1 changes to DB (RW;
current.1658584490654708) in 0.074 secs
2022-07-26 11:16:04
indexer-worker([email protected])<ShgPEzMF4GK6TAAA+IOfAw:4HbMNTMF4GK+TAAA+IOfAw>:
Debug: fts-flatcurve: Update transaction completed in 0.386 secs
... with the index then successfully updated
yay. o/