hi,

On 7/20/22 10:07 AM, Tim Allison wrote:
Sorry...just catching up on this.  If you want the digest of the incoming bytes 
and you can configure tika-server via a config file, try this as the config 
(e.g. tika-config-digest.xml)

<properties>
   <server>
     <params>
       <digest>sha256</digest>
     </params>
   </server>
</properties>

then start the server: java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml

Then send the file: curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika

i'm already normally launching tika service as,

        cat  /etc/systemd/system/tika.service
                [Unit]
                Description=Apache Tika server
                After=network-online.target
                Requires=network-online.target

                [Service]
                SyslogIdentifier=tika
                User=tika
                Group=tika
                ExecStart=/usr/bin/java \
                 -jar /srv/tika/tika-server.jar \
!!               -c /etc/tika/tika-server-config-custom.xml

                [Install]
                WantedBy=multi-user.target

where

        cat /etc/tika/tika-server-config-custom.xml
                <?xml version="1.0" encoding="UTF-8"?>
                <properties>
                  <server>
                    <params>
                      <logLevel>debug</logLevel>
                      <port>9998</port>
                      <host>127.0.0.1</host>
                      <javaPath>/usr/bin/java</javaPath>
                      <noFork>false</noFork>
                      <forkedJvmArgs>
                        <arg>-Xms1g</arg>
                        <arg>-Xmx1g</arg>
                        <arg>-Dpdfbox.fontcache=/var/tika</arg>
                        <arg>-Dlog4j2.debug</arg>
                      </forkedJvmArgs>

!!                    <digest>sha256</digest>
                      <enableUnsecureFeatures>false</enableUnsecureFeatures>
                      <id></id>
                      <maxFiles>100000</maxFiles>
                      <maxForkedStartupMillis>120000</maxForkedStartupMillis>
                      <maxRestarts>-1</maxRestarts>
                      <minimumTimeoutMillis>30000</minimumTimeoutMillis>
                      <returnStackTrace>false</returnStackTrace>
                      <taskPulseMillis>10000</taskPulseMillis>
                      <taskTimeoutMillis>300000</taskTimeoutMillis>

                      <endpoints>
                        <endpoint>tika</endpoint>
                        <endpoint>status</endpoint>
                        <endpoint>rmeta</endpoint>
                      </endpoints>

                    </params>
                  </server>
                </properties>

DL'ing the _latest_ build

        F="tika-server-standard-2.4.2-20220720.025305-98.jar"
        D="/srv/tika"
        cd ${D}
        rm -rf TMP
        mkdir -p TMP/mod
        cd TMP
        rm -f ${F}*
        wget https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}
        cd mod

extract

        jar -xfv ../${F}

mod logging

        perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml

repack

        jar -cvmf META-INF/MANIFEST.MF ../mod.jar *

my usual target symlink

        cd ${D}
        ln -sf TMP/mod.jar tika-server.jar

stop tika service, if any

        systemctl stop tika
        systemctl disable tika
        systemctl status tika  -ln0
                ○ tika.service - Apache Tika server
                     Loaded: loaded (/etc/systemd/system/tika.service; disabled; vendor preset: disabled)
                     Active: inactive (dead)
        ps ax | grep tika
                (empty)

start manually

        /usr/bin/java \
         -jar /srv/tika/tika-server.jar \
         -c /etc/tika/tika-server-config-custom.xml

                ...
                INFO  [main] 10:49:37,925 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server  at http://127.0.0.1:9998/

console stays attached here while this process runs

        ps ax | grep tika
                29181 pts/0    Sl+    0:07 /usr/bin/java -jar /srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
                29202 pts/0    Sl+    0:16 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp /srv/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i  -c /etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-9024552766199524298 -numRestarts 0


exec in other shell window

        curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika

@ console for the *curl* command, I see

        <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
            <head>
                <meta name="pdf:PDFVersion" content="1.7"/>
                ...
                <meta name="X-TIKA:digest:SHA256" content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>

but nothing seemingly relevant/informative in the java/tika console session; 
lots of DEBUG etc, but no sha256sum info

in any case, for this scenario, checking original

        sha256sum ~/Get_Started_With_Smallpdf.pdf
                91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2  /root/Get_Started_With_Smallpdf.pdf

it's a match.
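
for repeat checks, the curl-vs-sha256sum comparison above can be scripted. a minimal sketch, assuming the digest shows up in the XHTML response as the `X-TIKA:digest:SHA256` meta tag seen above; the function names `extract_tika_sha256` and `compare_tika_digest` are made up here for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read Tika's XHTML response on stdin and print the
# value of the X-TIKA:digest:SHA256 meta tag, if present.
extract_tika_sha256() {
    # Flatten newlines first so a tag wrapped across lines still matches.
    tr '\n' ' ' |
        sed -n 's/.*name="X-TIKA:digest:SHA256"[[:space:]]*content="\([0-9a-fA-F]*\)".*/\1/p'
}

# Hypothetical helper: compare the local sha256sum of a file with the
# digest tika-server reports for the same upload.
compare_tika_digest() {
    file="$1"
    url="${2:-http://127.0.0.1:9998/tika}"   # same endpoint as the curl test
    local_sum=$(sha256sum "$file" | cut -d' ' -f1)
    tika_sum=$(curl -s -T "$file" "$url" | extract_tika_sha256)
    if [ "$local_sum" = "$tika_sum" ]; then
        echo "match: $local_sum"
    else
        echo "MISMATCH: local=$local_sum tika=$tika_sum"
        return 1
    fi
}
```

usage would be `compare_tika_digest ~/Get_Started_With_Smallpdf.pdf`; an empty `tika=` in the mismatch line means no digest tag was found in the response.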

but that's NOT testing the fail scenario.

THAT scenario is email send/receive -> dovecot -> dovecot fts-tika plugin -> 
tika-server.

config'ing dovecot to use fts-tika scanning

        fts_tika = http://127.0.0.1:9998/tika/

& generate verbose debug logs

        mail_debug = yes

when I exec that send/receive -- from, e.g., an external gmail account to my 
server

I see the attachment handoff.  1st, sent from dovecot fts-tika

        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: queue http://127.0.0.1:9998: Connection to peer 127.0.0.1:9998 claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Sent header
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 5562, buffered=5570)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: peer 127.0.0.1:9998: No more requests to service for this peer (1 connections exist, 0 pending)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 6048, buffered=6056)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent 3409, buffered=3416)
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished sending payload
        2022-07-20 11:07:02 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request to finish

        ==> /var/log/dovecot/dovecot-info.log <==
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Info: sieve: msgid=<[email protected]>: stored mail into mailbox 'INBOX'

        ==> /var/log/dovecot/dovecot-debug.log <==
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<[email protected]>: Finish implicit keep action
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<[email protected]>: Finishing actions
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: msgid=<[email protected]>: Finished executing result (final, status=ok, keep=yes)
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: multi-script: Sequence finished (status=ok, keep=yes)
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug: sieve: multi-script: Destroy
        2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt [email protected]: duplicate db: Cleanup
        2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt [email protected]: User session is finished
        2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt [email protected]: dict(file): dict destroyed

        ==> /var/log/dovecot/dovecot-info.log <==
        2022-07-20 11:07:02 lmtp(39463): Info: Disconnect from local: Logged out (state=READY)

        ==> /var/log/dovecot/dovecot-debug.log <==
        2022-07-20 11:07:06 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Got 200 response for request [Req1: PUT http://127.0.0.1:9998/tika/]: OK (took 3327 ms + 217 ms in queue)
        2022-07-20 11:07:06 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: conn 127.0.0.1:9998 [1]: Response payload stream destroyed (20 ms after initial response)
        2022-07-20 11:07:06 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished
        2022-07-20 11:07:06 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: queue http://127.0.0.1:9998: Dropping request [Req1: PUT http://127.0.0.1:9998/tika/]
        2022-07-20 11:07:06 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: host 127.0.0.1: Host is idle (timeout = 100 msecs)
        2022-07-20 11:07:06 indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>: Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Free (requests left=1)

at this point, dovecot's 'done' with the attachment as far as tika is involved, 
and it's 'in' tika-backend's control; dovecot DOES of course continue to 
process, and ultimately deliver, the email+attachment to my inbox.  where, as 
reported earlier, I can verify that the RECEIVED attachment is identical in 
size/sha256sum to the original.

i do see the handoff to tika-backend,

        ...
        DEBUG [qtp485047320-28] 11:01:15,794 org.eclipse.jetty.server.HttpChannel REQUEST for //127.0.0.1:9998/tika/ on HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=1}
        PUT //127.0.0.1:9998/tika/ HTTP/1.1
        Host: 127.0.0.1:9998
        Date: Wed, 20 Jul 2022 15:01:15 GMT
        Transfer-Encoding: chunked
        Connection: keep-alive
        Content-Type: application/pdf
        Content-Disposition: attachment; filename="Get_Started_With_Smallpdf.pdf"
        Accept: text/plain


        DEBUG [qtp485047320-28] 11:01:15,799 org.eclipse.jetty.server.HttpConnection HttpConnection@7d858986::SocketChannelEndPoint@7f055fae{l=/127.0.0.1:9998,r=/127.0.0.1:59150,OPEN,fill=-,flush=-,to=43/200000}{io=0/0,kio=0,kro=1}->HttpConnection@7d858986[p=HttpParser{s=CHUNKED_CONTENT,0 of -1},g=HttpGenerator@127a4f1e{s=START}]=>HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=6} parsed true HttpParser{s=CHUNKED_CONTENT,0 of -1}
        ...
        TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
        DEBUG [qtp485047320-31] 11:07:03,442 org.apache.cxf.transport.http.Headers Request Headers: {Accept=[text/plain], Authorization=[***], connection=[keep-alive], Content-Disposition=[attachment; filename="Get_Started_With_Smallpdf.pdf"], content-type=[application/pdf], Date=[Wed, 20 Jul 2022 15:07:02 GMT], Host=[127.0.0.1:9998], Proxy-Authorization=[***], transfer-encoding=[chunked]}
        TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.cxf.common.logging.Slf4jLogger
        ...

but no trace, that I can find in any log, of sha256sum generated by tika, as in 
the curl case above.

THAT is the necessary bit here -- getting at, and confirming, the 
size/sha256sum of what Tika has received -- from dovecot's fts-tika handoff.

how/where to get tika to spit out THAT info?
either as loggable/logged response to dovecot's http-client connection, on 
successful handoff,
in its own logs,
or, just trapping the file and checking manually?
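
one way to narrow this down from the shell is to replay a request shaped like dovecot's and look at what comes back. a rough sketch only: the header values are copied from the Jetty debug output above, while the function name `dovecot_like_put` and the `CURL` override (handy for a dry run with `CURL=echo`) are made up for illustration. whether a `text/plain` response carries the digest at all is not confirmed here; in the curl test above it appeared in the XHTML `<head>`.

```shell
#!/bin/sh
# Sketch: replay a dovecot-like PUT against tika-server.
# Hypothetical helper; header values mirror the request seen in the Jetty log.
dovecot_like_put() {
    file="$1"
    url="${2:-http://127.0.0.1:9998/tika/}"   # trailing slash, as in fts_tika
    accept="${3:-text/plain}"                 # dovecot's request sent text/plain
    # --data-binary is used instead of -T because curl -T appends the local
    # file name to a URL that ends in "/". The explicit Transfer-Encoding
    # header asks curl to send the body chunked, like dovecot's http-client.
    # Content-Type is hardcoded to match the PDF test attachment.
    "${CURL:-curl}" -s -X PUT \
        -H "Accept: ${accept}" \
        -H "Content-Type: application/pdf" \
        -H "Content-Disposition: attachment; filename=\"$(basename "$file")\"" \
        -H "Transfer-Encoding: chunked" \
        --data-binary "@${file}" \
        "$url"
}
```

e.g. `dovecot_like_put ~/Get_Started_With_Smallpdf.pdf` for the dovecot-style request; passing `'*/*'` as the third argument mirrors curl's default Accept header, which produced the XHTML output with the digest above, so the two responses can be compared side by side.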
