hi,
On 7/20/22 10:07 AM, Tim Allison wrote:
Sorry...just catching up on this. If you want the digest of the incoming bytes
and you can configure tika-server via a config file, try this as the config
(e.g. tika-config-digest.xml)
<properties>
<server>
<params>
<digest>sha256</digest>
</params>
</server>
</properties>
then start the server: java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml
Then send the file: curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika
i'm already normally launching tika service as,
cat /etc/systemd/system/tika.service
[Unit]
Description=Apache Tika server
After=network-online.target
Requires=network-online.target
[Service]
SyslogIdentifier=tika
User=tika
Group=tika
ExecStart=/usr/bin/java \
-jar /srv/tika/tika-server.jar \
!! -c /etc/tika/tika-server-config-custom.xml
[Install]
WantedBy=multi-user.target
where
cat /etc/tika/tika-server-config-custom.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<server>
<params>
<logLevel>debug</logLevel>
<port>9998</port>
<host>127.0.0.1</host>
<javaPath>/usr/bin/java</javaPath>
<noFork>false</noFork>
<forkedJvmArgs>
<arg>-Xms1g</arg>
<arg>-Xmx1g</arg>
<arg>-Dpdfbox.fontcache=/var/tika</arg>
<arg>-Dlog4j2.debug</arg>
</forkedJvmArgs>
!! <digest>sha256</digest>
<enableUnsecureFeatures>false</enableUnsecureFeatures>
<id></id>
<maxFiles>100000</maxFiles>
<maxForkedStartupMillis>120000</maxForkedStartupMillis>
<maxRestarts>-1</maxRestarts>
<minimumTimeoutMillis>30000</minimumTimeoutMillis>
<returnStackTrace>false</returnStackTrace>
<taskPulseMillis>10000</taskPulseMillis>
<taskTimeoutMillis>300000</taskTimeoutMillis>
<endpoints>
<endpoint>tika</endpoint>
<endpoint>status</endpoint>
<endpoint>rmeta</endpoint>
</endpoints>
</params>
</server>
</properties>
DL'ing the _latest_ build
# snapshot jar name and install dir
F="tika-server-standard-2.4.2-20220720.025305-98.jar"
D="/srv/tika"
# start from a fresh scratch dir
cd "${D}"
rm -rf TMP
mkdir -p TMP/mod
cd TMP
rm -f "${F}"*
wget "https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server-standard/2.4.2-SNAPSHOT/${F}"
# extract
cd mod
jar -xvf "../${F}"
# mod logging: raise the root logger from info to debug
perl -pi -e 's|Root level="info"|Root level="debug"|g' log4j2.xml
# repack, reusing the original manifest
jar -cvmf META-INF/MANIFEST.MF ../mod.jar *
# my usual target symlink
cd "${D}"
ln -sf TMP/mod.jar tika-server.jar
stop tika service, if any
systemctl stop tika
systemctl disable tika
systemctl status tika -ln0
○ tika.service - Apache Tika server
Loaded: loaded (/etc/systemd/system/tika.service;
disabled; vendor preset: disabled)
Active: inactive (dead)
ps ax | grep tika
(empty)
start manually
/usr/bin/java \
-jar /srv/tika/tika-server.jar \
-c /etc/tika/tika-server-config-custom.xml
...
INFO [main] 10:49:37,925
org.apache.tika.server.core.TikaServerProcess Started Apache Tika server at
http://127.0.0.1:9998/
(console stays attached here for this active process)
ps ax | grep tika
29181 pts/0 Sl+ 0:07 /usr/bin/java -jar
/srv/tika/tika-server.jar -c /etc/tika/tika-server-config-custom.xml
29202 pts/0 Sl+ 0:16 /usr/bin/java -Xms1g -Xmx1g
-Dpdfbox.fontcache=/var/tika -Dlog4j2.debug -Djava.awt.headless=true -cp
/srv/tika/tika-server.jar -Dtika.server.id=
org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i -c
/etc/tika/tika-server-config-custom.xml -forkedStatusFile
/tmp/apache-tika-server-forked-tmp-9024552766199524298 -numRestarts 0
exec in other shell window
curl -T ~/Get_Started_With_Smallpdf.pdf http://127.0.0.1:9998/tika
in the console for the *curl* command, I see
<?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.7"/>
...
<meta name="X-TIKA:digest:SHA256"
content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>
but nothing seemingly relevant/informative in the java/tika console session;
lots of DEBUG etc, but no sha256sum info
in any case, for this scenario, checking original
sha256sum ~/Get_Started_With_Smallpdf.pdf
91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2
/root/Get_Started_With_Smallpdf.pdf
it's a match.
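for repeat runs, that comparison can be scripted. a minimal sketch, assuming the server is reachable at the host/port from the config above and that the digest shows up in the XHTML as the `X-TIKA:digest:SHA256` meta tag seen in the curl output; the function names `tika_digest` and `check_tika_digest` are made up for illustration:

```shell
# Extract the SHA256 digest that tika-server embeds in its XHTML output.
# Reads the response on stdin; prints the 64-hex-char digest, or nothing.
tika_digest() {
  # join wrapped lines first, since the meta tag may span a line break
  tr '\n' ' ' |
    sed -n 's/.*name="X-TIKA:digest:SHA256"[[:space:]]*content="\([0-9a-f]\{64\}\)".*/\1/p'
}

# Compare Tika's reported digest against the local file's sha256sum.
# $1 = file, $2 = endpoint (default is an assumption taken from the config above)
check_tika_digest() {
  file=$1
  url=${2:-http://127.0.0.1:9998/tika}
  local_sum=$(sha256sum "$file" | cut -d' ' -f1)
  tika_sum=$(curl -sS -T "$file" "$url" | tika_digest)
  if [ -n "$tika_sum" ] && [ "$local_sum" = "$tika_sum" ]; then
    echo "match: $local_sum"
  else
    echo "MISMATCH: local=$local_sum tika=${tika_sum:-none}" >&2
    return 1
  fi
}
```

usage would be along the lines of `check_tika_digest ~/Get_Started_With_Smallpdf.pdf`; it returns non-zero when the digests differ or when no digest is found in the response.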
but that's NOT testing the fail scenario.
THAT scenario is email send/receive -> dovecot -> dovecot fts-tika plugin ->
tika-server.
config'ing dovecot to use fts-tika scanning
fts_tika = http://127.0.0.1:9998/tika/
& generate verbose debug logs
mail_debug = yes
when I exec that send/receive -- from, e.g., an external gmail account to my
server -- I see the attachment handoff. 1st, as sent from dovecot fts-tika,
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: queue http://127.0.0.1:9998: Connection to peer 127.0.0.1:9998
claimed request [Req1: PUT http://127.0.0.1:9998/tika/]
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: conn 127.0.0.1:9998 [1]: Claimed request [Req1: PUT
http://127.0.0.1:9998/tika/]
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Sent header
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
5562, buffered=5570)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: peer 127.0.0.1:9998: No more requests to service for this peer
(1 connections exist, 0 pending)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
6048, buffered=6056)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Send more (sent
3409, buffered=3416)
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished sending
payload
2022-07-20 11:07:02
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Waiting for request
to finish
==> /var/log/dovecot/dovecot-info.log <==
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Info:
sieve: msgid=<[email protected]>: stored mail into
mailbox 'INBOX'
==> /var/log/dovecot/dovecot-debug.log <==
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug:
sieve: msgid=<[email protected]>: Finish implicit keep
action
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug:
sieve: msgid=<[email protected]>: Finishing actions
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>: Debug:
sieve: msgid=<[email protected]>: Finished executing
result (final, status=ok, keep=yes)
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>:
Debug: sieve: multi-script: Sequence finished (status=ok, keep=yes)
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>:
Debug: sieve: multi-script: Destroy
2022-07-20 11:07:02 lmtp([email protected])<wNv7DRUa2GInmgAA+IOfAw>:
Debug: lmtp-server: conn unix:pid=39462,uid=89 [1]: rcpt [email protected]:
duplicate db: Cleanup
2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn
unix:pid=39462,uid=89 [1]: rcpt [email protected]: User session is finished
2022-07-20 11:07:02 lmtp(39463): Debug: lmtp-server: conn
unix:pid=39462,uid=89 [1]: rcpt [email protected]: dict(file): dict destroyed
==> /var/log/dovecot/dovecot-info.log <==
2022-07-20 11:07:02 lmtp(39463): Info: Disconnect from local: Logged
out (state=READY)
==> /var/log/dovecot/dovecot-debug.log <==
2022-07-20 11:07:06
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: conn 127.0.0.1:9998 [1]: Got 200 response for request [Req1: PUT
http://127.0.0.1:9998/tika/]: OK (took 3327 ms + 217 ms in queue)
2022-07-20 11:07:06
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: conn 127.0.0.1:9998 [1]: Response payload stream destroyed (20
ms after initial response)
2022-07-20 11:07:06
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Finished
2022-07-20 11:07:06
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: queue http://127.0.0.1:9998: Dropping request [Req1: PUT
http://127.0.0.1:9998/tika/]
2022-07-20 11:07:06
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: host 127.0.0.1: Host is idle (timeout = 100 msecs)
2022-07-20 11:07:06
indexer-worker([email protected])<wNv7DRUa2GInmgAA+IOfAw:xo49IxYa2GIqmgAA+IOfAw>:
Debug: http-client: request [Req1: PUT http://127.0.0.1/tika/]: Free (requests
left=1)
at this point, dovecot's 'done' with the attachment as far as tika is involved,
and it's 'in' tika-backend's control; dovecot DOES of course continue to
process, and ultimately deliver, the email+attachment to my inbox. where, as
reported earlier, I can verify that the RECEIVED attachment is identical in
size/sha256sum to the original.
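the original-vs-received check (size plus sha256sum) can be wrapped in a small helper so both values get compared in one go. a minimal sketch, assuming coreutils `sha256sum` and `wc` are available; `fingerprint` and `same_file` are illustrative names, not part of dovecot or tika:

```shell
# print "<size-in-bytes> <sha256> <path>" for each file given
fingerprint() {
  for f in "$@"; do
    size=$(wc -c < "$f" | tr -d ' ')
    sum=$(sha256sum "$f" | cut -d' ' -f1)
    printf '%s %s %s\n' "$size" "$sum" "$f"
  done
}

# succeed only if two files agree on both size and sha256
same_file() {
  a=$(fingerprint "$1" | cut -d' ' -f1,2)
  b=$(fingerprint "$2" | cut -d' ' -f1,2)
  [ "$a" = "$b" ]
}
```

e.g. `same_file original.pdf received.pdf && echo identical`, where both paths are placeholders for the sent file and the attachment saved from the delivered mail.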
i do see the handoff to tika-backend,
...
DEBUG [qtp485047320-28] 11:01:15,794
org.eclipse.jetty.server.HttpChannel REQUEST for //127.0.0.1:9998/tika/ on
HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE rs=BLOCKING
os=OPEN is=IDLE awp=false se=false i=true
al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=1}
PUT //127.0.0.1:9998/tika/ HTTP/1.1
Host: 127.0.0.1:9998
Date: Wed, 20 Jul 2022 15:01:15 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Content-Type: application/pdf
Content-Disposition: attachment;
filename="Get_Started_With_Smallpdf.pdf"
Accept: text/plain
DEBUG [qtp485047320-28] 11:01:15,799 org.eclipse.jetty.server.HttpConnection
HttpConnection@7d858986::SocketChannelEndPoint@7f055fae{l=/127.0.0.1:9998,r=/127.0.0.1:59150,OPEN,fill=-,flush=-,to=43/200000}{io=0/0,kio=0,kro=1}->HttpConnection@7d858986[p=HttpParser{s=CHUNKED_CONTENT,0
of
-1},g=HttpGenerator@127a4f1e{s=START}]=>HttpChannelOverHttp@2ab20b5f{s=HttpChannelState@1dd88b59{s=IDLE
rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true
al=0},r=1,c=false/false,a=IDLE,uri=//127.0.0.1:9998/tika/,age=6} parsed true
HttpParser{s=CHUNKED_CONTENT,0 of -1}
...
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class
org.apache.cxf.common.logging.Slf4jLogger
DEBUG [qtp485047320-31] 11:07:03,442 org.apache.cxf.transport.http.Headers
Request Headers: {Accept=[text/plain], Authorization=[***], connection=[keep-alive],
Content-Disposition=[attachment; filename="Get_Started_With_Smallpdf.pdf"],
content-type=[application/pdf], Date=[Wed, 20 Jul 2022 15:07:02 GMT],
Host=[127.0.0.1:9998], Proxy-Authorization=[***], transfer-encoding=[chunked]}
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class
org.apache.cxf.common.logging.Slf4jLogger
...
but no trace, that I can find in any log, of sha256sum generated by tika, as in
the curl case above.
THAT is the necessary bit here -- getting at, and confirming, the
size/sha256sum of what Tika has received from dovecot's fts-tika handoff.
how/where to get tika to spit out THAT info?
either as a loggable/logged response to dovecot's http-client connection, on
successful handoff,
in its own logs,
or by just trapping the file and checking manually?
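one difference visible in the logs above: dovecot's PUT goes to /tika/ with `Accept: text/plain` and `Transfer-Encoding: chunked`, while the plain `curl -T` test sent neither header. my understanding (worth confirming against the tika-server docs) is that a text/plain response carries only the extracted text, so the `X-TIKA:digest:SHA256` meta tag from the XHTML output would not be in the body dovecot gets back. a minimal sketch for replaying the request by hand with the logged headers; `put_like_dovecot` and the `CURL` override are hypothetical, and the content type is hard-coded to match the logged request:

```shell
# Replay a PUT to tika-server using the request headers dovecot's
# http-client was logged sending (Accept, Content-Type, chunked transfer,
# Content-Disposition).
# $1 = file, $2 = endpoint; CURL is a hypothetical hook for dry runs.
put_like_dovecot() {
  file=$1
  url=${2:-http://127.0.0.1:9998/tika/}
  # --data-binary (not -T) so curl does not append the file name to a URL
  # that ends in "/"; -v shows request and response headers on stderr
  "${CURL:-curl}" -sS -v -X PUT \
    -H 'Accept: text/plain' \
    -H 'Content-Type: application/pdf' \
    -H 'Transfer-Encoding: chunked' \
    -H "Content-Disposition: attachment; filename=\"$(basename "$file")\"" \
    --data-binary "@$file" "$url"
}
```

running `put_like_dovecot ~/Get_Started_With_Smallpdf.pdf` against the manually started server should show whether the same request shape, sent from curl instead of dovecot, behaves any differently; `CURL=echo put_like_dovecot somefile.pdf` just prints the arguments without sending anything.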