I'm running tika-server 2.4.1 on a Linux box:
lsb_release -rd
Description: Fedora release 36 (Thirty Six)
Release: 36
uname -rm
5.18.11-200.fc36.x86_64 x86_64
java -version
Picked up JAVA_TOOL_OPTIONS: -Xmx512M
openjdk version "18.0.1" 2022-04-19
OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed mode,
sharing)
ps ax | grep tika-server
1003 ? Ssl 0:12 /usr/bin/java -jar
/srv/webapps/tika/tika-server.jar -c
/usr/local/etc/tika/tika-server-config-custom.xml
1143 ? Sl 0:37 /usr/bin/java -Xms1g -Xmx1g
-Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp
/srv/webapps/tika/tika-server.jar -Dtika.server.id=
org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i -c
/usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile
/tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0
It's invoked from a Dovecot IMAP server instance for attachment parsing:
dovecot --version
2.3.19.1 (9b53102964)
cat dovecot/conf.d/10-master.com
...
plugin {
...
fts_tika = http://127.0.0.1:9998/tika/
}
...
On receipt of an email with a standard example attachment (e.g. the example
PDF at
https://smallpdf.com/edit-pdf
), the journal logs show the message being submitted to Tika, but parsing fails
due to a 'corrupt stream':
Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27] 08:41:27,224
org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,453
org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to
the correct offset, using workaround to read the stream, stream start position:
104315, length: 356, expected end position: 104671
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,457
org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream
due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,730
org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to
the correct offset, using workaround to read the stream, stream start position:
101699, length: 1472, expected end position: 103171
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,735
org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream
due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,742
org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to
the correct offset, using workaround to read the stream, stream start position:
101509, length: 66, expected end position: 101575
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,744
org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream
due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,748
org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to
the correct offset, using workaround to read the stream, stream start position:
2011, length: 2482, expected end position: 4493
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] 08:41:27,752
org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed
(test.pdf)
Jul 15 08:41:27 mx tika[1143]: org.apache.tika.exception.TikaException:
TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@356fdbd7
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.Server.handle(Server.java:516)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException: Page
tree root must be a dictionary
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: ... 37 more
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,767
org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class
org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8,
ContentType: text/plain
Is this likely an issue with tika-server itself, and/or with Java or Dovecot?
What additional diagnostics would help narrow it down?
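
One diagnostic I could try, sketched here as an idea rather than a confirmed fix, is to send the same PDF straight to tika-server and skip Dovecot entirely. If the direct request parses cleanly, the bytes are probably being altered before they reach Tika; if it fails the same way, the problem is more likely in Tika/PDFBox or the file itself. The sketch below assumes the endpoint from the fts_tika setting above and a local copy of the attachment saved as test.pdf; the helper names are made up for illustration, and the headers Dovecot actually sends may differ.

```python
import urllib.request

# Assumption: same endpoint as the fts_tika setting in the Dovecot config.
TIKA_URL = "http://127.0.0.1:9998/tika/"


def looks_like_raw_pdf(data: bytes) -> bool:
    """Cheap sanity check: raw PDF bytes normally start with '%PDF-' and
    have an '%%EOF' marker near the end. A still-encoded (e.g. base64)
    body would fail this check."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that hands the bytes to tika-server unchanged."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send a local file directly to tika-server, bypassing Dovecot."""
    with open(path, "rb") as fh:
        data = fh.read()
    if not looks_like_raw_pdf(data):
        raise ValueError(f"{path} does not look like a raw PDF")
    with urllib.request.urlopen(build_tika_request(data, url), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling extract_text("test.pdf") on the mail server while watching the journal should show whether the same COSParser/FlateFilter messages appear for a direct request.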