i'm running tika-server 2.4.1 on a linux box,

        lsb_release -rd
                Description:    Fedora release 36 (Thirty Six)
                Release:        36

        uname -rm
                5.18.11-200.fc36.x86_64 x86_64

        java -version
                Picked up JAVA_TOOL_OPTIONS: -Xmx512M
                openjdk version "18.0.1" 2022-04-19
                OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
                OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed mode, sharing)


        ps ax | grep tika-server
           1003 ?        Ssl    0:12 /usr/bin/java -jar /srv/webapps/tika/tika-server.jar -c /usr/local/etc/tika/tika-server-config-custom.xml
           1143 ?        Sl     0:37 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp /srv/webapps/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i  -c /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0

it's invoked from a dovecot imap server instance, for attachment parsing,

        dovecot --version
                2.3.19.1 (9b53102964)

        cat dovecot/conf.d/10-master.conf
                ...
                plugin {
                        ...
                        fts_tika = http://127.0.0.1:9998/tika/
                }
                ...

on receipt of an email with an ordinary attachment, e.g. the example pdf at

        https://smallpdf.com/edit-pdf

the journal logs show the message being submitted to tika, but parsing fails
on a 'corrupt stream':

        Jul 15 08:41:27 mx tika[1143]: INFO  [qtp1837533591-27] 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)
        Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 104315, length: 356, expected end position: 104671
        Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
        Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101699, length: 1472, expected end position: 103171
        Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
        Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 101509, length: 66, expected end position: 101575
        Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
        Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 2011, length: 2482, expected end position: 4493
        Jul 15 08:41:27 mx tika[1143]: WARN  [qtp1837533591-27] 08:41:27,752 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed (test.pdf)
        Jul 15 08:41:27 mx tika[1143]: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@356fdbd7
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at java.lang.Thread.run(Thread.java:833) ~[?:?]
        Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException: Page tree root must be a dictionary
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.4.1.jar:2.4.1]
        Jul 15 08:41:27 mx tika[1143]:         ... 37 more
        Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] 08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8, ContentType: text/plain

Is this likely an issue with tika-server itself, or with java and/or dovecot?
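
one way to separate the two might be to take dovecot out of the path and PUT
the same pdf straight to the tika endpoint. a minimal python sketch, stdlib
only: `build_tika_request` and `submit` are illustrative names, not part of
tika or dovecot, and the url is simply copied from the fts_tika setting above.

```python
import urllib.request

# Assumption: same endpoint dovecot is configured to use (fts_tika above).
TIKA_URL = "http://127.0.0.1:9998/tika/"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that carries the raw PDF bytes as the body."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def submit(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to tika and return the extracted text.

    Non-2xx responses raise urllib.error.HTTPError, so a parse failure on
    the server side shows up as an exception here.
    """
    with open(path, "rb") as fh:
        req = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

if `submit("test.pdf")` on the original download returns text cleanly, that
would point at the bytes dovecot submits rather than at tika or java; if it
fails the same way, the pdf itself or the parser is the more likely suspect.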

What additional diagnostics can help narrow it down?
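
one diagnostic that might help, given the FlateFilter errors: check whether
the pdf bytes are already damaged before tika sees them, e.g. by comparing
the original download with a copy of the attachment saved from the delivered
message. a rough python sketch, stdlib only: `check_pdf_bytes` is a made-up
name, and the regex is a heuristic for finding streams, not a real pdf parser.

```python
import re
import zlib

# "stream" EOL <body> "endstream"; the lookbehind skips the tail of "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def check_pdf_bytes(data: bytes) -> dict:
    """Cheap structural checks on raw PDF bytes (heuristic only)."""
    report = {
        "has_header": b"%PDF-" in data[:1024],
        "has_eof": b"%%EOF" in data[-1024:],
        "flate_streams": 0,
        "flate_failures": 0,
    }
    for match in STREAM_RE.finditer(data):
        body = match.group(1)
        if not body.startswith(b"\x78"):
            continue  # does not look zlib-wrapped; skip rather than guess
        report["flate_streams"] += 1
        inflater = zlib.decompressobj()
        try:
            # Trailing EOL bytes after the zlib data are tolerated here.
            inflater.decompress(body)
        except zlib.error:
            report["flate_failures"] += 1
            continue
        if not inflater.eof:
            # Stream ended before zlib saw its end marker: truncated/damaged.
            report["flate_failures"] += 1
    return report
```

if the original download reports no failures but the copy that went through
the mail path does, the damage would be happening before tika; if both look
clean here, that leans toward the parser side.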
