If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see? The test file works for me with 2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
On Fri, Jul 15, 2022 at 9:41 AM PGNet Dev <[email protected]> wrote: > i'm running tika-server 2.4.1 on a linux box, > > lsb_release -rd > Description: Fedora release 36 (Thirty Six) > Release: 36 > > uname -rm > 5.18.11-200.fc36.x86_64 x86_64 > > java -version > Picked up JAVA_TOOL_OPTIONS: -Xmx512M > openjdk version "18.0.1" 2022-04-19 > OpenJDK Runtime Environment 22.3 (build 18.0.1+10) > OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10, mixed > mode, sharing) > > > ps ax | grep tika-server > 1003 ? Ssl 0:12 /usr/bin/java -jar > /srv/webapps/tika/tika-server.jar -c > /usr/local/etc/tika/tika-server-config-custom.xml > 1143 ? Sl 0:37 /usr/bin/java -Xms1g -Xmx1g > -Dpdfbox.fontcache=/var/tika -Dlog4j2.info -Djava.awt.headless=true -cp > /srv/webapps/tika/tika-server.jar -Dtika.server.id= > org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998 -i -c > /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile > /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0 > > it's invoked from a dovecot imap server instance, for attachment parsing, > > dovecot --version > 2.3.19.1 (9b53102964) > > cat dovecot/conf.d/10-master.com > ... > plugin { > ... > fts_tika = http://127.0.0.1:9998/tika/ > } > ... > > on receipt of an email with a standard attachment/exmaple -- e.g. the > example pdf @ > > https://smallpdf.com/edit-pdf > > , per journal logs, the message is submitted to tika, but fails due to a > 'corrupt stream' > > Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27] > 08:41:27,224 org.apache.tika.server.core.resource.TikaResource /tika > (application/pdf) > Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] > 08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the stream > doesn't point to the correct offset, using workaround to read the stream, > stream start position: 104315, length: 356, expected end position: 104671 > Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] > 08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading > corrupt stream due to a DataFormatException > Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] > 08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the stream > doesn't point to the correct offset, using workaround to read the stream, > stream start position: 101699, length: 1472, expected end position: 103171 > Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] > 08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading > corrupt stream due to a DataFormatException > Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] > 08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the stream > doesn't point to the correct offset, using workaround to read the stream, > stream start position: 101509, length: 66, expected end position: 101575 > Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] > 08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter: stop reading > corrupt stream due to a DataFormatException > Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] > 08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the stream > doesn't point to the correct offset, using workaround to read the stream, > stream start position: 2011, length: 2482, expected end position: 4493 > Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27] > 08:41:27,752 org.apache.tika.server.core.resource.TikaResource tika/: Text > extraction failed (test.pdf) > Jul 15 08:41:27 mx tika[1143]: > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from > org.apache.tika.parser.pdf.PDFParser@356fdbd7 > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.Server.handle(Server.java:516) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > java.lang.Thread.run(Thread.java:833) ~[?:?] > Jul 15 08:41:27 mx tika[1143]: Caused by: java.io.IOException: > Page tree root must be a dictionary > Jul 15 08:41:27 mx tika[1143]: at > org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) > ~[tika-server-standard-2.4.1.jar:2.4.1] > Jul 15 08:41:27 mx tika[1143]: ... 37 more > Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27] > 08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with writing the > data, class > org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8, > ContentType: text/plain > > Is this likely an issue with tika-server itself? &/or java/dovecot? > > What additional diagnostics can help narrow it down? >
