I got it to work with tika-server-standard and this command:
curl -X PUT --data-binary @Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
It returned the extracted text, with no warnings or errors on the console.
Tilman
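For anyone who wants to script the same check, here is a minimal Python sketch of the curl call above, using only the standard library. The function names build_tika_request and extract_text are made up for this example, and the default URL assumes a tika-server listening on localhost:9998 as in the command.

```python
import urllib.request


def build_tika_request(pdf_bytes, base_url="http://localhost:9998"):
    """Build the same request as the curl command: PUT the raw bytes to /tika."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf"},
    )


def extract_text(path, base_url="http://localhost:9998"):
    """Send the file at `path` to tika-server and return the response body as text."""
    with open(path, "rb") as fh:  # binary mode, like curl --data-binary
        request = build_tika_request(fh.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text("Get_Started_With_Smallpdf.pdf") against a running server should print the same text that curl returns. If it does, the server and the file are fine and the problem is more likely in how the client delivers the bytes.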
On 15.07.2022 at 18:01, Tim Allison wrote:
If you curl the test file (GetStartedWithSmallpdf.pdf) against your
tika-server, what do you see? The test file works for me with
2.4.2-SNAPSHOT at least. Are the files getting truncated somehow?
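One rough way to test the truncation idea is to compare size and checksum of the original file with the bytes that actually reach the server, and to look for the usual PDF start and end markers. The sketch below is only a heuristic: the name pdf_integrity_report is invented here, and the 1024-byte windows follow common reader behaviour rather than anything stated in this thread.

```python
import hashlib


def pdf_integrity_report(data):
    """Return basic facts that help spot a truncated or altered PDF."""
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Readers usually accept the header anywhere in the first 1024 bytes.
        "has_header": b"%PDF-" in data[:1024],
        # A complete file normally has %%EOF close to the end.
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }


def same_bytes(original, received):
    """True if both byte strings are identical (same length and SHA-256)."""
    a = pdf_integrity_report(original)
    b = pdf_integrity_report(received)
    return a["size"] == b["size"] and a["sha256"] == b["sha256"]
```

A missing %%EOF marker or a smaller size points to truncation. Equal size with a different checksum suggests the content was changed on the way, for example by a transfer encoding that was not undone.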
On Fri, Jul 15, 2022 at 9:41 AM PGNet Dev wrote:
i'm running tika-server 2.4.1 on a linux box,
lsb_release -rd
Description: Fedora release 36 (Thirty Six)
Release: 36
uname -rm
5.18.11-200.fc36.x86_64 x86_64
java -version
Picked up JAVA_TOOL_OPTIONS: -Xmx512M
openjdk version "18.0.1" 2022-04-19
OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10,
mixed mode, sharing)
ps ax | grep tika-server
1003 ? Ssl 0:12 /usr/bin/java -jar
/srv/webapps/tika/tika-server.jar -c
/usr/local/etc/tika/tika-server-config-custom.xml
1143 ? Sl 0:37 /usr/bin/java -Xms1g -Xmx1g
-Dpdfbox.fontcache=/var/tika -Dlog4j2.info
-Djava.awt.headless=true -cp /srv/webapps/tika/tika-server.jar
-Dtika.server.id=
org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998
-i -c /usr/local/etc/tika/tika-server-config-custom.xml
-forkedStatusFile
/tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0
it's invoked from a dovecot imap server instance, for attachment
parsing,
dovecot --version
2.3.19.1 (9b53102964)
cat dovecot/conf.d/10-master.conf
...
plugin {
...
fts_tika = http://127.0.0.1:9998/tika/
}
...
on receipt of an email with a standard example attachment, e.g. the
example pdf at
https://smallpdf.com/edit-pdf
the journal logs show that the message is submitted to tika, but
parsing fails due to a 'corrupt stream':
Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27]
08:41:27,224 org.apache.tika.server.core.resource.TikaResource
/tika (application/pdf)
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the
stream doesn't point to the correct offset, using workaround to
read the stream, stream start position: 104315, length: 356,
expected end position: 104671
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter:
stop reading corrupt stream due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the
stream doesn't point to the correct offset, using workaround to
read the stream, stream start position: 101699, length: 1472,
expected end position: 103171
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter:
stop reading corrupt stream due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the
stream doesn't point to the correct offset, using workaround to
read the stream, stream start position: 101509, length: 66,
expected end position: 101575
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter:
stop reading corrupt stream due to a DataFormatException
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the
stream doesn't point to the correct offset, using workaround to
read the stream, stream start position: 2011, length: 2482,
expected end position: 4493
Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
08:41:27,752 org.apache.tika.server.core.resource.TikaResource
tika/: Text extraction failed (test.pdf)
Jul 15 08:41:27 mx tika[1143]:
org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.pdf.PDFParser@356fdbd7
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.Server.handle(Server.java:516)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
java.lang.Thread.run(Thread.java:833) ~[?:?]
Jul 15 08:41:27 mx tika[1143]: Caused by:
java.io.IOException: Page tree root must be a dictionary
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-2.4.1.jar:2.4.1]
Jul 15 08:41:27 mx tika[1143]: ... 37 more
Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with
writing the data, class
org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8,
ContentType: text/plain
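The COSParser warnings in the log above each report a stream start, a declared length and an expected end position, where the end is simply start plus length. A small Python sketch like the following can pull those numbers out of the wrapped journal text and show which bytes follow each stream in a given copy of the file. The names stream_ranges and bytes_after_stream are invented for this example, and the regular expression assumes the message wording stays exactly as logged here.

```python
import re

STREAM_WARNING = re.compile(
    r"stream start position: (\d+), length: (\d+), expected end position: (\d+)"
)


def stream_ranges(journal_text):
    """Return (start, length, expected_end) for every COSParser stream warning."""
    # The journal wraps long lines, so collapse all whitespace runs first.
    flat = " ".join(journal_text.split())
    return [tuple(int(g) for g in m.groups()) for m in STREAM_WARNING.finditer(flat)]


def bytes_after_stream(pdf_bytes, start, length, window=16):
    """Return the bytes that follow the declared end of a stream."""
    end = start + length
    return pdf_bytes[end:end + window]
```

In an intact PDF the bytes after a stream normally hold an end-of-line marker followed by the keyword endstream. Running bytes_after_stream on the original file and on the bytes the server received, at the offsets from the log, may show where the two copies start to differ.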
Is this likely an issue with tika-server itself, and/or with java or dovecot?
What additional diagnostics can help narrow it down?
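One possible diagnostic is to record exactly what the mail server sends before Tika parses it. The Python sketch below is a stand-in HTTP endpoint that accepts PUT requests, stores the body and logs its size and SHA-256. It is an illustration only: the port 9999, the names CaptureHandler, captured and run, and the idea of temporarily pointing fts_tika at it are all assumptions, not something taken from dovecot or Tika documentation.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

captured = []  # one dict per request received


class CaptureHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 lets the base class answer "Expect: 100-continue" if a client sends it.
    protocol_version = "HTTP/1.1"

    def _read_body(self):
        """Read the request body, with Content-Length or chunked encoding."""
        if "chunked" in self.headers.get("Transfer-Encoding", "").lower():
            body = b""
            while True:
                size_line = self.rfile.readline()
                if not size_line:  # connection closed early
                    return body
                size = int(size_line.split(b";")[0].strip(), 16)
                if size == 0:
                    # Skip optional trailer lines up to the final blank line.
                    while self.rfile.readline().strip():
                        pass
                    return body
                body += self.rfile.read(size)
                self.rfile.readline()  # CRLF that ends each chunk
        return self.rfile.read(int(self.headers.get("Content-Length", 0)))

    def do_PUT(self):
        body = self._read_body()
        captured.append({
            "path": self.path,
            "content_type": self.headers.get("Content-Type"),
            "size": len(body),
            "sha256": hashlib.sha256(body).hexdigest(),
            "body": body,
        })
        # Reply with an empty text body so the client sees a successful request.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet; results are in `captured`


def run(host="127.0.0.1", port=9999):
    """Serve until interrupted; port 9999 is an arbitrary choice for this sketch."""
    ThreadingHTTPServer((host, port), CaptureHandler).serve_forever()
```

With this running and the client pointed at it for one test message, the recorded size and checksum can be compared with the original attachment. If they differ, the bytes are being altered before they reach tika-server, and the captured body can be written to disk for a closer look.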