I got it to work with tika-server-standard and

curl -X PUT --data-binary @Get_Started_With_Smallpdf.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

and got a text and no nasty stuff on the console.

Tilman

Am 15.07.2022 um 18:01 schrieb Tim Allison:
If you curl the test file (GetStartedWithSmallpdf.pdf) against your tika-server, what do you see?  The test file works for me with 2.4.2-SNAPSHOT at least.  Are the files getting truncated somehow?



On Fri, Jul 15, 2022 at 9:41 AM PGNet Dev <[email protected]> wrote:

      i'm running tika-server 2.4.1 on a linux box,

            lsb_release -rd
                    Description:    Fedora release 36 (Thirty Six)
                    Release:        36

            uname -rm
                    5.18.11-200.fc36.x86_64 x86_64

            java -version
                    Picked up JAVA_TOOL_OPTIONS: -Xmx512M
                    openjdk version "18.0.1" 2022-04-19
                    OpenJDK Runtime Environment 22.3 (build 18.0.1+10)
                    OpenJDK 64-Bit Server VM 22.3 (build 18.0.1+10,
    mixed mode, sharing)


            ps ax | grep tika-server
               1003 ?        Ssl    0:12 /usr/bin/java -jar
    /srv/webapps/tika/tika-server.jar -c
    /usr/local/etc/tika/tika-server-config-custom.xml
               1143 ?        Sl     0:37 /usr/bin/java -Xms1g -Xmx1g
    -Dpdfbox.fontcache=/var/tika -Dlog4j2.info
    -Djava.awt.headless=true -cp /srv/webapps/tika/tika-server.jar
    -Dtika.server.id <http://Dtika.server.id>=
    org.apache.tika.server.core.TikaServerProcess -h 127.0.0.1 -p 9998
    -i  -c /usr/local/etc/tika/tika-server-config-custom.xml
    -forkedStatusFile
    /tmp/apache-tika-server-forked-tmp-9638775429532759882 -numRestarts 0

    it's invoked from a dovecot imap server instance, for attachment
    parsing,

            dovecot --version
                    2.3.19.1 (9b53102964)

            cat dovecot/conf.d/10-master.com <http://10-master.com>
                    ...
                    plugin {
                            ...
                            fts_tika = http://127.0.0.1:9998/tika/
                    }
                    ...

    on receipt of an email with a standard attachment/exmaple -- e.g.
    the example pdf @

    https://smallpdf.com/edit-pdf

    , per journal logs, the message is submitted to tika, but fails
    due to a 'corrupt stream'

            Jul 15 08:41:27 mx tika[1143]: INFO [qtp1837533591-27]
    08:41:27,224 org.apache.tika.server.core.resource.TikaResource
    /tika (application/pdf)
            Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
    08:41:27,453 org.apache.pdfbox.pdfparser.COSParser The end of the
    stream doesn't point to the correct offset, using workaround to
    read the stream, stream start position: 104315, length: 356,
    expected end position: 104671
            Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
    08:41:27,457 org.apache.pdfbox.filter.FlateFilter FlateFilter:
    stop reading corrupt stream due to a DataFormatException
            Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
    08:41:27,730 org.apache.pdfbox.pdfparser.COSParser The end of the
    stream doesn't point to the correct offset, using workaround to
    read the stream, stream start position: 101699, length: 1472,
    expected end position: 103171
            Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
    08:41:27,735 org.apache.pdfbox.filter.FlateFilter FlateFilter:
    stop reading corrupt stream due to a DataFormatException
            Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
    08:41:27,742 org.apache.pdfbox.pdfparser.COSParser The end of the
    stream doesn't point to the correct offset, using workaround to
    read the stream, stream start position: 101509, length: 66,
    expected end position: 101575
            Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
    08:41:27,744 org.apache.pdfbox.filter.FlateFilter FlateFilter:
    stop reading corrupt stream due to a DataFormatException
            Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
    08:41:27,748 org.apache.pdfbox.pdfparser.COSParser The end of the
    stream doesn't point to the correct offset, using workaround to
    read the stream, stream start position: 2011, length: 2482,
    expected end position: 4493
            Jul 15 08:41:27 mx tika[1143]: WARN [qtp1837533591-27]
    08:41:27,752 org.apache.tika.server.core.resource.TikaResource
    tika/: Text extraction failed (test.pdf)
            Jul 15 08:41:27 mx tika[1143]:
    org.apache.tika.exception.TikaException: TIKA-198: Illegal
    IOException from org.apache.tika.parser.pdf.PDFParser@356fdbd7
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.server.Server.handle(Server.java:516)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.io
    
<http://org.eclipse.jetty.io>.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.io
    <http://org.eclipse.jetty.io>.FillInterest.fillable(FillInterest.java:105)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.eclipse.jetty.io
    
<http://org.eclipse.jetty.io>.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    java.lang.Thread.run(Thread.java:833) ~[?:?]
            Jul 15 08:41:27 mx tika[1143]: Caused by:
    java.io.IOException: Page tree root must be a dictionary
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1230)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:284)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         at
    org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ~[tika-server-standard-2.4.1.jar:2.4.1]
            Jul 15 08:41:27 mx tika[1143]:         ... 37 more
            Jul 15 08:41:27 mx tika[1143]: ERROR [qtp1837533591-27]
    08:41:27,767 org.apache.cxf.jaxrs.utils.JAXRSUtils Problem with
    writing the data, class
    
org.apache.tika.server.core.resource.TikaResource$$Lambda$337/0x0000000800eabbf8,
    ContentType: text/plain

    Is this likely an issue with tika-server itself? &/or java/dovecot?

    What additional diagnostics can help narrow it down?

Reply via email to