Re: [External] Re: Partial OCR extractions under memory pressure

Harvey, Robin Fri, 08 Apr 2022 01:43:32 -0700

Thanks Tim, that's really interesting and gives us something to work with.

On Thu, Apr 7, 2022 at 7:13 PM Tim Allison <[email protected]> wrote:


> I looked more closely at this and did some testing with our MockParser
> throwing an NPE.  I then stumbled across earlier documentation that
> re-confirmed my findings:
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared
>
> In looking more closely at your stacktrace, we are letting that
> exception percolate through the PDFParser.  We are not incorrectly
> catching it.  The problem is that with any exception in the /tika
> endpoint, if the exception happens after a certain amount of data has
> been written, then our endpoint returns 200 and starts streaming the
> results.  You won't know through the client that there was an
> exception...for any exception after a certain amount of data has been
> written.  This is true for the timeouts in tesseract and any other NPE
> or other exception thrown during the parse.
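
A toy model of the behaviour described above may make the status-code problem easier to see. This is only a sketch, not Tika's actual code: `ToyResponse`, `serve`, and the `buffer_size` threshold are hypothetical names invented for illustration, standing in for "a certain amount of data has been written".

```python
class ToyResponse:
    """Toy HTTP response: the status is locked in at the first flush."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size  # hypothetical flush threshold
        self.status = None              # None means headers not sent yet
        self.sent = []                  # chunks already on the wire
        self._pending = ""              # text buffered but not yet sent

    def write(self, text):
        self._pending += text
        if len(self._pending) >= self.buffer_size:
            self._flush()

    def _flush(self):
        if self.status is None:
            self.status = 200           # headers go out with the first chunk
        self.sent.append(self._pending)
        self._pending = ""

    def fail(self):
        # An error can only become a 500 if nothing has been sent yet.
        if self.status is None:
            self.status = 500
            self._pending = ""
        # Otherwise the client has already seen 200; the body just stops.

    def finish(self):
        if self._pending or self.status is None:
            self._flush()


def serve(pages, fail_on_page, buffer_size):
    """Stream pages; simulate a parse exception at index fail_on_page."""
    resp = ToyResponse(buffer_size)
    for index, page in enumerate(pages):
        if index == fail_on_page:
            resp.fail()
            return resp
        resp.write(page)
    resp.finish()
    return resp
```

In this model, a failure on an early page (before the first flush) yields a 500, while the same failure on a later page yields a 200 with a truncated body, which matches the partial extractions reported in this thread.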
>
> If you want to guarantee that you see exceptions, you can use the json
> output option of the /tika endpoint (send "accept: application/json"
> as a header).  The downside to that is that it buffers the extracted
> text in memory and then writes it all to json and returns it.  So
> there's a tradeoff.
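
A minimal client-side sketch of that approach, assuming the server is at localhost:9998 as in the request quoted further down and that the exception appears under the `X-TIKA:EXCEPTION:container_exception` key as in the response below. The names `check_tika_json`, `extract`, and `ParseFailed` are hypothetical helpers, not part of any Tika client library.

```python
import json
import urllib.request

# Key name taken from the JSON response quoted in this thread.
CONTAINER_EXCEPTION_KEY = "X-TIKA:EXCEPTION:container_exception"


class ParseFailed(Exception):
    """Raised when the JSON body reports a parse exception despite a 200."""


def check_tika_json(body):
    """Return the extracted content, or raise if a stack trace is present."""
    metadata = json.loads(body)
    trace = metadata.get(CONTAINER_EXCEPTION_KEY)
    if isinstance(trace, list):          # tolerate a multi-valued field
        trace = trace[0] if trace else None
    if trace:
        # The first line of the trace carries the exception type and message.
        raise ParseFailed(trace.splitlines()[0])
    return metadata.get("X-TIKA:content", "")


def extract(path, url="http://localhost:9998/tika"):
    """PUT a file to /tika asking for JSON, then check for an exception."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        url, data=data, method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return check_tika_json(resp.read().decode("utf-8"))
```

Since the status code is still 200 in this mode, the check on the response body is what turns a failed parse into a visible error on the client side.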
>
> With the json output, I get a 200, but the stacktrace is returned in
> the response:
>
>
> {"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"author":"Nikolai
>
> Lobachevsky","X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"X-TIKA:EXCEPTION:container_exception":"org.apache.tika.exception.TikaException:
> Unexpected RuntimeException from
> org.apache.tika.parser.mock.MockParser@785b3ba9\n\tat
>
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)\n\tat
>
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\tat
>
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:188)\n\tat
>
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)\n\tat
> org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)\n\tat
>
> org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:347)\n\tat
>
> org.apache.tika.server.core.resource.TikaResource.parseToMetadata(TikaResource.java:598)\n\tat
>
> org.apache.tika.server.core.resource.TikaResource.getJson(TikaResource.java:571)\n\tat
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
> java.lang.reflect.Method.invoke(Method.java:498)\n\tat
>
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)\n\tat
>
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)\n\tat
> org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)\n\tat
> org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)\n\tat
>
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat
>
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat
>
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)\n\tat
>
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat
>
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)\n\tat
>
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)\n\tat
>
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)\n\tat
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)\n\tat
>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat
>
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)\n\tat
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)\n\tat
>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat
> org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat
>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)\n\tat
>
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)\n\tat
> java.lang.Thread.run(Thread.java:748)\nCaused by:
> java.lang.NullPointerException: null pointer message\n\tat
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)\n\tat
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat
> org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:418)\n\tat
> org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:364)\n\tat
>
> org.apache.tika.parser.mock.MockParser.executeAction(MockParser.java:152)\n\tat
> org.apache.tika.parser.mock.MockParser.parse(MockParser.java:133)\n\tat
>
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\t... 41
> more\n","X-TIKA:digest:SHA1":"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP","X-TIKA:content":"<html
> xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta
> name=\"X-TIKA:Parsed-By\"
> content=\"org.apache.tika.parser.DefaultParser\" />\n<meta
> name=\"X-TIKA:Parsed-By\"
> content=\"org.apache.tika.parser.mock.MockParser\" />\n<meta
> name=\"author\" content=\"Nikolai Lobachevsky\" />\n<meta
> name=\"X-TIKA:digest:SHA1\"
> content=\"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP\" />\n<meta
> name=\"X-TIKA:digest:MD5\"
> content=\"0ce160383b1fc9add7b82819d6b7bb01\" />\n<meta
> name=\"Content-Type\" content=\"application/mock+xml\"
> />\n<title></title>\n</head>\n<body><p>some contentsome contentsome
> contentsome contentsome contentsome contentsome contentsome
> contentsome contentsome
>
> On Thu, Apr 7, 2022 at 1:13 PM Tim Allison <[email protected]> wrote:
> >
> > Thank you.  This is a tricky one.  That endpoint streams output.  It
> > doesn't buffer the results and then return results.  That means that
> > we have to return 200 and start streaming the extracted content.
> >
> > That said, I can look at percolating the exception through the
> > PDFParser through the handler so that you'll get an exception from the
> > server, as with any other parse exception.
> >
> > Please open an issue on our JIRA.
> >
> > Fellow devs, what do you think?
> >
> > On Thu, Apr 7, 2022 at 12:35 PM Harvey, Robin
> > <[email protected]> wrote:
> > >
> > > The REST endpoint we're using is /rmeta/text, not totally sure which
> > > handler TBH.  The request looks like this:
> > >
> > > PUT /rmeta/text HTTP/1.1
> > > Host: localhost:9998
> > > User-Agent: python-requests/2.27.1
> > > Accept-Encoding: gzip, deflate
> > > Accept: */*
> > > Connection: keep-alive
> > > X-Tika-PDFOcrStrategy: ocr_only
> > > X-Tika-Skip-Embedded: true
> > > Content-Length: 259385
> > >
> > >
> > > On Thu, Apr 7, 2022 at 2:46 PM Tim Allison <[email protected]> wrote:
> > >>
> > >> Yes, I agree, I think.  Which endpoint are you using, /tika or /rmeta?
> > >> Which handler, xhtml or text?
> > >>
> > >>
> > >> The underlying issue is that we catch and hold on to IOExceptions per
> > >> page in PDFs.  We report them in the metadata in /rmeta, but those
> > >> won't come through in /tika.
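
A rough sketch of how a client might look for such reports in an /rmeta response, assuming the response has been parsed into a list of metadata objects (one per document, container first). Which exact keys the server emits for per-page OCR failures is not confirmed in this thread, so the sketch simply scans for any key that starts with the `X-TIKA:EXCEPTION:` prefix seen elsewhere in this thread; `collect_exceptions` is a hypothetical helper name.

```python
# Prefix taken from the exception key shown elsewhere in this thread;
# the full set of keys under it is an assumption, not verified here.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def collect_exceptions(records):
    """Return (record index, key, first line of value) for exception keys.

    `records` is the parsed JSON list returned by /rmeta.
    """
    found = []
    for index, record in enumerate(records):
        for key, value in record.items():
            if not key.startswith(EXCEPTION_PREFIX):
                continue
            # Metadata values may be a single string or a list of strings.
            values = value if isinstance(value, list) else [value]
            for item in values:
                first_line = item.splitlines()[0] if item else ""
                found.append((index, key, first_line))
    return found
```

A caller could treat any non-empty result as a failed or partial extraction and retry the document, rather than trusting the 2xx status alone.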
> > >>
> > >> On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > We've hit an issue with the Tika server recently where large PDF
> > >> > documents are only partially extracted when the server is under heavy
> > >> > load.  For example, a 70 page PDF which is normally extracted fine
> > >> > suddenly returns as just 4 or 5 pages.  We use the
> > >> > X-Tika-PDFOcrStrategy header to force OCR and we have the timeout set
> > >> > to 600 seconds in the XML configuration file.  When a partial
> > >> > extraction happens, we get a 2xx response as normal, so it's impossible
> > >> > to tell if the extraction actually worked or not.  By observing the
> > >> > server logs whilst stress testing the Docker container, I can see that
> > >> > the following exception is closely correlated with the error.
> > >> >
> > >> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> > >> > at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > ...snip...
> > >> > Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
> > >> > at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >
> > >> > Would you consider this to be a bug?  In my view it would be much
> > >> > better to get some kind of 5XX HTTP response when this error occurs.
> > >> >
> > >> > Thanks,
> > >> > --Robin
>
