Thanks Tim, that's really interesting and gives us something to work with.

On Thu, Apr 7, 2022 at 7:13 PM Tim Allison <[email protected]> wrote:
> I looked more closely at this and did some testing with our MockParser
> throwing an NPE. I then stumbled across earlier documentation that
> re-confirmed my findings:
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared
>
> In looking more closely at your stacktrace, we are letting that
> exception percolate through the PDFParser. We are not incorrectly
> catching it. The problem is that with any exception in the /tika
> endpoint, if the exception happens after a certain amount of data has
> been written, then our endpoint returns 200 and starts streaming the
> results. You won't know through the client that there was an
> exception...for any exception after a certain amount of data has been
> written. This is true for the timeouts in tesseract and any other NPE
> or other exception thrown during the parse.
>
> If you want to guarantee that you see exceptions, you can use the json
> output option of the /tika endpoint (send "accept: application/json"
> as a header). The downside to that is that it buffers the extracted
> text in memory and then writes it all to json and returns it. So
> there's a tradeoff.
>
> With the json output, I get a 200, but the stacktrace is returned in
> the response:
>
> {"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"author":"Nikolai
> Lobachevsky","X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"X-TIKA:EXCEPTION:container_exception":"org.apache.tika.exception.TikaException:
> Unexpected RuntimeException from
> org.apache.tika.parser.mock.MockParser@785b3ba9\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\tat
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:188)\n\tat
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)\n\tat
> org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)\n\tat
> org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:347)\n\tat
> org.apache.tika.server.core.resource.TikaResource.parseToMetadata(TikaResource.java:598)\n\tat
> org.apache.tika.server.core.resource.TikaResource.getJson(TikaResource.java:571)\n\tat
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
> java.lang.reflect.Method.invoke(Method.java:498)\n\tat
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)\n\tat
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)\n\tat
> org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)\n\tat
> org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)\n\tat
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)\n\tat
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)\n\tat
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)\n\tat
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)\n\tat
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat
> org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)\n\tat
> java.lang.Thread.run(Thread.java:748)\nCaused by:
> java.lang.NullPointerException: null pointer message\n\tat
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat
> org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:418)\n\tat
> org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:364)\n\tat
> org.apache.tika.parser.mock.MockParser.executeAction(MockParser.java:152)\n\tat
> org.apache.tika.parser.mock.MockParser.parse(MockParser.java:133)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\t...
> 41 more\n","X-TIKA:digest:SHA1":"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP","X-TIKA:content":"<html
> xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta
> name=\"X-TIKA:Parsed-By\"
> content=\"org.apache.tika.parser.DefaultParser\" />\n<meta
> name=\"X-TIKA:Parsed-By\"
> content=\"org.apache.tika.parser.mock.MockParser\" />\n<meta
> name=\"author\" content=\"Nikolai Lobachevsky\" />\n<meta
> name=\"X-TIKA:digest:SHA1\"
> content=\"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP\" />\n<meta
> name=\"X-TIKA:digest:MD5\"
> content=\"0ce160383b1fc9add7b82819d6b7bb01\" />\n<meta
> name=\"Content-Type\" content=\"application/mock+xml\"
> />\n<title></title>\n</head>\n<body><p>some contentsome contentsome
> contentsome contentsome contentsome contentsome contentsome
> contentsome contentsome
>
> On Thu, Apr 7, 2022 at 1:13 PM Tim Allison <[email protected]> wrote:
> >
> > Thank you. This is a tricky one. That endpoint streams output. It
> > doesn't buffer the results and then return results. That means that
> > we have to return 200 and start streaming the extracted content.
> >
> > That said, I can look at percolating the exception through the
> > PDFParser through the handler so that you'll get an exception from the
> > server, as with any other parse exception.
> >
> > Please open an issue on our JIRA.
> >
> > Fellow devs, what do you think?
> >
> > On Thu, Apr 7, 2022 at 12:35 PM Harvey, Robin
> > <[email protected]> wrote:
> > >
> > > The REST endpoint we're using is /rmeta/text, not totally sure which handler TBH.
> > > The request looks like this:
> > >
> > > PUT /rmeta/text HTTP/1.1
> > > Host: localhost:9998
> > > User-Agent: python-requests/2.27.1
> > > Accept-Encoding: gzip, deflate
> > > Accept: */*
> > > Connection: keep-alive
> > > X-Tika-PDFOcrStrategy: ocr_only
> > > X-Tika-Skip-Embedded: true
> > > Content-Length: 259385
> > >
> > > On Thu, Apr 7, 2022 at 2:46 PM Tim Allison <[email protected]> wrote:
> > >>
> > >> Y. I agree, I think. Which endpoint are you using /tika or /rmeta?
> > >> Which handler, xhtml or text?
> > >>
> > >> The underlying issue is that we catch and hold on to IOExceptions per
> > >> page in PDFs. We report them in the metadata in /rmeta, but those
> > >> won't come through in /tika.
> > >>
> > >> On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > We've hit an issue with the Tika server recently where large PDF
> > >> > documents are only partially extracted when the server is under heavy
> > >> > load. For example, a 70 page PDF which is normally extracted fine suddenly
> > >> > returns as just 4 or 5 pages. We use the X-Tika-PDFOcrStrategy header to
> > >> > force OCR and we have the timeout set to 600 seconds in the XML
> > >> > configuration file. When a partial extraction happens, we get a 2xx
> > >> > response as normal, so it's impossible to tell if the extraction actually
> > >> > worked or not. By observing the server logs whilst stress testing the
> > >> > Docker container, I can see that the following exception is closely
> > >> > correlated with the error.
> > >> >
> > >> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> > >> >     at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> > ...snip...
> > >> > Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
> > >> >     at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >     at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > >> >
> > >> > Would you consider this to be a bug? In my view it would be much
> > >> > better to get some kind of 5XX HTTP response when this error occurs.
> > >> >
> > >> > Thanks,
> > >> > --Robin
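
Since the thread concludes that a 200 status does not by itself mean the parse succeeded, a client that requests JSON output may want to inspect the returned metadata for exception entries. The following is a minimal sketch of that check, not part of Tika or of anyone's actual client. The function name find_tika_exceptions, the sample body, and the idea of matching every key that starts with "X-TIKA:EXCEPTION:" are assumptions made for illustration; only the "X-TIKA:EXCEPTION:container_exception" key itself comes from the response quoted above. Handling a list of metadata objects as well as a single object is also an assumption, intended to cover /rmeta-style output.

```python
import json

# Prefix shared by the "X-TIKA:EXCEPTION:container_exception" key quoted in
# this thread. Treating every key with this prefix as an error is an assumption.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def find_tika_exceptions(payload):
    """Return (record_index, key, value) for each exception entry in decoded JSON."""
    # Accept either a single metadata object or a list of them (assumed shapes).
    records = payload if isinstance(payload, list) else [payload]
    hits = []
    for index, record in enumerate(records):
        for key, value in record.items():
            if key.startswith(EXCEPTION_PREFIX):
                hits.append((index, key, value))
    return hits


# Shortened, hypothetical stand-in for a response body; not real server output.
body = (
    '{"author": "Nikolai Lobachevsky", '
    '"X-TIKA:EXCEPTION:container_exception": '
    '"org.apache.tika.exception.TikaException: ..."}'
)

for index, key, value in find_tika_exceptions(json.loads(body)):
    # The status code was 200, so the metadata is the only place the failure shows up.
    print(f"record {index}: {key}: {value[:80]}")
```

In a real client the decoded body would come from the HTTP response (for example the parsed JSON of a PUT sent with the "accept: application/json" header mentioned above), and a non-empty result could be treated as a failed or partial extraction and retried or logged.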
