I looked more closely at this and did some testing with our MockParser
throwing an NPE.  I then stumbled across earlier documentation that
re-confirmed my findings:
https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared

Looking more closely at your stacktrace, we are letting that exception
percolate up through the PDFParser; we are not incorrectly catching
it.  The problem is that, for any exception in the /tika endpoint, if
the exception happens after a certain amount of data has been written,
the endpoint has already returned 200 and started streaming the
results.  The client has no way of knowing that there was an
exception.  This is true for Tesseract timeouts, NPEs, and any other
exception thrown during the parse.
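
To make the mechanism concrete, here is a rough Python sketch of the
behavior.  It is a toy model, not Tika's code: StreamingResponse,
fake_parse, handle, and the 16-character buffer are all made up for
illustration.  The point is only that the status goes out with the
first flushed bytes, so a later exception can't change it.

```python
class StreamingResponse:
    """Toy model of a streamed HTTP response (hypothetical, not Tika's implementation)."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.status = None   # None means the status line has not been sent yet
        self.buffer = ""     # text written by the parser but not yet sent
        self.sent = ""       # text already on the wire

    def write(self, text):
        self.buffer += text
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.status is None:
            self.status = 200   # the status is committed with the first bytes
        self.sent += self.buffer
        self.buffer = ""


def fake_parse(response, pages, fail_on_page):
    """Write pages one by one; raise when reaching page number fail_on_page."""
    for number, page in enumerate(pages, start=1):
        if number == fail_on_page:
            raise RuntimeError("parser failed on page %d" % number)
        response.write(page)


def handle(pages, fail_on_page, buffer_size=16):
    """Return (status, body actually sent) for one simulated request."""
    response = StreamingResponse(buffer_size)
    try:
        fake_parse(response, pages, fail_on_page)
        response.flush()
    except RuntimeError:
        if response.status is None:
            # Nothing has been sent yet, so an error status is still possible.
            response.status = 500
            response.buffer = ""
        # Otherwise the 200 is already on the wire and the body is just cut short.
    return response.status, response.sent
```

In this model, a failure on an early page (before the buffer fills)
yields a 500, while a failure on a later page yields a 200 with
truncated content, which matches what the client sees.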

If you want to guarantee that you see exceptions, you can use the JSON
output option of the /tika endpoint (send "Accept: application/json"
as a header).  The downside is that it buffers all of the extracted
text in memory, then writes it out as JSON and returns it.  So
there's a tradeoff.
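
For example, a request like that could be built from Python roughly as
follows.  This is a sketch under assumptions: the helper names are
made up, the host and port are taken from the request quoted below,
and I'm assuming PUT as in that request.

```python
import json
import urllib.request


def build_tika_json_request(document_bytes, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT to /tika that asks for JSON output."""
    return urllib.request.Request(
        url=base_url + "/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def extract_as_json(document_bytes, base_url="http://localhost:9998"):
    """Send the request and decode the JSON body (needs a running server)."""
    request = build_tika_json_request(document_bytes, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Something like extract_as_json(open("doc.pdf", "rb").read()) would
then return the metadata and content as one dict, where doc.pdf is
just a placeholder file name.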

With the json output, I get a 200, but the stacktrace is returned in
the response:

{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"author":"Nikolai
Lobachevsky","X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"X-TIKA:EXCEPTION:container_exception":"org.apache.tika.exception.TikaException:
Unexpected RuntimeException from
org.apache.tika.parser.mock.MockParser@785b3ba9\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:188)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)\n\tat
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)\n\tat
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:347)\n\tat
org.apache.tika.server.core.resource.TikaResource.parseToMetadata(TikaResource.java:598)\n\tat
org.apache.tika.server.core.resource.TikaResource.getJson(TikaResource.java:571)\n\tat
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
java.lang.reflect.Method.invoke(Method.java:498)\n\tat
org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)\n\tat
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)\n\tat
org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)\n\tat
org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)\n\tat
org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)\n\tat
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)\n\tat
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)\n\tat
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)\n\tat
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)\n\tat
java.lang.Thread.run(Thread.java:748)\nCaused by:
java.lang.NullPointerException: null pointer message\n\tat
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)\n\tat 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat
java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat
org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:418)\n\tat
org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:364)\n\tat
org.apache.tika.parser.mock.MockParser.executeAction(MockParser.java:152)\n\tat
org.apache.tika.parser.mock.MockParser.parse(MockParser.java:133)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\t...
41 
more\n","X-TIKA:digest:SHA1":"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP","X-TIKA:content":"<html
xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta
name=\"X-TIKA:Parsed-By\"
content=\"org.apache.tika.parser.DefaultParser\" />\n<meta
name=\"X-TIKA:Parsed-By\"
content=\"org.apache.tika.parser.mock.MockParser\" />\n<meta
name=\"author\" content=\"Nikolai Lobachevsky\" />\n<meta
name=\"X-TIKA:digest:SHA1\"
content=\"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP\" />\n<meta
name=\"X-TIKA:digest:MD5\"
content=\"0ce160383b1fc9add7b82819d6b7bb01\" />\n<meta
name=\"Content-Type\" content=\"application/mock+xml\"
/>\n<title></title>\n</head>\n<body><p>some contentsome contentsome
contentsome contentsome contentsome contentsome contentsome
contentsome contentsome
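
Since the status is 200 either way, the client has to look inside the
JSON.  A minimal sketch of that check, assuming exception keys share
the "X-TIKA:EXCEPTION:" prefix seen in the response above (the only
key I've shown here is container_exception; the helper name is made
up):

```python
# Prefix taken from the "X-TIKA:EXCEPTION:container_exception" key above;
# that other exception keys use the same prefix is an assumption.
TIKA_EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def find_tika_exceptions(metadata):
    """Return the entries of a parsed JSON response that report exceptions."""
    return {
        key: value
        for key, value in metadata.items()
        if key.startswith(TIKA_EXCEPTION_PREFIX)
    }
```

If the returned dict is non-empty, treat the extraction as failed or
partial even though the HTTP status was 200.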

On Thu, Apr 7, 2022 at 1:13 PM Tim Allison <[email protected]> wrote:
>
> Thank you.  This is a tricky one.  That endpoint streams output.  It
> doesn't buffer the results and then return results.  That means that
> we have to return 200 and start streaming the extracted content.
>
> That said, I can look at percolating the exception through the
> PDFParser through the handler so that you'll get an exception from the
> server, as with any other parse exception.
>
> Please open an issue on our JIRA.
>
> Fellow devs, what do you think?
>
> On Thu, Apr 7, 2022 at 12:35 PM Harvey, Robin
> <[email protected]> wrote:
> >
> > The REST endpoint we're using is /rmeta/text, not totally sure which 
> > handler TBH.  The request looks like this:
> >
> > PUT /rmeta/text HTTP/1.1
> > Host: localhost:9998
> > User-Agent: python-requests/2.27.1
> > Accept-Encoding: gzip, deflate
> > Accept: */*
> > Connection: keep-alive
> > X-Tika-PDFOcrStrategy: ocr_only
> > X-Tika-Skip-Embedded: true
> > Content-Length: 259385
> >
> >
> > On Thu, Apr 7, 2022 at 2:46 PM Tim Allison <[email protected]> wrote:
> >>
> >> Yes, I agree, I think.  Which endpoint are you using, /tika or /rmeta?
> >> Which handler, xhtml or text?
> >>
> >>
> >> The underlying issue is that we catch and hold on to IOExceptions per
> >> page in PDFs.  We report them in the metadata in /rmeta, but those
> >> won't come through in /tika.
> >>
> >> On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]> 
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > We've hit an issue with the Tika server recently where large PDF 
> >> > documents are only partially extracted when the server is under heavy 
> >> > load.  For example, a 70 page PDF which is normally extracted fine 
> >> > suddenly returns as just 4 or 5 pages.  We use the X-Tika-PDFOcrStrategy 
> >> > header to force OCR and we have the timeout set to 600 seconds in the 
> >> > XML configuration file.  When a partial extraction happens, we get a 2xx 
> >> > response as normal, so it's impossible to tell if the extraction 
> >> > actually worked or not.  By observing the server logs whilst stress 
> >> > testing the Docker container, I can see that the following exception is 
> >> > closely correlated with the error.
> >> >
> >> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >> > at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) 
> >> > ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) 
> >> > ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> > at 
> >> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) 
> >> > ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> > ...snip...
> >> > Caused by: java.io.IOException: org.apache.tika.exception.TikaException: 
> >> > TesseractOCRParser timeout
> >> > at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) 
> >> > ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> > at 
> >> > org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063)
> >> >  ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> > at 
> >> > org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
> >> >  ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> > at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) 
> >> > ~[tika-server-standard-2.2.1.jar:2.2.1]
> >> >
> >> > Would you consider this to be a bug?  In my view it would be much better 
> >> > to get some kind of 5XX HTTP response when this error occurs.
> >> >
> >> > Thanks,
> >> > --Robin
