Yes, we should at least catch this in Tika. This should not be a showstopper.

https://issues.apache.org/jira/browse/TIKA-4401
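
For illustration only, here is a minimal sketch of the kind of guard discussed in this thread. It is not the actual TIKA-4401 change. The class name RatingGuard and the method safeRating are hypothetical, and the lambda merely stands in for the jempbox getRating() call seen in the stack trace, which throws NumberFormatException on "2.0".

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Tika or jempbox.
final class RatingGuard {

    /**
     * Calls the supplied getter and returns its value, or null when the
     * getter rejects the stored value with a NumberFormatException.
     */
    static Integer safeRating(Supplier<Integer> ratingGetter) {
        try {
            return ratingGetter.get();
        } catch (NumberFormatException e) {
            // Warn and carry on instead of failing the whole parse.
            System.err.println("Ignoring malformed xmp:Rating: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        // Integer.valueOf stands in for the getter reading the XMP value.
        Integer bad = safeRating(() -> Integer.valueOf("2.0"));
        Integer good = safeRating(() -> Integer.valueOf("2"));
        System.out.println(bad + " " + good);
    }
}
```

The point of the design is that a malformed rating is skipped with a warning while the rest of the metadata and the text extraction continue.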

On Mon, Apr 7, 2025 at 6:34 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> I'd say this is either a bug in jempbox or in tika because of a bad file.
> I agree tika shouldn't fail the entire request. At the very least it's a
> documentation bug in jempbox. The current javadoc does not mention what
> would happen with an incorrect value.
>
> IMHO we should catch it in tika (PDMetadataExtractor) because this would
> be only 1 step, i.e. we wouldn't have to wait for it to be fixed in jempbox.
>
> Tilman
> PS: if you want you can resubmit your JIRA application, I'd approve it.
> Just use the same name so I know it's you.
>
> On 07.04.2025 12:05, siim kurvet wrote:
>
> Hi,
>
> I ran into a problem with Tika-server where PDF parsing fails, seemingly
> because the xmp:Rating value in an embedded picture's metadata is a string
> rather than the expected integer [0-5].
> The error occurs with the Tika-server 3.1.0.0-full docker image, with
> Tesseract OCR configured and extractInlineImages=true.
>
> The cause of the error appears to be a PDF containing a picture with the metadata:
> xmp:Rating="2.0"
>
> Error:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@1c82c055
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:363)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.server.core.resource.TikaResource.parseToMetadata(TikaResource.java:594)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.server.core.resource.TikaResource.getJson(TikaResource.java:567)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> ~[?:?]
> at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
> at
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:244)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:80)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:178)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.eclipse.jetty.server.Server.handle(Server.java:563)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
> Caused by: java.lang.NumberFormatException: For input string: "2.0"
> at
> java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
> ~[?:?]
> at java.base/java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
> at java.base/java.lang.Integer.<init>(Integer.java:1119) ~[?:?]
> at org.apache.jempbox.xmp.XMPSchema.getIntegerProperty(XMPSchema.java:311)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.jempbox.xmp.XMPSchemaBasic.getRating(XMPSchemaBasic.java:309)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.PDMetadataExtractor.extractBasic(PDMetadataExtractor.java:310)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.PDMetadataExtractor.extract(PDMetadataExtractor.java:79)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.PDMetadataExtractor.extract(PDMetadataExtractor.java:75)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.image.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:417)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.image.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:290)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:78)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:919)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:552)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:510)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:157)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.image.ImageGraphicsEngine.run(ImageGraphicsEngine.java:235)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:203)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:296)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1362)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:252)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ~[tika-server-standard-3.1.0.jar:3.1.0]
> ... 38 more
>
> I had to disable the Tesseract extractInlineImages setting to process
> these kinds of PDFs.
>
> Is this a bug in Tika's logic, or is there some setting or workaround
> that I'm missing? Maybe Tika could just ignore incorrect xmp:Rating values
> with a warning rather than fail the whole PDF processing?
>
> Many thanks,
> Siim
>
>
>
