Hi,
I'd say this is either a bug in jempbox or in tika because of a bad
file. I agree tika shouldn't fail the entire request. At the very least
it's a documentation bug in jempbox. The current javadoc does not
mention what would happen with an incorrect value.
IMHO we should catch it in tika (PDMetadataExtractor) because this would
be only 1 step, i.e. we wouldn't have to wait for it to be fixed in jempbox.
Tilman
PS: if you want you can resubmit your JIRA application, I'd approve it.
Just use the same name so I know it's you.
On 07.04.2025 12:05, siim kurvet wrote:
Hi,
I ran into a problem with Tika-server where pdf parsing fails
seemingly because pdf picture metadata xmp:Rating value is string not
expected integer[0-5].
Error with using: Tika-server 3.1.0.0-full docker image with Tesseract
OCR configured to extractInlineImages=true
Seemingly the cause of the error is PDF containing picture with
metadata: xmp:Rating="2.0"
Error:
org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.pdf.PDFParser@1c82c055
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:363)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.server.core.resource.TikaResource.parseToMetadata(TikaResource.java:594)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.server.core.resource.TikaResource.getJson(TikaResource.java:567)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
~[?:?]
at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
at
org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:244)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:80)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:178)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.eclipse.jetty.server.Server.handle(Server.java:563)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
~[tika-server-standard-3.1.0.jar:3.1.0]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.NumberFormatException: For input string: "2.0"
at
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
~[?:?]
at java.base/java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
at java.base/java.lang.Integer.<init>(Integer.java:1119) ~[?:?]
at
org.apache.jempbox.xmp.XMPSchema.getIntegerProperty(XMPSchema.java:311)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.jempbox.xmp.XMPSchemaBasic.getRating(XMPSchemaBasic.java:309)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.PDMetadataExtractor.extractBasic(PDMetadataExtractor.java:310)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.PDMetadataExtractor.extract(PDMetadataExtractor.java:79)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.PDMetadataExtractor.extract(PDMetadataExtractor.java:75)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.image.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:417)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.image.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:290)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:78)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:919)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:552)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:510)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:157)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.image.ImageGraphicsEngine.run(ImageGraphicsEngine.java:235)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:203)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:296)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1362)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:252)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
~[tika-server-standard-3.1.0.jar:3.1.0]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
~[tika-server-standard-3.1.0.jar:3.1.0]
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
~[tika-server-standard-3.1.0.jar:3.1.0]
... 38 more
Had to disable Tesseract ExtractInlineImages conf to process these
kinds of PDFs.
Is this a bug in tika logic or is there some sort of setting or
workaround that I'm missing? maybe tika could just ignore incorrect
xmp:Rating values with a warning, not fail the whole PDF processing?
Many thanks,
Siim