Hi Mario,

Thanks for the answer. I still get an error for a PDF that parses fine when I send it to Tika Server directly over HTTP but fails when it goes through ManifoldCF. This is the error:

WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
    at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
    at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
    at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
    at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
    at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
    at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
    at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
    at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
    at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
    at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
    at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
    at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
    at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
    at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
    at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
    at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
    at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
    at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
    at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 42 more
INFO tika (application/pdf)
WARN No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10

On Tue, Dec 4, 2018 at 3:36 PM Bisonti Mario <[email protected]> wrote:

> In my tika server, I added:
>
> -spawnChild -taskTimeoutMillis 1000000
>
> To bypass the timeout problem
>
> Mario
>
> *From:* Furkan KAMACI <[email protected]>
> *Sent:* Tuesday, December 4, 2018 10:16
> *To:* [email protected]; Rafa Haro <[email protected]>
> *Subject:* Re: External Tika Server
>
> Hi Rafa,
>
> I can parse the same document via the HTTP URL of Tika Server. I thought that
> there may be a timeout parameter within ManifoldCF while communicating with
> Tika Server :)
>
> Kind Regards,
> Furkan KAMACI
>
> On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <[email protected]> wrote:
>
> Hi Furkan,
>
> You seem to be getting a Timeout from Tesseract. This might be happening
> with large documents (too many pages). Maybe there is some configuration
> parameter for increasing timeouts that you can use on the Tika side.
>
> Rafa
>
> On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <[email protected]> wrote:
>
> Hi,
>
> I am trying to test the external OCR capabilities of Tika Server with ManifoldCF 2.11.
> Documents are parsed when I curl documents into Tika Server directly.
> However, when I try to parse them via ManifoldCF I get this error for
> *most* of the documents (not all of them):
>
> INFO meta (application/msword)
> WARN meta: Text extraction failed
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
>     at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
>     at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>     at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:531)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
>     at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>     ... 44 more
> Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
>     at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
>     ... 50 more
> Caused by: java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
>     ... 55 more
>
> How can I solve it?
>
> Kind Regards,
> Furkan KAMACI
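
Both stack traces in this thread fail inside MetadataResource.getMetadata, which as far as I know backs Tika Server's /meta endpoint. A direct HTTP test is therefore probably only comparable to the ManifoldCF request if it targets /meta as well. The following is a minimal Python sketch of such a check, not a description of what ManifoldCF actually sends. The base URL (Tika Server's usual default of localhost:9998), the function names and the client-side timeout are assumptions chosen for illustration.

```python
import json
import sys
import urllib.request

# Assumption: Tika Server runs on its usual default address; adjust as needed.
TIKA_BASE_URL = "http://localhost:9998"


def build_meta_request(pdf_bytes, base_url=TIKA_BASE_URL):
    """Build (but do not send) a PUT request for Tika Server's /meta endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/meta",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            # Ask for JSON so the metadata is easy to inspect.
            "Accept": "application/json",
        },
    )


def fetch_metadata(pdf_path, base_url=TIKA_BASE_URL, timeout_seconds=1000):
    """Send the PDF to /meta and return the parsed JSON metadata.

    timeout_seconds is a client-side limit only (hypothetical value);
    it does not change any timeout inside Tika or Tesseract.
    """
    with open(pdf_path, "rb") as handle:
        request = build_meta_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Usage: python check_meta.py problem.pdf
    if len(sys.argv) > 1:
        print(json.dumps(fetch_metadata(sys.argv[1]), indent=2))
```

If the server cannot parse the document, urlopen raises urllib.error.HTTPError, and the matching stack trace should then appear in the Tika Server log, which makes it easier to tell whether the failure depends on ManifoldCF at all.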
