Hello. Which version of Tika Server are you running? You could try downloading the latest build from here to check whether it works:
https://builds.apache.org/job/Tika-trunk/lastStableBuild/

From: Furkan KAMACI <[email protected]>
Sent: Wednesday, December 5, 2018 13:37
To: [email protected]
Cc: Rafa Haro <[email protected]>
Subject: Re: External Tika Server

Hi Mario,

Thanks for the answer. I still get an error for a PDF that parses fine over HTTP but fails via ManifoldCF. This is the error:

WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
    at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
    at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
    at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
    at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
    at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
    at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
    at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
    at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
    at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
    at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
    at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
    at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
    at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
    at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
    at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
    at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
    at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
    at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
    at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 42 more
INFO tika (application/pdf)
WARN No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10

On Tue, Dec 4, 2018 at 3:36 PM Bisonti Mario <[email protected]> wrote:

In my Tika server, I added:
-spawnChild -taskTimeoutMillis 1000000
to bypass the timeout problem.

Mario

From: Furkan KAMACI <[email protected]>
Sent: Tuesday, December 4, 2018 10:16
To: [email protected]; Rafa Haro <[email protected]>
Subject: Re: External Tika Server

Hi Rafa,

I can parse the same document via the HTTP URL of Tika Server.
I thought that there may be a timeout parameter within ManifoldCF for communicating with Tika Server :)

Kind Regards,
Furkan KAMACI

On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <[email protected]> wrote:

Hi Furkan,

You seem to be getting a timeout from Tesseract. This might be happening with large documents (too many pages). Maybe there is a configuration parameter for increasing timeouts that you can use on the Tika side.

Rafa

On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <[email protected]> wrote:

Hi,

I am trying to test the external OCR capabilities of Tika Server with ManifoldCF 2.11. Documents are parsed when I curl them into Tika Server directly. However, when I try to parse them via Tika Server from ManifoldCF, I get this error for most of the documents (not all of them):

INFO meta (application/msword)
WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
    at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
    at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
    ... 50 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
    ... 55 more

How can I solve it?

Kind Regards,
Furkan KAMACI
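
A note on reading the two traces in this thread: they end in different root causes. The earlier one bottoms out in a java.util.concurrent.TimeoutException inside TesseractOCRParser.doOCR, while the later one bottoms out in a java.awt.image.RasterFormatException raised while PDFBox renders the page for OCR, so a longer timeout may not help with the second. The sketch below shows one way to pull the "Caused by:" chain out of such a log. CauseChain is a hypothetical helper, not part of Tika or ManifoldCF, and the sample traces are abbreviated from the messages above; it assumes one stack frame per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or ManifoldCF: lists the
// "Caused by:" entries of a Java stack trace (one frame per line).
class CauseChain {

    // Abbreviated from the TesseractOCRParser trace in this thread.
    static final String TIMEOUT_TRACE = String.join("\n",
        "org.apache.tika.exception.TikaException: Unable to extract PDF content",
        "    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)",
        "Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page",
        "    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)",
        "Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout",
        "    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)",
        "Caused by: java.util.concurrent.TimeoutException",
        "    at java.util.concurrent.FutureTask.get(FutureTask.java:205)");

    // Abbreviated from the RasterFormatException trace in this thread.
    static final String RASTER_TRACE = String.join("\n",
        "org.apache.tika.exception.TikaException: Unexpected RuntimeException",
        "    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)",
        "Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster",
        "    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)");

    private static final String PREFIX = "Caused by: ";

    // Returns every "Caused by:" entry in order of appearance.
    static List<String> causes(String trace) {
        List<String> chain = new ArrayList<>();
        for (String raw : trace.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith(PREFIX)) {
                chain.add(line.substring(PREFIX.length()));
            }
        }
        return chain;
    }

    // Returns the last "Caused by:" entry, or "" if the trace has none.
    static String rootCause(String trace) {
        List<String> chain = causes(trace);
        return chain.isEmpty() ? "" : chain.get(chain.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println("first report:  " + rootCause(TIMEOUT_TRACE));
        System.out.println("second report: " + rootCause(RASTER_TRACE));
    }
}
```

Comparing the last entry of each chain is a quick check on whether a new failure is the same problem as before or a different one.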
