Re: External Tika Server
In my Tika Server, I added:

-spawnChild -taskTimeoutMillis 100

to bypass the timeout problem.

Mario

From: Furkan KAMACI
Sent: Tuesday, 4 December 2018 10:16
To: user@manifoldcf.apache.org; Rafa Haro
Subject: Re: External Tika Server

Hi Rafa,

I can parse the same document via the HTTP URL of Tika Server. I thought that there may be a timeout parameter within ManifoldCF while communicating with Tika Server :)

Kind Regards,
Furkan KAMACI
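For anyone who starts Tika Server from a script, the flags Mario mentions can be assembled as in the rough sketch below. The two flags are copied verbatim from his message; the jar path and the helper name `tika_server_command` are hypothetical placeholders, and the flag name indicates that the value is in milliseconds.

```python
from typing import List


def tika_server_command(jar_path: str, task_timeout_millis: int) -> List[str]:
    """Build the argument list for launching Tika Server in spawn-child mode.

    jar_path is a placeholder for wherever the tika-server jar lives.
    The -spawnChild and -taskTimeoutMillis flags come from the thread.
    """
    if task_timeout_millis <= 0:
        raise ValueError("task_timeout_millis must be positive")
    return [
        "java", "-jar", jar_path,
        "-spawnChild",
        "-taskTimeoutMillis", str(task_timeout_millis),
    ]


# Same value as in the message; adjust it for your own documents.
command = tika_server_command("tika-server.jar", 100)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.Popen(command)` to start the server.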
Re: External Tika Server
Hi Rafa,

I can parse the same document via the HTTP URL of Tika Server. I thought that there may be a timeout parameter within ManifoldCF while communicating with Tika Server :)

Kind Regards,
Furkan KAMACI

On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro wrote:

> Hi Furkan,
>
> You seem to be getting a timeout from Tesseract. This might be happening
> with large documents (too many pages). Maybe there is some configuration
> parameter for increasing timeouts that you can use on the Tika side.
>
> Rafa
Re: External Tika Server
Hi Furkan,

You seem to be getting a timeout from Tesseract. This might be happening with large documents (too many pages). Maybe there is some configuration parameter for increasing timeouts that you can use on the Tika side.

Rafa

On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI wrote:

> Hi,
>
> I am trying to test the external OCR capabilities of Tika Server with
> ManifoldCF 2.11. Documents are parsed when I curl them into Tika Server
> directly. However, when I try to parse them via Tika Server from
> ManifoldCF, I get this error for *most* of the documents (not all of them):
>
> INFO meta (application/msword)
> WARN meta: Text extraction failed
> org.apache.tika.exception.TikaException: Unable to extract PDF content
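The direct check described in the quoted message (sending a document straight to Tika Server over HTTP) can be reproduced with an explicit client-side timeout, which helps separate a slow OCR run from a client that gives up early. This is only a sketch using the Python standard library: the `/meta` endpoint matches the `MetadataResource` frames in the trace, `http://localhost:9998` assumes a default local Tika Server, and the helper names are made up for illustration. It is not ManifoldCF's own client code.

```python
import urllib.request


def build_meta_request(server_url: str, document: bytes,
                       content_type: str) -> urllib.request.Request:
    """Build a PUT request for Tika Server's /meta endpoint."""
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/meta",
        data=document,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "application/json"},
    )


def fetch_metadata(request: urllib.request.Request,
                   timeout_seconds: float) -> bytes:
    """Send the request; blocking socket operations time out after timeout_seconds."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read()


# Building the request needs no running server; the body here is a placeholder.
request = build_meta_request("http://localhost:9998", b"placeholder bytes",
                             "application/pdf")
print(request.get_method(), request.full_url)
```

With a server running, `fetch_metadata(request, timeout_seconds=300)` would send the document; varying the timeout shows whether the failure follows the client's patience or the server's OCR limit.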
External Tika Server
Hi,

I am trying to test the external OCR capabilities of Tika Server with ManifoldCF 2.11. Documents are parsed when I curl them into Tika Server directly. However, when I try to parse them via Tika Server from ManifoldCF, I get this error for *most* of the documents (not all of them):

INFO meta (application/msword)
WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
    at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
    at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
    at
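In a trace this long, the useful part is the chain of "Caused by:" lines, since the last one names the underlying failure. The sketch below pulls that chain out of a Java stack trace; the function name `cause_chain` is made up for illustration, and the sample is an abbreviated copy of the trace above.

```python
import re
from typing import List, Tuple

# Matches "Caused by: <exception class>: <message>" on a single line.
CAUSE_PATTERN = re.compile(r"Caused by: (?P<cls>[\w.$]+): (?P<msg>.*)")


def cause_chain(trace: str) -> List[Tuple[str, str]]:
    """Return (exception class, message) pairs for every 'Caused by:' line, in order."""
    causes = []
    for line in trace.splitlines():
        match = CAUSE_PATTERN.search(line.strip())
        if match:
            causes.append((match.group("cls"), match.group("msg").strip()))
    return causes


# Abbreviated from the trace in this message.
sample = """org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
    ... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
"""

chain = cause_chain(sample)
root_class, root_message = chain[-1]
print(root_message)
```

The last entry in the chain is the `TesseractOCRParser timeout`, which points at the OCR step rather than PDF parsing itself.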