Hi Rafa,

I can parse the same document via the HTTP URL of Tika Server. I thought
there might be a timeout parameter within ManifoldCF that applies while it
communicates with Tika Server :)
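
For reference, the "Caused by" chain in the trace quoted below ends in
java.util.concurrent.TimeoutException, raised under
TesseractOCRParser.doOCR. Here is a minimal sketch for pulling that chain
out of a log, assuming the trace is available as a plain string. The helper
name cause_chain is hypothetical and not part of Tika or ManifoldCF:

```python
import re

# Matches a Java exception header line, with or without "Caused by: ".
# Assumption: class names end in "...Exception..." or "...Error...".
_HEADER = re.compile(r"(?:Caused by: )?([\w.$]+(?:Exception|Error)\w*)(?::|$)")


def cause_chain(trace):
    """Return exception class names from a Java stack trace, outermost first."""
    chain = []
    for raw in trace.splitlines():
        # Drop mail quote markers such as ">> " before matching.
        line = re.sub(r"^(?:>\s?)+", "", raw).strip()
        match = _HEADER.match(line)
        if match:
            chain.append(match.group(1))
    return chain


if __name__ == "__main__":
    excerpt = (
        ">> org.apache.tika.exception.TikaException: Unable to extract PDF content\n"
        ">> Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser\n"
        ">> timeout\n"
        ">> Caused by: java.util.concurrent.TimeoutException\n"
    )
    # The last entry is the root cause.
    print(cause_chain(excerpt)[-1])
```

The last element of the returned list is the root cause, which is the first
thing worth checking when deciding where a timeout has to be raised.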

Kind Regards,
Furkan KAMACI

On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <[email protected]> wrote:

> Hi Furkan,
>
> You seem to be getting a timeout from Tesseract. This might be happening
> with large documents (too many pages). Maybe there is a configuration
> parameter for increasing timeouts that you can use on the Tika side.
>
> Rafa
>
> On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am trying to test the external OCR capabilities of Tika Server with
>> ManifoldCF 2.11. Documents are parsed when I curl them into Tika Server
>> directly. However, when I try to parse them via ManifoldCF, I get this
>> error for *most* of the documents (not all of them):
>>
>> INFO  meta (application/msword)
>> WARN  meta: Text extraction failed
>> org.apache.tika.exception.TikaException: Unable to extract PDF content
>> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>> at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
>> at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
>> at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
>> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>> at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
>> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
>> at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>> at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>> at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>> at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>> at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>> at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>> at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>> at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
>> at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>> at org.eclipse.jetty.server.Server.handle(Server.java:531)
>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
>> at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
>> at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>> at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>> at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>> at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
>> at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
>> at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
>> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
>> at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
>> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
>> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>> ... 44 more
>> Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
>> at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
>> at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
>> at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
>> at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
>> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
>> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
>> ... 50 more
>> Caused by: java.util.concurrent.TimeoutException
>> at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>> at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
>> ... 55 more
>>
>> How can I solve it?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>
