Re: Partial OCR extractions under memory pressure

Tim Allison Thu, 07 Apr 2022 06:45:54 -0700

Yes, I think I agree.  Which endpoint are you using, /tika or /rmeta?
Which handler, xhtml or text?



The underlying issue is that we catch and hold on to IOExceptions per
page in PDFs.  We report them in the metadata in /rmeta, but those
won't come through in /tika.
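
As a rough client-side sketch of what checking the /rmeta metadata could
look like (not something from the Tika codebase): the helper names, the
default URL, and the "X-TIKA:EXCEPTION:" key prefix below are assumptions
to verify against your Tika version.

```python
import json
import urllib.request

# Assumption: exception-related metadata keys share this prefix.
# Check the keys your Tika version actually emits before relying on it.
EXCEPTION_KEY_PREFIX = "X-TIKA:EXCEPTION:"


def fetch_rmeta(pdf_bytes, url="http://localhost:9998/rmeta/text"):
    """PUT a document to /rmeta and return the raw JSON response body.

    The URL is a placeholder for wherever your tika-server is listening.
    """
    request = urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def find_reported_exceptions(rmeta_json):
    """Return (record_index, key, value) for each exception-like metadata key.

    /rmeta returns a JSON list of metadata objects: the container document
    first, then any embedded documents.
    """
    records = json.loads(rmeta_json)
    found = []
    for index, record in enumerate(records):
        for key, value in record.items():
            if key.startswith(EXCEPTION_KEY_PREFIX):
                found.append((index, key, value))
    return found
```

A caller could treat a non-empty result from find_reported_exceptions() as
a failed or partial extraction even though the HTTP status was 2xx.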

On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]> wrote:
>
> Hi,
>
> We've hit an issue with the Tika server recently where large PDF documents 
> are only partially extracted when the server is under heavy load.  For 
> example, a 70 page PDF which is normally extracted fine suddenly returns as 
> just 4 or 5 pages.  We use the X-Tika-PDFOcrStrategy header to force OCR and 
> we have the timeout set to 600 seconds in the XML configuration file.  When a 
> partial extraction happens, we get a 2xx response as normal, so it's 
> impossible to tell if the extraction actually worked or not.  By observing 
> the server logs whilst stress testing the Docker container, I can see that 
> the following exception is closely correlated with the error.
>
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) 
> ~[tika-server-standard-2.2.1.jar:2.2.1]
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) 
> ~[tika-server-standard-2.2.1.jar:2.2.1]
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) 
> ~[tika-server-standard-2.2.1.jar:2.2.1]
> ...snip...
> Caused by: java.io.IOException: org.apache.tika.exception.TikaException: 
> TesseractOCRParser timeout
> at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) 
> ~[tika-server-standard-2.2.1.jar:2.2.1]
> at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063)
> ~[tika-server-standard-2.2.1.jar:2.2.1]
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) 
> ~[tika-server-standard-2.2.1.jar:2.2.1]
> at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) 
> ~[tika-server-standard-2.2.1.jar:2.2.1]
>
> Would you consider this to be a bug?  In my view it would be much better to 
> get some kind of 5XX HTTP response when this error occurs.
>
> Thanks,
> --Robin
