Partial OCR extractions under memory pressure

Harvey, Robin Thu, 07 Apr 2022 05:34:04 -0700

Hi,

We've hit an issue with the Tika server recently where large PDF documents
are only partially extracted when the server is under heavy load.  For
example, a 70-page PDF which is normally extracted fine suddenly returns as
just 4 or 5 pages.  We use the X-Tika-PDFOcrStrategy header to force OCR
and we have the timeout set to 600 seconds in the XML configuration file.
When a partial extraction happens, we get a 2xx response as normal, so it's
impossible to tell if the extraction actually worked or not.  By observing
the server logs whilst stress testing the Docker container, I can see that
the following exception is closely correlated with the error.


org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
    ...snip...
Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
    at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]

Would you consider this to be a bug?  In my view it would be much better to
get some kind of 5XX HTTP response when this error occurs.
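
Since the 2xx status alone can't be trusted here, one possible client-side
workaround is to compare the number of pages present in the returned XHTML
with the page count the PDF declares.  The sketch below is only an
illustration: the function names are hypothetical, it assumes each extracted
page shows up as a <div class="page"> element in the XHTML output, and it
assumes the declared page count is obtained separately (for example from the
xmpTPg:NPages metadata value).  I have not verified that pages skipped after
an OCR timeout are always missing from the output, so treat it as a heuristic.

```python
import re

# Assumption: every extracted page is wrapped in <div class="page"> in the
# XHTML returned by the server. This is a heuristic, not a guarantee.
_PAGE_DIV = re.compile(r'<div\s+class="page"')


def count_extracted_pages(xhtml: str) -> int:
    """Count the page containers present in the XHTML output."""
    return len(_PAGE_DIV.findall(xhtml))


def looks_partial(xhtml: str, declared_pages: int) -> bool:
    """Return True if fewer pages were extracted than the PDF declares.

    declared_pages is supplied by the caller (hypothetically taken from
    the document's page-count metadata).
    """
    return count_extracted_pages(xhtml) < declared_pages
```

A caller could then treat looks_partial(...) returning True as a failed
extraction and retry or raise an error, even though the HTTP status was 2xx.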

Thanks,
--Robin
