Hi,

We've hit an issue with the Tika server recently where large PDF documents are only partially extracted when the server is under heavy load. For example, a 70-page PDF which is normally extracted fine suddenly returns as just 4 or 5 pages.

We use the X-Tika-PDFOcrStrategy header to force OCR and we have the timeout set to 600 seconds in the XML configuration file. When a partial extraction happens, we get a 2xx response as normal, so it's impossible to tell if the extraction actually worked or not.

By observing the server logs whilst stress testing the Docker container, I can see that the following exception is closely correlated with the error.

org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
    ...snip...
Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
    at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
    at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]

Would you consider this to be a bug? In my view it would be much better to get some kind of 5XX HTTP response when this error occurs.

Thanks,
--Robin
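
P.S. Since the 2xx status gives no signal, one possible client-side stopgap is to compare the number of pages that actually appear in the output against the page count the document declares. The sketch below is only an illustration, not a tested workaround. It assumes the XHTML output is requested (for example with "Accept: text/html"), that it carries a <meta name="xmpTPg:NPages" content="N"/> entry, and that each extracted page is wrapped in <div class="page">. It also assumes the name attribute precedes content in the meta tag. The function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: Tika's XHTML output wraps each extracted page in <div class="page">.
PAGE_DIV = re.compile(r'<div\s+class="page"')

# Assumption: the declared page count is exposed as a meta tag named
# xmpTPg:NPages, with the name attribute written before content.
NPAGES_META = re.compile(r'<meta\s+name="xmpTPg:NPages"\s+content="(\d+)"')


def declared_pages(xhtml: str) -> Optional[int]:
    """Return the page count declared in the metadata, or None if absent."""
    match = NPAGES_META.search(xhtml)
    return int(match.group(1)) if match else None


def extracted_pages(xhtml: str) -> int:
    """Count the page containers actually present in the output."""
    return len(PAGE_DIV.findall(xhtml))


def looks_truncated(xhtml: str) -> bool:
    """True when fewer pages were emitted than the metadata declares.

    Returns False when no declared count is available, because nothing
    can be concluded in that case.
    """
    declared = declared_pages(xhtml)
    return declared is not None and extracted_pages(xhtml) < declared
```

A caller could treat looks_truncated(response_body) as a failure and retry or alert. It would not catch a case where the metadata itself is missing, so it is no substitute for a proper error status from the server.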
