Yes, I think I agree. Which endpoint are you using, /tika or /rmeta? Which handler, xhtml or text?
The underlying issue is that we catch and hold on to IOExceptions per page in PDFs. We report them in the metadata in /rmeta, but those won't come through in /tika.

On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]> wrote:
>
> Hi,
>
> We've hit an issue with the Tika server recently where large PDF documents
> are only partially extracted when the server is under heavy load. For
> example, a 70 page PDF which is normally extracted fine suddenly returns as
> just 4 or 5 pages. We use the X-Tika-PDFOcrStrategy header to force OCR and
> we have the timeout set to 600 seconds in the XML configuration file. When a
> partial extraction happens, we get a 2xx response as normal, so it's
> impossible to tell if the extraction actually worked or not. By observing
> the server logs whilst stress testing the Docker container, I can see that
> the following exception is closely correlated with the error.
>
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
>     ...snip...
> Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
>     at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
>     at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]
>
> Would you consider this to be a bug? In my view it would be much better to
> get some kind 5XX HTTP response when this error occurs.
>
> Thanks,
> --Robin
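
For the /rmeta route mentioned at the top of this message, a client could detect a partial extraction by scanning the returned metadata itself. This is only a rough sketch: it assumes the per-page exceptions appear under metadata keys starting with `X-TIKA:EXCEPTION:` (worth confirming against the JSON your server actually returns), and the function name and sample payload are hypothetical, not part of Tika.

```python
import json

# Assumption: Tika reports caught parse problems under keys with this prefix.
EXCEPTION_KEY_PREFIX = "X-TIKA:EXCEPTION:"


def find_parse_exceptions(rmeta_json):
    """Return (document_index, key, value) for each exception-like entry.

    rmeta_json is the response body of /rmeta: a JSON array with one
    metadata object per document (the container first, then embedded ones).
    """
    documents = json.loads(rmeta_json)
    problems = []
    for index, metadata in enumerate(documents):
        for key, value in metadata.items():
            if key.startswith(EXCEPTION_KEY_PREFIX):
                problems.append((index, key, value))
    return problems


# Hypothetical payload, made up for illustration only.
sample = json.dumps([
    {
        "X-TIKA:content": "text of the pages that did get extracted",
        "X-TIKA:EXCEPTION:warn": "java.io.IOException: TesseractOCRParser timeout",
    }
])

for index, key, value in find_parse_exceptions(sample):
    print(f"document {index}: {key} -> {value}")
```

A caller could treat any non-empty result as a failed extraction and retry or flag the file, even though the HTTP status was 2xx.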
