The REST endpoint we're using is /rmeta/text, not totally sure which handler TBH. The request looks like this:

PUT /rmeta/text HTTP/1.1
Host: localhost:9998
User-Agent: python-requests/2.27.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
X-Tika-PDFOcrStrategy: ocr_only
X-Tika-Skip-Embedded: true
Content-Length: 259385

On Thu, Apr 7, 2022 at 2:46 PM Tim Allison <[email protected]> wrote:

> Y. I agree, I think. Which endpoint are you using /tika or /rmeta?
> Which handler, xhtml or text?
>
> The underlying issue is that we catch and hold on to IOExceptions per
> page in PDFs. We report them in the metadata in /rmeta, but those
> won't come through in /tika.
>
> On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]> wrote:
> >
> > Hi,
> >
> > We've hit an issue with the Tika server recently where large PDF
> > documents are only partially extracted when the server is under heavy
> > load. For example, a 70 page PDF which is normally extracted fine
> > suddenly returns as just 4 or 5 pages. We use the X-Tika-PDFOcrStrategy
> > header to force OCR and we have the timeout set to 600 seconds in the
> > XML configuration file. When a partial extraction happens, we get a 2xx
> > response as normal, so it's impossible to tell if the extraction
> > actually worked or not. By observing the server logs whilst stress
> > testing the Docker container, I can see that the following exception is
> > closely correlated with the error.
> >
> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >     at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >     ...snip...
> > Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
> >     at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >     at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >
> > Would you consider this to be a bug? In my view it would be much better
> > to get some kind 5XX HTTP response when this error occurs.
> >
> > Thanks,
> >
> > --Robin
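
Since the per-page IOExceptions are said above to be reported in the /rmeta metadata, a client could probably detect a partial extraction even when the response is 2xx. Below is a rough Python sketch of that idea, not a confirmed recipe. It assumes the exceptions appear under metadata keys beginning with "X-TIKA:EXCEPTION:" and that the text is returned under "X-TIKA:content"; both key names are assumptions to verify against your server version. The helper names find_exceptions and extract_text are made up for illustration.

```python
import json
import urllib.request

# Assumption: tika-server records parse problems under metadata keys that
# start with this prefix. Check the actual keys your server returns.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def find_exceptions(rmeta_docs):
    """Return (doc_index, key, value) for each exception-like metadata entry.

    rmeta_docs is the parsed JSON body of an /rmeta response: a list with
    one metadata dict per document (container first, then embedded files).
    """
    found = []
    for index, doc in enumerate(rmeta_docs):
        for key, value in doc.items():
            if key.startswith(EXCEPTION_PREFIX):
                found.append((index, key, value))
    return found


def extract_text(pdf_bytes, base_url="http://localhost:9998"):
    """PUT a PDF to /rmeta/text and raise if any exception was recorded."""
    request = urllib.request.Request(
        base_url + "/rmeta/text",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "application/json",
            "X-Tika-PDFOcrStrategy": "ocr_only",
            "X-Tika-Skip-Embedded": "true",
        },
    )
    with urllib.request.urlopen(request) as response:
        docs = json.load(response)

    problems = find_exceptions(docs)
    if problems:
        # Treat a recorded exception as a failed extraction, even on 2xx.
        raise RuntimeError(f"Tika reported {len(problems)} exception(s): {problems}")

    # Assumption: the extracted text is stored under "X-TIKA:content".
    return docs[0].get("X-TIKA:content", "")
```

Calling extract_text(open("large.pdf", "rb").read()) would then raise instead of silently returning 4 or 5 pages, provided the timeout really does show up in the metadata as described.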
