Re: [External] Re: Partial OCR extractions under memory pressure

Harvey, Robin Thu, 07 Apr 2022 09:35:48 -0700

The REST endpoint we're using is /rmeta/text, not totally sure which
handler TBH.  The request looks like this:


PUT /rmeta/text HTTP/1.1
Host: localhost:9998
User-Agent: python-requests/2.27.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
X-Tika-PDFOcrStrategy: ocr_only
X-Tika-Skip-Embedded: true
Content-Length: 259385
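
For anyone hitting the same thing, here is a minimal sketch of how a client could spot a partial extraction from the /rmeta response, based on the point in the quoted message below that per-page exceptions are reported in the metadata. It is not a confirmed fix. The key prefix "X-TIKA:EXCEPTION" is an assumption about how Tika names those metadata entries and should be checked against your version. The helper names find_exception_entries and extract_with_check are made up for illustration, and the sketch uses the standard library's urllib rather than requests so that it has no dependencies.

```python
import json
import urllib.request

# Assumption: Tika records caught exceptions under metadata keys that start
# with this prefix (for example "X-TIKA:EXCEPTION:warn"). Verify this against
# the Tika version you are running.
EXCEPTION_KEY_PREFIX = "X-TIKA:EXCEPTION"


def find_exception_entries(rmeta_records):
    """Return (record_index, key, value) for every exception-like metadata key."""
    hits = []
    for index, record in enumerate(rmeta_records):
        for key, value in record.items():
            if key.startswith(EXCEPTION_KEY_PREFIX):
                hits.append((index, key, value))
    return hits


def extract_with_check(pdf_bytes, url="http://localhost:9998/rmeta/text"):
    """PUT a PDF to /rmeta/text and raise if any exception was recorded."""
    request = urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "X-Tika-PDFOcrStrategy": "ocr_only",
            "X-Tika-Skip-Embedded": "true",
        },
    )
    with urllib.request.urlopen(request) as response:
        # /rmeta returns a JSON array of metadata objects.
        records = json.loads(response.read().decode("utf-8"))
    problems = find_exception_entries(records)
    if problems:
        raise RuntimeError(
            f"Tika reported exceptions despite a 2xx response: {problems}"
        )
    return records
```

With something like this, the caller treats a 2xx response that carries exception metadata as a failed extraction and can retry or alert, instead of silently storing 4 or 5 pages of a 70 page document.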


On Thu, Apr 7, 2022 at 2:46 PM Tim Allison <[email protected]> wrote:

> Y. I agree, I think.  Which endpoint are you using /tika or /rmeta?
> Which handler, xhtml or text?
>
>
> The underlying issue is that we catch and hold on to IOExceptions per
> page in PDFs.  We report them in the metadata in /rmeta, but those
> won't come through in /tika.
>
> On Thu, Apr 7, 2022 at 8:34 AM Harvey, Robin <[email protected]>
> wrote:
> >
> > Hi,
> >
> > We've hit an issue with the Tika server recently where large PDF
> > documents are only partially extracted when the server is under heavy
> > load.  For example, a 70 page PDF which is normally extracted fine suddenly
> > returns as just 4 or 5 pages.  We use the X-Tika-PDFOcrStrategy header to
> > force OCR and we have the timeout set to 600 seconds in the XML
> > configuration file.  When a partial extraction happens, we get a 2xx
> > response as normal, so it's impossible to tell if the extraction actually
> > worked or not.  By observing the server logs whilst stress testing the
> > Docker container, I can see that the following exception is closely
> > correlated with the error.
> >
> > org.apache.tika.exception.TikaException: Unable to extract PDF content
> > at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > ...snip...
> > Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
> > at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
> > at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]
> >
> > Would you consider this to be a bug?  In my view it would be much better
> > to get some kind of 5XX HTTP response when this error occurs.
> >
> > Thanks,
> > --Robin
>
