Sorry. You’re right. I think Tesseract is supposed to handle multi-page TIFFs on its own.

On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <[email protected]> wrote:

> Hi Tim,
>
> I'm sure Tika does that for PDFs, but I couldn't find that logic in the
> code base for TIFFs. Could you point to me what class does that?
>
> Luis
>
> On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]> wrote:
>
>> Yes. Exactly right. Tika spawns a process per page/image.
>>
>> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg
>> <[email protected]> wrote:
>>
>>> I believe that Tika just OCR’s one page at a time. My guess is that it
>>> spawns a process for each page.
>>>
>>> Peter Kronenberg | Senior AI Analytic Engineer
>>> C: 703.887.5623
>>> Torch AI <http://www.torch.ai/>
>>> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
>>> WWW.TORCH.AI <http://www.torch.ai/>
>>>
>>> From: Luís Filipe Nassif <[email protected]>
>>> Sent: Wednesday, January 19, 2022 11:11 AM
>>> To: [email protected]
>>> Subject: Re: TesseractOCRParser timeout
>>>
>>> Just a guess, if you are OCRing multipage TIF files, that may be the
>>> reason, I "think" Tika sends the whole TIF to tesseract and that could take
>>> a large amount of time if there are lots of pages, triggering timeouts. In
>>> our project, we send each TIF page at a time to tesseract and restart the
>>> timeout counter to avoid this.
>>>
>>> Best regards,
>>>
>>> Luís Nassif
>>>
>>> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg
>>> <[email protected]> wrote:
>>>
>>> Unrelated to my previous questions. I’m getting some sort of timeout in
>>> Tika in TesseractOCRParser.runOCRProcess. It’s one of the errors that say
>>> ‘TesseractOCRParser timeout’. What exactly is it doing here? Does it
>>> spawn a separate process to do the OCR? We’re having some performance
>>> issues, so in a way, this doesn’t come as a surprise. Just trying to
>>> understand a little more what’s going on
>>>
>>> private void runOCRProcess(Process process, int timeout) throws
>>>         IOException, TikaException {
>>>     process.getOutputStream().close();
>>>     InputStream out = process.getInputStream();
>>>     InputStream err = process.getErrorStream();
>>>     StringBuilder outBuilder = new StringBuilder();
>>>     StringBuilder errBuilder = new StringBuilder();
>>>     Thread outThread = this.logStream(out, outBuilder);
>>>     Thread errThread = this.logStream(err, errBuilder);
>>>     outThread.start();
>>>     errThread.start();
>>>     int exitValue = -2147483648;
>>>
>>>     try {
>>>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>>>         if (!finished) {
>>>             throw new TikaException("TesseractOCRParser timeout");
>>>         }
>>>
>>>         exitValue = process.exitValue();
>>>     } catch (InterruptedException var12) {
>>>         Thread.currentThread().interrupt();
>>>         throw new TikaException("TesseractOCRParser interrupted",
>>>                 var12);
>>>     } catch (IllegalThreadStateException var13) {
>>>         throw new TikaException("TesseractOCRParser timeout");
>>>     }
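
For anyone following the thread: the timeout check in the quoted method comes down to `Process.waitFor(timeout, unit)`, which returns `false` if the child is still running when the deadline passes. Below is a minimal standalone sketch of that pattern, not Tika's actual code. The class name `ProcessTimeoutSketch` and the method `runWithTimeout` are hypothetical, it throws `TimeoutException` where Tika throws `TikaException`, and it discards the child's output (Java 9+) instead of logging it on separate threads as the quoted code does.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of Tika.
final class ProcessTimeoutSketch {

    /**
     * Starts the command as a child process and waits at most
     * timeoutSeconds for it to exit. Returns the exit code, or throws
     * TimeoutException if the deadline passes first.
     */
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException, TimeoutException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so the child cannot
        // block on a full pipe buffer (Redirect.DISCARD needs Java 9+).
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        try {
            // waitFor returns false when the timeout elapses first.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                throw new TimeoutException(
                        "process still running after " + timeoutSeconds + "s");
            }
            return process.exitValue();
        } finally {
            // Takes no action if the process has already exited; otherwise
            // kills it so a timed-out or interrupted run leaves no orphan.
            process.destroyForcibly();
        }
    }
}
```

Calling something like `runWithTimeout(List.of("tesseract", "page-1.tif", "out"), 120)` once per page (file names here are placeholders) would give each page its own timeout budget, which is the approach Luís describes for his project.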
