Hmm, so forget my thoughts about timeouts with TIFs :-)

On Thu, Jan 20, 2022 at 00:08, Peter Kronenberg <[email protected]> wrote:
> In my case, it’s a non-searchable PDF. So I assume that Tika converts
> each page to a TIFF and then OCRs it.
>
> Peter Kronenberg | Senior AI Analytic Engineer
> C: 703.887.5623
> Torch AI <http://www.torch.ai/>
> 5250 W 116th Pl, Suite 200, Leawood, KS 66211
>
> From: Tim Allison <[email protected]>
> Sent: Wednesday, January 19, 2022 7:54 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: TesseractOCRParser timeout
>
> Sorry. You’re right. I think tesseract is supposed to handle multi-page
> TIFFs on its own.
>
> On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <[email protected]> wrote:
>
> Hi Tim,
>
> I'm sure Tika does that for PDFs, but I couldn't find that logic in the
> code base for TIFFs. Could you point me to the class that does that?
>
> Luis
>
> On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]> wrote:
>
> Yes. Exactly right. Tika spawns a process per page/image.
>
> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <[email protected]> wrote:
>
> I believe that Tika just OCRs one page at a time. My guess is that it
> spawns a process for each page.
>
> From: Luís Filipe Nassif <[email protected]>
> Sent: Wednesday, January 19, 2022 11:11 AM
> To: [email protected]
> Subject: Re: TesseractOCRParser timeout
>
> Just a guess: if you are OCRing multipage TIF files, that may be the
> reason. I "think" Tika sends the whole TIF to tesseract, and that could take
> a long time if there are lots of pages, triggering timeouts. In
> our project, we send the TIF to tesseract one page at a time and restart
> the timeout counter for each page to avoid this.
>
> Best regards,
> Luís Nassif
>
> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <[email protected]> wrote:
>
> Unrelated to my previous questions. I’m getting some sort of timeout in
> Tika in TesseractOCRParser.runOCRProcess. It’s one of the errors that say
> ‘TesseractOCRParser timeout’. What exactly is it doing here? Does it
> spawn a separate process to do the OCR? We’re having some performance
> issues, so in a way, this doesn’t come as a surprise.
> Just trying to understand a little more what’s going on.
>
> private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
>     process.getOutputStream().close();
>     InputStream out = process.getInputStream();
>     InputStream err = process.getErrorStream();
>     StringBuilder outBuilder = new StringBuilder();
>     StringBuilder errBuilder = new StringBuilder();
>     Thread outThread = this.logStream(out, outBuilder);
>     Thread errThread = this.logStream(err, errBuilder);
>     outThread.start();
>     errThread.start();
>     int exitValue = -2147483648;
>
>     try {
>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>         if (!finished) {
>             throw new TikaException("TesseractOCRParser timeout");
>         }
>
>         exitValue = process.exitValue();
>     } catch (InterruptedException var12) {
>         Thread.currentThread().interrupt();
>         throw new TikaException("TesseractOCRParser interrupted", var12);
>     } catch (IllegalThreadStateException var13) {
>         throw new TikaException("TesseractOCRParser timeout");
>     }
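
The mechanism discussed in this thread (one external process per page, each bounded by its own waitFor timeout) can be sketched in plain Java as follows. This is a minimal illustration and not Tika's implementation: the class PerProcessTimeoutSketch, the methods runWithTimeout and countTimedOutPages, and the choice to discard the child's output are all assumptions made for the example. Only ProcessBuilder, Process.waitFor(long, TimeUnit) and Process.destroyForcibly() are standard JDK calls (Redirect.DISCARD needs Java 9 or later).

```java
import java.io.IOException;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names are illustrative and not taken from Tika.
public class PerProcessTimeoutSketch {

    /**
     * Runs one external command and waits at most `timeout` for it to exit.
     * Returns the exit code, or kills the process and throws TimeoutException.
     */
    public static int runWithTimeout(List<String> command, Duration timeout)
            throws IOException, InterruptedException, TimeoutException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Assumption: output is not needed here, so discard it. This keeps a
        // chatty child from blocking on a full pipe buffer.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();
        // Nothing is written to the child's stdin in this sketch.
        process.getOutputStream().close();

        boolean finished = process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS);
        if (!finished) {
            // Unlike a bare timeout exception, also make sure the child is gone.
            process.destroyForcibly();
            throw new TimeoutException("process exceeded " + timeout);
        }
        return process.exitValue();
    }

    /**
     * Runs one command per page. Each call gets a fresh timeout, so the clock
     * is effectively restarted for every page. Returns how many pages timed out.
     */
    public static int countTimedOutPages(List<List<String>> perPageCommands,
                                         Duration perPageTimeout)
            throws IOException, InterruptedException {
        int timedOut = 0;
        for (List<String> command : perPageCommands) {
            try {
                runWithTimeout(command, perPageTimeout);
            } catch (TimeoutException e) {
                timedOut++;
            }
        }
        return timedOut;
    }
}
```

Because the timeout clock starts anew for each command, a document with many pages fails only on the pages that are individually slow, which is the effect Luís describes for per-page TIF handling.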
