Yes, Tesseract does. But if a 100-page TIFF is sent directly to Tesseract, it could take minutes and trigger a timeout. The default OCR timeout is 120 s, and in my experience Tesseract takes 3-4 s per page on average, so a file of that size would need roughly 300-400 s.
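To make the arithmetic and the per-page workaround concrete, here is a minimal sketch in Java. It assumes the multi-page TIFF has already been split into one image file per page (not shown) and that `tesseract` is on the PATH. The class and method names (`PerPageOcr`, `estimateSeconds`, `commandFor`, `ocrPages`) are hypothetical and are not part of Tika's API.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not Tika code: one tesseract process per page,
// each with its own timeout, instead of one timeout for the whole file.
final class PerPageOcr {

    // Rough wall-clock estimate for OCRing the whole file in a single call.
    static long estimateSeconds(int pages, double secondsPerPage) {
        return Math.round(pages * secondsPerPage);
    }

    // Command line for one page image; the output base "stdout" makes
    // tesseract print the recognized text to standard output.
    static List<String> commandFor(Path pageImage) {
        return List.of("tesseract", pageImage.toString(), "stdout");
    }

    // Runs the pages one after another. The timeout applies to each page
    // separately, so the counter effectively restarts for every page.
    static void ocrPages(List<Path> pageImages, long timeoutSecondsPerPage)
            throws IOException, InterruptedException {
        for (Path page : pageImages) {
            // inheritIO avoids blocking on a full, unread output pipe.
            Process process = new ProcessBuilder(commandFor(page)).inheritIO().start();
            if (!process.waitFor(timeoutSecondsPerPage, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // do not leave a stuck process behind
                throw new IOException("OCR timeout on " + page);
            }
        }
    }
}
```

With 100 pages at 3-4 s each, a single call needs about 300-400 s, while each per-page call stays far below the 120 s limit.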

On Wed, Jan 19, 2022 at 21:54, Tim Allison <[email protected]> wrote:

> Sorry. You’re right. I think tesseract is supposed to handle multi page
> tiffs on its own.
>
> On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <[email protected]>
> wrote:
>
>> Hi Tim,
>>
>> I'm sure Tika does that for PDFs, but I couldn't find that logic in the
>> code base for TIFFs. Could you point to me what class does that?
>>
>> Luis
>>
>> On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]> wrote:
>>
>>> Yes. Exactly right. Tika spawns a process per page/image.
>>>
>>> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg
>>> <[email protected]> wrote:
>>>
>>>> I believe that Tika just OCR’s one page at a time. My guess is that it
>>>> spawns a process for each page.
>>>>
>>>> Peter Kronenberg
>>>>
>>>> From: Luís Filipe Nassif <[email protected]>
>>>> Sent: Wednesday, January 19, 2022 11:11 AM
>>>> To: [email protected]
>>>> Subject: Re: TesseractOCRParser timeout
>>>>
>>>> Just a guess, if you are OCRing multipage TIF files, that may be the
>>>> reason, I "think" Tika sends the whole TIF to tesseract and that could take
>>>> a large amount of time if there are lots of pages, triggering timeouts. In
>>>> our project, we send each TIF page at a time to tesseract and restart the
>>>> timeout counter to avoid this.
>>>>
>>>> Best regards,
>>>>
>>>> Luís Nassif
>>>>
>>>> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg
>>>> <[email protected]> wrote:
>>>>
>>>> Unrelated to my previous questions. I’m getting some sort of timeout
>>>> in Tika in TesseractOCRParser.runOCRProcess. It’s one of the errors that
>>>> say ‘TesseractOCRParser timeout’. What exactly is it doing here? Does it
>>>> spawn a separate process to do the OCR? We’re having some performance
>>>> issues, so in a way, this doesn’t come as a surprise. Just trying to
>>>> understand a little more what’s going on
>>>>
>>>> private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
>>>>     process.getOutputStream().close();
>>>>     InputStream out = process.getInputStream();
>>>>     InputStream err = process.getErrorStream();
>>>>     StringBuilder outBuilder = new StringBuilder();
>>>>     StringBuilder errBuilder = new StringBuilder();
>>>>     Thread outThread = this.logStream(out, outBuilder);
>>>>     Thread errThread = this.logStream(err, errBuilder);
>>>>     outThread.start();
>>>>     errThread.start();
>>>>     int exitValue = -2147483648;
>>>>
>>>>     try {
>>>>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>>>>         if (!finished) {
>>>>             throw new TikaException("TesseractOCRParser timeout");
>>>>         }
>>>>
>>>>         exitValue = process.exitValue();
>>>>     } catch (InterruptedException var12) {
>>>>         Thread.currentThread().interrupt();
>>>>         throw new TikaException("TesseractOCRParser interrupted", var12);
>>>>     } catch (IllegalThreadStateException var13) {
>>>>         throw new TikaException("TesseractOCRParser timeout");
>>>>     }
>>>>
>>>> Peter Kronenberg
