Hi Tim, I'm sure Tika does that for PDFs, but I couldn't find that logic in the code base for TIFFs. Could you point me to the class that does that?
Luis

On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]> wrote:

> Yes. Exactly right. Tika spawns a process per page/image.
>
> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg
> <[email protected]> wrote:
>
>> I believe that Tika just OCR's one page at a time. My guess is that it
>> spawns a process for each page.
>>
>> Peter Kronenberg | Senior AI Analytic ENGINEER
>> C: 703.887.5623
>> Torch AI <http://www.torch.ai/>
>> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
>>
>> From: Luís Filipe Nassif <[email protected]>
>> Sent: Wednesday, January 19, 2022 11:11 AM
>> To: [email protected]
>> Subject: Re: TesseractOCRParser timeout
>>
>> Just a guess, if you are OCRing multipage TIF files, that may be the
>> reason. I "think" Tika sends the whole TIF to tesseract and that could
>> take a large amount of time if there are lots of pages, triggering
>> timeouts. In our project, we send one TIF page at a time to tesseract
>> and restart the timeout counter to avoid this.
>>
>> Best regards,
>> Luís Nassif
>>
>> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg
>> <[email protected]> wrote:
>>
>> Unrelated to my previous questions. I'm getting some sort of timeout in
>> Tika in TesseractOCRParser.runOCRProcess. It's one of the errors that
>> say 'TesseractOCRParser timeout'. What exactly is it doing here? Does
>> it spawn a separate process to do the OCR? We're having some
>> performance issues, so in a way, this doesn't come as a surprise. Just
>> trying to understand a little more what's going on.
>>
>>     private void runOCRProcess(Process process, int timeout)
>>             throws IOException, TikaException {
>>         process.getOutputStream().close();
>>         InputStream out = process.getInputStream();
>>         InputStream err = process.getErrorStream();
>>         StringBuilder outBuilder = new StringBuilder();
>>         StringBuilder errBuilder = new StringBuilder();
>>         Thread outThread = this.logStream(out, outBuilder);
>>         Thread errThread = this.logStream(err, errBuilder);
>>         outThread.start();
>>         errThread.start();
>>         int exitValue = -2147483648;
>>
>>         try {
>>             boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>>             if (!finished) {
>>                 throw new TikaException("TesseractOCRParser timeout");
>>             }
>>
>>             exitValue = process.exitValue();
>>         } catch (InterruptedException var12) {
>>             Thread.currentThread().interrupt();
>>             throw new TikaException("TesseractOCRParser interrupted", var12);
>>         } catch (IllegalThreadStateException var13) {
>>             throw new TikaException("TesseractOCRParser timeout");
>>         }
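
For reference, the per-page approach Luís describes (one tesseract run per TIF page, each with a fresh timeout) could be sketched roughly as below. This is only an illustration under stated assumptions, not Tika's code: the class PerPageOcrSketch and all of its method names are hypothetical, it assumes a `tesseract` binary on the PATH invoked as `tesseract <image> <outputbase>`, and it relies on the TIFF reader built into javax.imageio (Java 9 or later).

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper for illustration; not part of Tika.
final class PerPageOcrSketch {

    // Builds "tesseract <image> <outputbase>"; tesseract appends ".txt" to the output base.
    static List<String> tesseractCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Writes every page of a (possibly multi-page) TIFF to its own PNG file in workDir.
    static List<Path> splitTiff(Path tiff, Path workDir) throws IOException {
        List<Path> pages = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(tiff.toFile())) {
            if (in == null) {
                throw new IOException("Cannot open " + tiff);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No image reader available for " + tiff);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                int pageCount = reader.getNumImages(true);
                for (int i = 0; i < pageCount; i++) {
                    BufferedImage page = reader.read(i);
                    Path png = workDir.resolve("page-" + i + ".png");
                    if (!ImageIO.write(page, "png", png.toFile())) {
                        throw new IOException("Could not write " + png);
                    }
                    pages.add(png);
                }
            } finally {
                reader.dispose();
            }
        }
        return pages;
    }

    // Runs one command and waits at most timeoutSeconds; kills the process on timeout.
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException, TimeoutException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new TimeoutException("Timed out after " + timeoutSeconds + "s: " + command);
        }
        return process.exitValue();
    }

    // OCRs a TIFF page by page, so the timeout applies to each page rather than the whole file.
    static void ocrTiff(Path tiff, Path workDir, long perPageTimeoutSeconds)
            throws IOException, InterruptedException, TimeoutException {
        for (Path page : splitTiff(tiff, workDir)) {
            String baseName = page.getFileName().toString().replace(".png", "");
            int exit = runWithTimeout(
                    tesseractCommand(page, workDir.resolve(baseName)), perPageTimeoutSeconds);
            if (exit != 0) {
                throw new IOException("tesseract exited with " + exit + " for " + page);
            }
        }
    }
}
```

With this shape, a call such as PerPageOcrSketch.ocrTiff(tiff, workDir, 120) gives every page its own 120-second budget, so a long document does not hit a single whole-file timeout.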
