Tika processes the pages of a document sequentially, in a single thread per document. A 100-page PDF that requires OCR on every page will therefore spawn 100 tesseract processes, but one after another rather than all at once.
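
In case it helps to picture it, here is a rough sketch in plain Java of the "one process per page, one at a time" pattern, with a timeout on each forked process. This is not Tika's actual code: the class and method names (`SequentialOcrSketch`, `runWithTimeout`, `ocrSequentially`, `worstCaseSeconds`) are made up for illustration, and the per-page commands are whatever you pass in. The 120 in `main` is the default timeout in seconds mentioned earlier in this thread; treating it as a per-page bound is an assumption.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Tika.
class SequentialOcrSketch {

    // Upper bound on wall-clock time if every page runs until it times out.
    static long worstCaseSeconds(int pageCount, int timeoutSecondsPerPage) {
        return (long) pageCount * timeoutSecondsPerPage;
    }

    // Runs one external command and waits at most timeoutSeconds for it.
    // Returns the exit code, or throws if the process does not finish in time.
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        process.getOutputStream().close(); // the child gets no stdin
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // do not leave a runaway child behind
            throw new IOException("timeout after " + timeoutSeconds + "s: " + command);
        }
        return process.exitValue();
    }

    // One process per page, each started only after the previous one has exited,
    // so at most one child process is alive at any time.
    static void ocrSequentially(List<List<String>> perPageCommands, long timeoutSeconds)
            throws IOException, InterruptedException {
        for (List<String> command : perPageCommands) {
            runWithTimeout(command, timeoutSeconds);
        }
    }

    public static void main(String[] args) {
        // 100 pages, 120 s per page if every single page hits the timeout.
        System.out.println(worstCaseSeconds(100, 120) + " s worst case");
    }
}
```

The point of the loop in `ocrSequentially` is that concurrency within a document stays at one process; with this pattern, any parallelism would have to come from parsing several documents at the same time.
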

On Mon, Jan 24, 2022 at 3:47 PM Peter Kronenberg <[email protected]> wrote:

> Is there a way to control how many processes Tika will create at one
> time? For example, if I have a 100 page document, will it create 100
> processes, 1 for each page? If there a way to control this, perhaps a
> batch size?
>
> Peter Kronenberg | Senior AI Analytic ENGINEER
> C: 703.887.5623
> Torch AI
> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
> WWW.TORCH.AI <http://www.torch.ai/>
>
> From: Tim Allison <[email protected]>
> Sent: Thursday, January 20, 2022 2:23 PM
> To: Peter Kronenberg <[email protected]>
> Cc: [email protected]
> Subject: Re: TesseractOCRParser timeout
>
> 120 seconds
>
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java#L95
>
> On Thu, Jan 20, 2022 at 2:07 PM Peter Kronenberg <[email protected]> wrote:
>
> At this point, I think it’s mostly our problem. But still want to
> understand what Tika is doing. What is the default timeout?
>
> This is what is passed in to runOCRProcess
>
>     long timeoutMillis = TikaTaskTimeout.getTimeoutMillis(parseContext,
>             config.getTimeoutSeconds() * 1000);
>
> but I can’t quite figure out where it’s getting the default from or if
> it’s possible to override
>
> From: Tim Allison <[email protected]>
> Sent: Thursday, January 20, 2022 12:40 PM
> To: [email protected]
> Subject: Re: TesseractOCRParser timeout
>
> So, y, that's a timeout on the forked process for tesseract. I've found
> that poor quality/noisy images can take a bunch longer for tesseract to
> process.
>
> If there's anything we need to fix or make configurable, please open an
> issue.
>
> Cheers,
>
> Tim
>
> On Tue, Jan 18, 2022 at 8:51 PM Peter Kronenberg <[email protected]> wrote:
>
> Unrelated to my previous questions. I’m getting some sort of timeout in
> Tika in TesseractOCRParser.runOCRProcess. It’s one of the errors that say
> ‘TesseractOCRParser timeout’. What exactly is it doing here? Does it
> spawn a separate process to do the OCR? We’re having some performance
> issues, so in a way, this doesn’t come as a surprise. Just trying to
> understand a little more what’s going on
>
>     private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
>         process.getOutputStream().close();
>         InputStream out = process.getInputStream();
>         InputStream err = process.getErrorStream();
>         StringBuilder outBuilder = new StringBuilder();
>         StringBuilder errBuilder = new StringBuilder();
>         Thread outThread = this.logStream(out, outBuilder);
>         Thread errThread = this.logStream(err, errBuilder);
>         outThread.start();
>         errThread.start();
>         int exitValue = -2147483648;
>
>         try {
>             boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>             if (!finished) {
>                 throw new TikaException("TesseractOCRParser timeout");
>             }
>
>             exitValue = process.exitValue();
>         } catch (InterruptedException var12) {
>             Thread.currentThread().interrupt();
>             throw new TikaException("TesseractOCRParser interrupted", var12);
>         } catch (IllegalThreadStateException var13) {
>             throw new TikaException("TesseractOCRParser timeout");
>         }
