Tika processes the pages of a document sequentially, in a single thread per
document.  A one-hundred-page PDF that requires OCR on each page will spawn
100 tesseract processes, one after another, not 100 at once.
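
Put differently: since each document has at most one OCR process alive at a
time, the number of simultaneous processes should be bounded by how many
documents you parse concurrently. Here is a minimal sketch of that idea,
assuming you drive the parses from your own fixed-size thread pool. Note that
parseDocument, parseAll and the sleep are hypothetical stand-ins for
illustration, not Tika API; in real code parseDocument would call your parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedOcrSketch {

    // Counts simulated OCR "processes" alive right now, and the highest count seen.
    static final AtomicInteger running = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Hypothetical stand-in for parsing one document: pages are handled one
    // after another, so a document never has more than one OCR process alive.
    static int parseDocument(int pages) throws InterruptedException {
        int processed = 0;
        for (int page = 0; page < pages; page++) {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.sleep(2); // placeholder for the per-page tesseract call
            running.decrementAndGet();
            processed++;
        }
        return processed;
    }

    // Parses all documents with at most maxConcurrentDocs in flight and
    // returns the total number of pages processed.
    static int parseAll(int[] pageCounts, int maxConcurrentDocs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentDocs);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int pages : pageCounts) {
                results.add(pool.submit(() -> parseDocument(pages)));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // rethrows any failure from the worker
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int total = parseAll(new int[] {100, 20, 5, 40}, 2);
        System.out.println("pages=" + total + " peak concurrent OCR=" + peak.get());
    }
}
```

With a pool of size 2, the peak counter should never exceed 2 no matter how
many pages each document has, which is the "batch size" effect you are after.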

On Mon, Jan 24, 2022 at 3:47 PM Peter Kronenberg <[email protected]>
wrote:

> Is there a way to control how many processes Tika will create at one
> time?  For example, if I have a 100 page document, will it create 100
> processes, 1 for each page?  Is there a way to control this, perhaps with
> a batch size?
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623 *
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
> WWW.TORCH.AI <http://www.torch.ai/>
>
>
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Thursday, January 20, 2022 2:23 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> 120 seconds
>
>
>
>
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java#L95
>
>
>
> On Thu, Jan 20, 2022 at 2:07 PM Peter Kronenberg <
> [email protected]> wrote:
>
> At this point, I think it’s mostly our problem.   But still want to
> understand what Tika is doing.  What is the default timeout?
>
>
>
> This is what is passed in to runOCRProcess
>
> long timeoutMillis = TikaTaskTimeout.getTimeoutMillis(parseContext,
>         config.getTimeoutSeconds() * 1000);
>
>
>
> but I can’t quite figure out where it’s getting the default from or if
> it’s possible to override
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Thursday, January 20, 2022 12:40 PM
> *To:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> So, yes, that's a timeout on the forked tesseract process. I've found
> that poor-quality or noisy images can take much longer for tesseract to
> process.
>
>
>
> If there's anything we need to fix or make configurable, please open an
> issue.
>
>
>
> Cheers,
>
>
>
>          Tim
>
>
>
> On Tue, Jan 18, 2022 at 8:51 PM Peter Kronenberg <
> [email protected]> wrote:
>
> Unrelated to my previous questions.  I’m getting some sort of timeout in
> Tika in TesseractOCRParser.runOCRProcess.  It’s one of the errors that say
> ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
> spawn a separate process to do the OCR?  We’re having some performance
> issues, so in a way, this doesn’t come as a surprise.  Just trying to
> understand a little more what’s going on
>
>
>
> private void runOCRProcess(Process process, int timeout)
>         throws IOException, TikaException {
>     process.getOutputStream().close();
>     InputStream out = process.getInputStream();
>     InputStream err = process.getErrorStream();
>     StringBuilder outBuilder = new StringBuilder();
>     StringBuilder errBuilder = new StringBuilder();
>     Thread outThread = this.logStream(out, outBuilder);
>     Thread errThread = this.logStream(err, errBuilder);
>     outThread.start();
>     errThread.start();
>     int exitValue = -2147483648;
>
>     try {
>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>         if (!finished) {
>             throw new TikaException("TesseractOCRParser timeout");
>         }
>
>         exitValue = process.exitValue();
>     } catch (InterruptedException var12) {
>         Thread.currentThread().interrupt();
>         throw new TikaException("TesseractOCRParser interrupted", var12);
>     } catch (IllegalThreadStateException var13) {
>         throw new TikaException("TesseractOCRParser timeout");
>     }
>
