For a given file, it spawns one process per image/page, but it does not
start the next process until the previous one has completed or been destroyed.

Because operating systems vary in how quickly they kill processes
and clean up their resources, I believe that in practice there may be
a very short window in which the previous process is still
present (not yet cleaned up by the OS) when the next one starts.
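
To make the sequencing concrete, here is a rough sketch in plain Java of the
pattern described above: one process at a time, each with a timeout, and the
next one not started until the previous one has exited or been destroyed. This
is not Tika's code. The class name, method name, and TIMED_OUT sentinel are
made up for illustration, and it assumes Java 9+ for Redirect.DISCARD.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: runs external commands strictly one after another.
final class SequentialProcessRunner {

    /** Sentinel recorded when a process had to be killed after its timeout (illustrative choice). */
    static final int TIMED_OUT = Integer.MIN_VALUE;

    /**
     * Runs each command in order and returns one exit code per command.
     * The next command is not started until the previous process has
     * finished or has been destroyed after the timeout.
     */
    static List<Integer> runSequentially(List<List<String>> commands, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<Integer> exitCodes = new ArrayList<>();
        for (List<String> command : commands) {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            process.getOutputStream().close();                       // the child gets no stdin

            if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                exitCodes.add(process.exitValue());
            } else {
                // Timed out: kill the process, then wait for it to terminate
                // so that it does not overlap with the next one.
                process.destroyForcibly();
                process.waitFor();
                exitCodes.add(TIMED_OUT);
            }
        }
        return exitCodes;
    }
}
```

The waitFor() after destroyForcibly() is what narrows the window mentioned
above. It only waits for the process itself; any grandchild processes it
started may still linger.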

On Mon, Jan 24, 2022 at 4:47 PM Peter Kronenberg <[email protected]>
wrote:

> Ok, so just to confirm, it spawns a new thread, but not until the previous
> thread finishes?
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623 *
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
> WWW.TORCH.AI <http://www.torch.ai/>
>
>
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, January 24, 2022 4:18 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> Tika processes each page sequentially in a single thread per document.  A
> one hundred page PDF that requires OCR on each page will take 100
> processes, sequentially.
>
>
>
> On Mon, Jan 24, 2022 at 3:47 PM Peter Kronenberg <
> [email protected]> wrote:
>
> Is there a way to control how many processes Tika will create at one
> time?  For example, if I have a 100 page document, will it create 100
> processes, 1 for each page?  Is there a way to control this, perhaps a
> batch size?
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Thursday, January 20, 2022 2:23 PM
> *To:* Peter Kronenberg <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> 120 seconds
>
>
>
>
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java#L95
>
>
>
> On Thu, Jan 20, 2022 at 2:07 PM Peter Kronenberg <
> [email protected]> wrote:
>
> At this point, I think it’s mostly our problem.   But still want to
> understand what Tika is doing.  What is the default timeout?
>
>
>
> This is what is passed in to runOCRProcess
>
> long timeoutMillis = TikaTaskTimeout.getTimeoutMillis(parseContext,
>         config.getTimeoutSeconds() * 1000);
>
>
>
> but I can’t quite figure out where it’s getting the default from or if
> it’s possible to override
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Thursday, January 20, 2022 12:40 PM
> *To:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> So, yes, that's a timeout on the forked process for tesseract. I've found
> that poor quality/noisy images can take a bunch longer for tesseract to
> process.
>
>
>
> If there's anything we need to fix or make configurable, please open an
> issue.
>
>
>
> Cheers,
>
>
>
>          Tim
>
>
>
> On Tue, Jan 18, 2022 at 8:51 PM Peter Kronenberg <
> [email protected]> wrote:
>
> Unrelated to my previous questions.  I’m getting some sort of timeout in
> Tika in TesseractOCRParser.runOCRProcess.  It’s one of the errors that say
> ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
> spawn a separate process to do the OCR?  We’re having some performance
> issues, so in a way, this doesn’t come as a surprise.  Just trying to
> understand a little more what’s going on
>
>
>
> private void runOCRProcess(Process process, int timeout) throws
> IOException, TikaException {
>     process.getOutputStream().close();
>     InputStream out = process.getInputStream();
>     InputStream err = process.getErrorStream();
>     StringBuilder outBuilder = new StringBuilder();
>     StringBuilder errBuilder = new StringBuilder();
>     Thread outThread = this.logStream(out, outBuilder);
>     Thread errThread = this.logStream(err, errBuilder);
>     outThread.start();
>     errThread.start();
>     int exitValue = -2147483648;
>
>     try {
>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>         if (!finished) {
>             throw new TikaException("TesseractOCRParser timeout");
>         }
>
>         exitValue = process.exitValue();
>     } catch (InterruptedException var12) {
>         Thread.currentThread().interrupt();
>         throw new TikaException("TesseractOCRParser interrupted", var12);
>     } catch (IllegalThreadStateException var13) {
>         throw new TikaException("TesseractOCRParser timeout");
>     }
>
>
