Is there a way to control how many processes Tika will create at one time?  For 
example, if I have a 100 page document, will it create 100 processes, 1 for 
each page?  Is there a way to control this, perhaps with a batch size?

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
5250 W 116th Pl, Suite 200., Leawood, KS 66211
WWW.TORCH.AI<http://www.torch.ai/>


From: Tim Allison <[email protected]>
Sent: Thursday, January 20, 2022 2:23 PM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: TesseractOCRParser timeout


120 seconds

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java#L95

On Thu, Jan 20, 2022 at 2:07 PM Peter Kronenberg <[email protected]> wrote:
At this point, I think it’s mostly our problem, but I still want to understand
what Tika is doing.  What is the default timeout?

This is what is passed in to runOCRProcess
long timeoutMillis = TikaTaskTimeout.getTimeoutMillis(parseContext,
        config.getTimeoutSeconds() * 1000);

but I can’t quite figure out where it’s getting the default from or whether
it’s possible to override it.
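
Editorial sketch: the call quoted above passes a default (the configured seconds converted to milliseconds) together with the parse context, which suggests a "per-call override, else configured default" lookup. The code below illustrates that pattern only. `Context`, `TaskTimeout`, and `resolveTimeoutMillis` are hypothetical stand-ins written for this example, not Tika's classes, and the actual behaviour of `TikaTaskTimeout.getTimeoutMillis` should be checked against the Tika source.

```java
import java.util.HashMap;
import java.util.Map;

public class TimeoutResolution {

    /** Hypothetical stand-in for a per-parse context keyed by class (not Tika's ParseContext). */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Hypothetical stand-in for a per-call timeout override, in milliseconds. */
    static final class TaskTimeout {
        final long timeoutMillis;

        TaskTimeout(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }
    }

    /** Returns the override stored in the context if present, otherwise the supplied default. */
    static long resolveTimeoutMillis(Context context, long defaultTimeoutMillis) {
        TaskTimeout override = context.get(TaskTimeout.class);
        return override == null ? defaultTimeoutMillis : override.timeoutMillis;
    }

    public static void main(String[] args) {
        int configuredSeconds = 120;                  // the default quoted elsewhere in this thread
        long defaultMillis = configuredSeconds * 1000L;

        Context context = new Context();
        System.out.println(resolveTimeoutMillis(context, defaultMillis)); // prints 120000

        context.set(TaskTimeout.class, new TaskTimeout(300_000L));
        System.out.println(resolveTimeoutMillis(context, defaultMillis)); // prints 300000
    }
}
```

Under this reading, the default would come from the OCR config and could be changed either there or per call through the context.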



From: Tim Allison <[email protected]>
Sent: Thursday, January 20, 2022 12:40 PM
To: [email protected]
Subject: Re: TesseractOCRParser timeout

So, yes, that's a timeout on the forked process for tesseract. I've found that 
poor quality/noisy images can take a bunch longer for tesseract to process.
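
Editorial sketch: the "timeout on the forked process" pattern described here can be shown with the JDK alone. The example starts a child process, waits up to a fixed number of seconds with `Process.waitFor(long, TimeUnit)`, and treats a `false` return as a timeout. The class name `ForkWithTimeout`, the method `runWithTimeout`, and the choice to kill the child with `destroyForcibly()` are assumptions made for this illustration, not Tika's implementation.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ForkWithTimeout {

    /**
     * Runs the command in a child process and returns its exit code.
     * Throws IOException if the child has not finished within timeoutSeconds.
     */
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both so the child cannot block on a full pipe.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        process.getOutputStream().close(); // nothing is written to the child's stdin

        boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        if (!finished) {
            // Assumption for this sketch: kill the child so it does not keep running.
            process.destroyForcibly();
            throw new IOException("child process timed out after " + timeoutSeconds + " seconds");
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Use the current JVM's own executable as a harmless child process.
        String javaBin = ProcessHandle.current().info().command().orElse("java");
        int exitCode = runWithTimeout(List.of(javaBin, "-version"), 30);
        System.out.println("exit code: " + exitCode);
    }
}
```

A slow input (for example a noisy image handed to an OCR binary) simply makes `waitFor` return `false` once the limit is reached, which is the point at which a timeout error would be raised.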

If there's anything we need to fix or make configurable, please open an issue.

Cheers,

         Tim

On Tue, Jan 18, 2022 at 8:51 PM Peter Kronenberg <[email protected]> wrote:
Unrelated to my previous questions.  I’m getting some sort of timeout in Tika 
in TesseractOCRParser.runOCRProcess.  It’s one of the errors that say 
‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it spawn a 
separate process to do the OCR?  We’re having some performance issues, so in a 
way, this doesn’t come as a surprise.  I’m just trying to understand a little 
more about what’s going on.

private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
    process.getOutputStream().close();
    InputStream out = process.getInputStream();
    InputStream err = process.getErrorStream();
    StringBuilder outBuilder = new StringBuilder();
    StringBuilder errBuilder = new StringBuilder();
    Thread outThread = this.logStream(out, outBuilder);
    Thread errThread = this.logStream(err, errBuilder);
    outThread.start();
    errThread.start();
    int exitValue = Integer.MIN_VALUE;

    try {
        boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
        if (!finished) {
            throw new TikaException("TesseractOCRParser timeout");
        }

        exitValue = process.exitValue();
    } catch (InterruptedException var12) {
        Thread.currentThread().interrupt();
        throw new TikaException("TesseractOCRParser interrupted", var12);
    } catch (IllegalThreadStateException var13) {
        throw new TikaException("TesseractOCRParser timeout");
    }





