At this point, I think it’s mostly our problem, but I’d still like to understand
what Tika is doing. What is the default timeout?
This is what is passed in to runOCRProcess:
long timeoutMillis = TikaTaskTimeout.getTimeoutMillis(parseContext,
config.getTimeoutSeconds() * 1000);
But I can’t quite figure out where it’s getting the default from, or whether
it’s possible to override it.
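
For what it's worth, the call above reads like a per-request override looked up in the parse context, with the configured value as the fallback. Here is a minimal sketch of that pattern, not Tika's actual code: `Context`, `TaskTimeout`, and `resolveTimeoutMillis` are hypothetical stand-ins, and the 60-second configured value is purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class TimeoutFallbackSketch {

    /** Hypothetical stand-in for a per-parse context keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when nothing was registered for the key.
            return key.cast(entries.get(key));
        }
    }

    /** Hypothetical stand-in for a per-request timeout override. */
    static final class TaskTimeout {
        final long timeoutMillis;

        TaskTimeout(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }
    }

    /** Use the override from the context if one is present, else the default. */
    static long resolveTimeoutMillis(Context context, long defaultMillis) {
        TaskTimeout override = context.get(TaskTimeout.class);
        return override == null ? defaultMillis : override.timeoutMillis;
    }

    public static void main(String[] args) {
        int configuredSeconds = 60; // illustrative value, not a real default
        Context context = new Context();

        // No override registered: the configured value is used.
        System.out.println(resolveTimeoutMillis(context, configuredSeconds * 1000L));

        // Override registered: it takes precedence over the configured value.
        context.set(TaskTimeout.class, new TaskTimeout(5_000L));
        System.out.println(resolveTimeoutMillis(context, configuredSeconds * 1000L));
    }
}
```

If the real code works this way, the default would come from the OCR config object, and the override would be whatever the caller puts into the parse context.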
Peter Kronenberg | Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
5250 W 116th Pl, Suite 200., Leawood, KS 66211
WWW.TORCH.AI<http://www.torch.ai/>
From: Tim Allison <[email protected]>
Sent: Thursday, January 20, 2022 12:40 PM
To: [email protected]
Subject: Re: TesseractOCRParser timeout
So, yes, that's a timeout on the forked process for tesseract. I've found that
poor quality/noisy images can take a lot longer for tesseract to process.
If there's anything we need to fix or make configurable, please open an issue.
Cheers,
Tim
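
To illustrate the forked-process timeout described above, here is a minimal, self-contained sketch of the general pattern: start a child process, wait a bounded time, and fail if it has not exited. This is not Tika's implementation. `runWithTimeout` is a hypothetical helper, and `java -version` merely stands in for the real tesseract command line.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class ForkedProcessTimeoutSketch {

    /**
     * Starts the child process and waits up to timeoutMillis for it to exit.
     * Returns the exit code, or throws IOException if the wait times out.
     */
    static int runWithTimeout(ProcessBuilder builder, long timeoutMillis)
            throws IOException, InterruptedException {
        Process process = builder
                .redirectErrorStream(true)                       // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                .start();
        process.getOutputStream().close(); // the child gets no stdin

        boolean finished = process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
        if (!finished) {
            process.destroyForcibly(); // do not leave the child running
            throw new IOException("child process timed out after " + timeoutMillis + " ms");
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in command: the JVM that is running this sketch.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        int exitCode = runWithTimeout(new ProcessBuilder(javaBin, "-version"), 30_000L);
        System.out.println("exit code: " + exitCode);
    }
}
```

A slow or noisy image would show up here as `waitFor` returning false, which is the point where the parser reports its timeout.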
On Tue, Jan 18, 2022 at 8:51 PM Peter Kronenberg
<[email protected]> wrote:
Unrelated to my previous questions. I’m getting some sort of timeout in Tika
in TesseractOCRParser.runOCRProcess. It’s one of the errors that say
‘TesseractOCRParser timeout’. What exactly is it doing here? Does it spawn a
separate process to do the OCR? We’re having some performance issues, so in a
way, this doesn’t come as a surprise. Just trying to understand a little more
about what’s going on:
private void runOCRProcess(Process process, int timeout) throws IOException,
        TikaException {
    process.getOutputStream().close();
    InputStream out = process.getInputStream();
    InputStream err = process.getErrorStream();
    StringBuilder outBuilder = new StringBuilder();
    StringBuilder errBuilder = new StringBuilder();
    Thread outThread = this.logStream(out, outBuilder);
    Thread errThread = this.logStream(err, errBuilder);
    outThread.start();
    errThread.start();
    int exitValue = -2147483648;
    try {
        boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
        if (!finished) {
            throw new TikaException("TesseractOCRParser timeout");
        }
        exitValue = process.exitValue();
    } catch (InterruptedException var12) {
        Thread.currentThread().interrupt();
        throw new TikaException("TesseractOCRParser interrupted", var12);
    } catch (IllegalThreadStateException var13) {
        throw new TikaException("TesseractOCRParser timeout");
    }