OK, so just to confirm: it spawns a new process for each page, but not until the previous one finishes?
Peter Kronenberg | Senior AI Analytic ENGINEER
C: 703.887.5623
Torch AI <http://www.torch.ai/>
5250 W 116th Pl, Suite 200, Leawood, KS 66211
WWW.TORCH.AI <http://www.torch.ai/>

From: Tim Allison <[email protected]>
Sent: Monday, January 24, 2022 4:18 PM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: TesseractOCRParser timeout

Tika processes each page sequentially in a single thread per document. A one hundred page PDF that requires OCR on each page will take 100 processes, sequentially.

On Mon, Jan 24, 2022 at 3:47 PM Peter Kronenberg <[email protected]> wrote:

Is there a way to control how many processes Tika will create at one time? For example, if I have a 100-page document, will it create 100 processes, one for each page? Is there a way to control this, perhaps a batch size?

From: Tim Allison <[email protected]>
Sent: Thursday, January 20, 2022 2:23 PM
To: Peter Kronenberg <[email protected]>
Cc: [email protected]
Subject: Re: TesseractOCRParser timeout

120 seconds
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java#L95

On Thu, Jan 20, 2022 at 2:07 PM Peter Kronenberg <[email protected]> wrote:

At this point, I think it’s mostly our problem, but I still want to understand what Tika is doing. What is the default timeout? This is what is passed in to runOCRProcess:

    long timeoutMillis = TikaTaskTimeout.getTimeoutMillis(parseContext, config.getTimeoutSeconds() * 1000);

but I can’t quite figure out where it’s getting the default from, or whether it’s possible to override it.

From: Tim Allison <[email protected]>
Sent: Thursday, January 20, 2022 12:40 PM
To: [email protected]
Subject: Re: TesseractOCRParser timeout

So, yes, that's a timeout on the forked process for tesseract.
I've found that poor-quality/noisy images can take a lot longer for tesseract to process. If there's anything we need to fix or make configurable, please open an issue.

Cheers,
Tim

On Tue, Jan 18, 2022 at 8:51 PM Peter Kronenberg <[email protected]> wrote:

Unrelated to my previous questions: I’m getting some sort of timeout in Tika in TesseractOCRParser.runOCRProcess. It’s one of the errors that say ‘TesseractOCRParser timeout’. What exactly is it doing here? Does it spawn a separate process to do the OCR? We’re having some performance issues, so in a way, this doesn’t come as a surprise. I’m just trying to understand a little more about what’s going on.

    private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
        // Nothing is written to the child process, so close its stdin right away.
        process.getOutputStream().close();
        InputStream out = process.getInputStream();
        InputStream err = process.getErrorStream();
        StringBuilder outBuilder = new StringBuilder();
        StringBuilder errBuilder = new StringBuilder();
        // Drain stdout and stderr on separate threads.
        Thread outThread = this.logStream(out, outBuilder);
        Thread errThread = this.logStream(err, errBuilder);
        outThread.start();
        errThread.start();
        int exitValue = Integer.MIN_VALUE;
        try {
            // Wait up to `timeout` seconds for the forked process to finish.
            boolean finished = process.waitFor(timeout, TimeUnit.SECONDS);
            if (!finished) {
                throw new TikaException("TesseractOCRParser timeout");
            }
            exitValue = process.exitValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new TikaException("TesseractOCRParser interrupted", e);
        } catch (IllegalThreadStateException e) {
            throw new TikaException("TesseractOCRParser timeout");
        }
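
As a rough illustration of the behavior described in this thread (one forked process per page, run one after another, each bounded by a timeout), here is a minimal Java sketch. It is not Tika's code: `ForkWithTimeout`, `runWithTimeout`, and `processPagesSequentially` are hypothetical names, and `java -version` merely stands in for the tesseract command so the example runs without tesseract installed.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ForkWithTimeout {

    // Hypothetical helper (not part of Tika): fork one external command and
    // wait at most timeoutMillis for it to exit. Returns the exit code.
    static int runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException, TimeoutException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so the child can never
        // block on a full pipe while we wait for it.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        try {
            if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("process timed out after " + timeoutMillis + " ms");
            }
            return process.exitValue();
        } finally {
            // No effect if the process already exited; kills it after a timeout.
            process.destroyForcibly();
        }
    }

    // One child process per "page", strictly one after another: the next
    // process is not started until the previous one has finished.
    static int processPagesSequentially(int pageCount, List<String> commandPerPage, long timeoutMillis)
            throws IOException, InterruptedException, TimeoutException {
        int completed = 0;
        for (int page = 0; page < pageCount; page++) {
            runWithTimeout(commandPerPage, timeoutMillis);
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JVM's own launcher is used as a stand-in command.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<String> standIn = List.of(javaBin, "-version");
        int done = processPagesSequentially(3, standIn, 30_000);
        System.out.println("pages processed: " + done);
    }
}
```

In this sketch a slow page delays every page after it, and a single timeout aborts the whole loop, which is one way a noisy scan can surface as a timeout error for the entire document.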
