Just a guess: if you are OCRing multipage TIFF files, that may be the
reason. I think Tika sends the whole TIFF to tesseract in a single call,
which can take a long time when there are many pages and so trigger the
timeout. In our project, we send the TIFF to tesseract one page at a time
and restart the timeout counter for each page to avoid this.
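
A rough sketch of the per-page idea, in plain Java. This is not the project's actual code and not Tika's API: the class PerPageOcr and its methods are hypothetical names, and it assumes a tesseract executable is installed and that Java 9 or later is used (for the built-in TIFF reader). Each page is written to its own image file and handed to a separate tesseract process, so every page gets a fresh timeout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper for illustration; not part of Tika.
class PerPageOcr {

    // Splits a multipage TIFF into one PNG file per page.
    static List<Path> splitPages(Path tiff, Path outDir) throws IOException {
        List<Path> pages = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(tiff.toFile())) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No image reader found for " + tiff);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                int pageCount = reader.getNumImages(true);
                for (int i = 0; i < pageCount; i++) {
                    Path page = outDir.resolve("page-" + i + ".png");
                    if (!ImageIO.write(reader.read(i), "png", page.toFile())) {
                        throw new IOException("Could not write page " + i + " as PNG");
                    }
                    pages.add(page);
                }
            } finally {
                reader.dispose();
            }
        }
        return pages;
    }

    // Command line for one page; the output base "stdout" makes tesseract print the text.
    static List<String> buildCommand(String tesseractPath, Path pageImage) {
        return List.of(tesseractPath, pageImage.toString(), "stdout");
    }

    // Runs one external process and waits at most timeoutSeconds for it to finish.
    static String runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Output goes to a temp file so a full pipe buffer cannot block the child.
        Path out = Files.createTempFile("ocr-out", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(out.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly().waitFor(); // do not leave the child running
                throw new IOException("Timeout after " + timeoutSeconds + " s: " + command);
            }
            return Files.readString(out);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    // OCRs the pages one by one; the timeout applies to each page separately.
    static List<String> ocrPages(String tesseractPath, List<Path> pages, long perPageTimeoutSeconds)
            throws IOException, InterruptedException {
        List<String> texts = new ArrayList<>();
        for (Path page : pages) {
            texts.add(runWithTimeout(buildCommand(tesseractPath, page), perPageTimeoutSeconds));
        }
        return texts;
    }
}
```

A call could look like PerPageOcr.ocrPages("tesseract", PerPageOcr.splitPages(tiff, tmpDir), 120), where tiff and tmpDir are your own paths and 120 seconds is an arbitrary per-page limit. With this layout a document with many pages can only time out if a single page is too slow, not because the total adds up.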

Best regards,
Luís Nassif

On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <
[email protected]> wrote:

> Unrelated to my previous questions.  I’m getting some sort of timeout in
> Tika in TesseractOCRParser.runOCRProcess.  It’s one of the errors that say
> ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
> spawn a separate process to do the OCR?  We’re having some performance
> issues, so in a way, this doesn’t come as a surprise.  Just trying to
> understand a little more about what’s going on.
>
>
>
> private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
>     process.getOutputStream().close();
>     InputStream out = process.getInputStream();
>     InputStream err = process.getErrorStream();
>     StringBuilder outBuilder = new StringBuilder();
>     StringBuilder errBuilder = new StringBuilder();
>     Thread outThread = this.logStream(out, outBuilder);
>     Thread errThread = this.logStream(err, errBuilder);
>     outThread.start();
>     errThread.start();
>     int exitValue = -2147483648;
>
>     try {
>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>         if (!finished) {
>             throw new TikaException("TesseractOCRParser timeout");
>         }
>
>         exitValue = process.exitValue();
>     } catch (InterruptedException var12) {
>         Thread.currentThread().interrupt();
>         throw new TikaException("TesseractOCRParser interrupted", var12);
>     } catch (IllegalThreadStateException var13) {
>         throw new TikaException("TesseractOCRParser timeout");
>     }
>
> Peter Kronenberg | Senior AI Analytic Engineer
