Hmm, so forget my thoughts about timeouts with TIFFs :-)

On Thu, Jan 20, 2022 at 00:08, Peter Kronenberg <[email protected]>
wrote:

> In my case, it’s a non-searchable PDF, so I assume that Tika converts
> each page to a TIFF and then OCRs it.
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623 *
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
> WWW.TORCH.AI <http://www.torch.ai/>
>
>
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Wednesday, January 19, 2022 7:54 PM
> *To:* [email protected]
> *Cc:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> Sorry. You’re right. I think tesseract is supposed to handle multi-page
> TIFFs on its own.
>
>
>
> On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <[email protected]>
> wrote:
>
> Hi Tim,
>
>
>
> I'm sure Tika does that for PDFs, but I couldn't find that logic in the
> code base for TIFFs. Could you point me to the class that does that?
>
>
>
> Luis
>
>
>
> On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]>
> wrote:
>
> Yes.  Exactly right.  Tika spawns a process per page/image.
>
>
>
> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <
> [email protected]> wrote:
>
> I believe that Tika just OCRs one page at a time.  My guess is that it
> spawns a process for each page.
>
>
> *From:* Luís Filipe Nassif <[email protected]>
> *Sent:* Wednesday, January 19, 2022 11:11 AM
> *To:* [email protected]
> *Subject:* Re: TesseractOCRParser timeout
>
>
>
> Just a guess: if you are OCRing multi-page TIFF files, that may be the
> reason. I "think" Tika sends the whole TIFF to tesseract, which could take
> a long time if there are lots of pages and so trigger timeouts. In our
> project, we send one TIFF page at a time to tesseract and restart the
> timeout counter for each page to avoid this.
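
A minimal sketch of the per-page idea described above, assuming each page's OCR call is wrapped in a Callable. The class and method names (PerPageTimeout, ocrPages) are hypothetical and are not part of Tika or of any project mentioned in this thread. Each page gets a fresh timeout budget, so a document with many pages does not hit a single global limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not Tika code, only an illustration of per-page timeouts.
final class PerPageTimeout {

    /**
     * Runs one task per page, in order, giving each page its own timeout.
     * Returns the text for each page, or null where that page timed out.
     */
    static List<String> ocrPages(List<Callable<String>> pageTasks, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newCachedThreadPool();
        List<String> results = new ArrayList<>();
        try {
            for (Callable<String> task : pageTasks) {
                Future<String> future = executor.submit(task);
                try {
                    // The timeout counter restarts here for every page.
                    results.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (TimeoutException e) {
                    // Interrupt the slow page and carry on with the next one.
                    future.cancel(true);
                    results.add(null);
                }
            }
        } finally {
            executor.shutdownNow();
        }
        return results;
    }
}
```

In real use, each Callable would run tesseract on a single extracted page. A page that exceeds its budget yields null here instead of failing the whole document, which is a design choice of this sketch rather than something stated in the thread.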
>
>
>
> Best regards,
>
> Luís Nassif
>
>
>
> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <
> [email protected]> wrote:
>
> Unrelated to my previous questions: I’m getting a timeout in Tika, in
> TesseractOCRParser.runOCRProcess.  It’s one of the errors that says
> ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
> spawn a separate process to do the OCR?  We’re having some performance
> issues, so in a way this doesn’t come as a surprise.  I’m just trying to
> understand a little more about what’s going on.
>
>
>
> private void runOCRProcess(Process process, int timeout)
>         throws IOException, TikaException {
>     // Nothing is written to the child's stdin, so close it right away.
>     process.getOutputStream().close();
>     InputStream out = process.getInputStream();
>     InputStream err = process.getErrorStream();
>     StringBuilder outBuilder = new StringBuilder();
>     StringBuilder errBuilder = new StringBuilder();
>     // logStream returns one thread per stream; both are started below.
>     Thread outThread = this.logStream(out, outBuilder);
>     Thread errThread = this.logStream(err, errBuilder);
>     outThread.start();
>     errThread.start();
>     int exitValue = Integer.MIN_VALUE;
>
>     try {
>         // Wait up to `timeout` seconds for the child process to exit.
>         boolean finished = process.waitFor((long) timeout, TimeUnit.SECONDS);
>         if (!finished) {
>             throw new TikaException("TesseractOCRParser timeout");
>         }
>
>         exitValue = process.exitValue();
>     } catch (InterruptedException e) {
>         // Restore the interrupt flag before rethrowing.
>         Thread.currentThread().interrupt();
>         throw new TikaException("TesseractOCRParser interrupted", e);
>     } catch (IllegalThreadStateException e) {
>         throw new TikaException("TesseractOCRParser timeout");
>     }
>
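
For readers following the excerpt above, here is a small standalone sketch of the same pattern: start a child process, wait for it with a bounded timeout, and treat "not finished" as a timeout. The class and method names (ProcessTimeoutSketch, runWithTimeout) are hypothetical and do not come from Tika. Killing the child on timeout is an assumption of this sketch, since the excerpt is cut off before any cleanup code.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: illustrates Process.waitFor with a timeout, not Tika's code.
final class ProcessTimeoutSketch {

    /**
     * Starts the command as a child process and waits up to timeoutSeconds.
     * Returns the exit code, or throws IOException if the process did not
     * finish in time.
     */
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)                       // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                .start();
        boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        if (!finished) {
            // Assumption of this sketch: kill the child so it does not linger.
            process.destroyForcibly();
            throw new IOException("process timed out after " + timeoutSeconds + " s");
        }
        return process.exitValue();
    }
}
```

Discarding the output keeps the sketch short. The quoted method instead reads stdout and stderr on separate threads, which also keeps the child from stalling on a full pipe while the parent waits.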
