Yes, tesseract does. But if a 100-page TIFF is sent directly to tesseract,
it could take minutes and trigger a timeout. The default OCR timeout is 120s,
and in my experience tesseract takes 3-4s per page on average, so roughly
300-400s for 100 pages.
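
A minimal sketch of that arithmetic and of the one-page-at-a-time approach, in
plain Java. The class and method names (PerPageOcr, estimatedSeconds,
fitsInTimeout, runWithTimeout, ocrPages) are hypothetical and are not Tika
classes. It assumes the TIFF has already been split into single-page image
files and that the tesseract binary is on the PATH. It throws IOException on
timeout where Tika throws TikaException, only to stay within the standard
library.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class PerPageOcr {

    // Rough wall-clock estimate for OCRing all pages in a single tesseract call.
    static double estimatedSeconds(int pages, double secondsPerPage) {
        return pages * secondsPerPage;
    }

    // True if the whole document is expected to finish within the timeout.
    static boolean fitsInTimeout(int pages, double secondsPerPage, int timeoutSeconds) {
        return estimatedSeconds(pages, secondsPerPage) <= timeoutSeconds;
    }

    // Starts one external process and waits at most timeoutSeconds for it.
    static int runWithTimeout(List<String> command, int timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)                       // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                .start();
        process.getOutputStream().close();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("OCR timeout after " + timeoutSeconds + "s: " + command);
        }
        return process.exitValue();
    }

    // One tesseract call per page, so every page gets a fresh timeout.
    // pagePaths is a hypothetical list of single-page image files.
    static void ocrPages(List<String> pagePaths, int perPageTimeoutSeconds)
            throws IOException, InterruptedException {
        for (String page : pagePaths) {
            // "tesseract <image> <outputbase>" writes the text to <outputbase>.txt
            runWithTimeout(List.of("tesseract", page, page), perPageTimeoutSeconds);
        }
    }
}
```

With the numbers above, fitsInTimeout(100, 3.5, 120) is false, while a single
page fits easily, which is why restarting the timeout for each page avoids the
problem.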


On Wed, Jan 19, 2022 at 21:54, Tim Allison <[email protected]> wrote:

> Sorry. You’re right. I think tesseract is supposed to handle multi page
> tiffs on its own.
>
> On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <[email protected]>
> wrote:
>
>> Hi Tim,
>>
>> I'm sure Tika does that for PDFs, but I couldn't find that logic in the
>> code base for TIFFs. Could you point me to the class that does that?
>>
>> Luis
>>
>>
>> On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]>
>> wrote:
>>
>>> Yes.  Exactly right.  Tika spawns a process per page/image.
>>>
>>> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <
>>> [email protected]> wrote:
>>>
>>>> I believe that Tika just OCR’s one page at a time.  My guess is that it
>>>> spawns a process for each page.
>>>>
>>>>
>>>>
>>>> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>>>>
>>>> *C: 703.887.5623 *
>>>>
>>>> 5250 W 116th Pl, Suite 200, Leawood, KS 66211
>>>> WWW.TORCH.AI <http://www.torch.ai/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From:* Luís Filipe Nassif <[email protected]>
>>>> *Sent:* Wednesday, January 19, 2022 11:11 AM
>>>> *To:* [email protected]
>>>> *Subject:* Re: TesseractOCRParser timeout
>>>>
>>>>
>>>>
>>>> Just a guess: if you are OCRing multipage TIF files, that may be the
>>>> reason. I "think" Tika sends the whole TIF to tesseract, and that could
>>>> take a long time if there are lots of pages, triggering timeouts. In our
>>>> project, we send one TIF page at a time to tesseract and restart the
>>>> timeout counter for each page to avoid this.
>>>>
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Luís Nassif
>>>>
>>>>
>>>>
>>>> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <
>>>> [email protected]> wrote:
>>>>
>>>> Unrelated to my previous questions.  I’m getting some sort of timeout
>>>> in Tika in TesseractOCRParser.runOCRProcess.  It’s one of the errors that
>>>> say ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
>>>> spawn a separate process to do the OCR?  We’re having some performance
>>>> issues, so in a way, this doesn’t come as a surprise.  Just trying to
>>>> understand a little more what’s going on
>>>>
>>>>
>>>>
>>>> private void runOCRProcess(Process process, int timeout)
>>>>         throws IOException, TikaException {
>>>>     process.getOutputStream().close();
>>>>     InputStream out = process.getInputStream();
>>>>     InputStream err = process.getErrorStream();
>>>>     StringBuilder outBuilder = new StringBuilder();
>>>>     StringBuilder errBuilder = new StringBuilder();
>>>>     Thread outThread = this.logStream(out, outBuilder);
>>>>     Thread errThread = this.logStream(err, errBuilder);
>>>>     outThread.start();
>>>>     errThread.start();
>>>>     int exitValue = -2147483648;
>>>>
>>>>     try {
>>>>         boolean finished = process.waitFor((long) timeout, TimeUnit.SECONDS);
>>>>         if (!finished) {
>>>>             throw new TikaException("TesseractOCRParser timeout");
>>>>         }
>>>>
>>>>         exitValue = process.exitValue();
>>>>     } catch (InterruptedException var12) {
>>>>         Thread.currentThread().interrupt();
>>>>         throw new TikaException("TesseractOCRParser interrupted", var12);
>>>>     } catch (IllegalThreadStateException var13) {
>>>>         throw new TikaException("TesseractOCRParser timeout");
>>>>     }
