Sorry. You’re right. I think Tesseract is supposed to handle multi-page
TIFFs on its own.

On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <[email protected]>
wrote:

> Hi Tim,
>
> I'm sure Tika does that for PDFs, but I couldn't find that logic in the
> code base for TIFFs. Could you point me to the class that does that?
>
> Luis
>
>
> On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]>
> wrote:
>
>> Yes.  Exactly right.  Tika spawns a process per page/image.
>>
>> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <
>> [email protected]> wrote:
>>
>>> I believe that Tika just OCRs one page at a time.  My guess is that it
>>> spawns a process for each page.
>>>
>>>
>>>
>>> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>>>
>>> *C: 703.887.5623 *
>>>
>>> [image: Torch AI] <http://www.torch.ai/>
>>>
>>> 5250 W 116th Pl, Suite 200, Leawood, KS 66211
>>> WWW.TORCH.AI <http://www.torch.ai/>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Luís Filipe Nassif <[email protected]>
>>> *Sent:* Wednesday, January 19, 2022 11:11 AM
>>> *To:* [email protected]
>>> *Subject:* Re: TesseractOCRParser timeout
>>>
>>>
>>>
>>> Just a guess: if you are OCRing multipage TIF files, that may be the
>>> reason. I "think" Tika sends the whole TIF to tesseract, and that could take
>>> a long time if there are lots of pages, triggering timeouts. In
>>> our project, we send one TIF page at a time to tesseract and restart the
>>> timeout counter for each page to avoid this.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Luís Nassif
>>>
>>>
>>>
>>> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <
>>> [email protected]> wrote:
>>>
>>> Unrelated to my previous questions.  I’m getting some sort of timeout in
>>> Tika in TesseractOCRParser.runOCRProcess.  It’s one of the errors that say
>>> ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
>>> spawn a separate process to do the OCR?  We’re having some performance
>>> issues, so in a way, this doesn’t come as a surprise.  Just trying to
>>> understand a little more about what’s going on.
>>>
>>>
>>>
>>> private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
>>>     process.getOutputStream().close();
>>>     InputStream out = process.getInputStream();
>>>     InputStream err = process.getErrorStream();
>>>     StringBuilder outBuilder = new StringBuilder();
>>>     StringBuilder errBuilder = new StringBuilder();
>>>     Thread outThread = this.logStream(out, outBuilder);
>>>     Thread errThread = this.logStream(err, errBuilder);
>>>     outThread.start();
>>>     errThread.start();
>>>     int exitValue = -2147483648;
>>>
>>>     try {
>>>         boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
>>>         if (!finished) {
>>>             throw new TikaException("TesseractOCRParser timeout");
>>>         }
>>>
>>>         exitValue = process.exitValue();
>>>     } catch (InterruptedException var12) {
>>>         Thread.currentThread().interrupt();
>>>         throw new TikaException("TesseractOCRParser interrupted", var12);
>>>     } catch (IllegalThreadStateException var13) {
>>>         throw new TikaException("TesseractOCRParser timeout");
>>>     }
>>>

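For reference, the per-page approach Luís describes in the thread (one tesseract process per page, with the timeout restarted for each page) could look roughly like the sketch below. This is not Tika's code and not the code from Luís's project. The class name `PerPageOcr`, its method names, and the `page-N` output naming are all made up for illustration, and it assumes the multipage TIF has already been split into one image file per page and that a `tesseract` binary is on the PATH.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only (not part of Tika).
class PerPageOcr {

    // Builds "tesseract <image> <outputBase>"; tesseract adds ".txt" to the output base.
    static List<String> tesseractCommand(Path pageImage, Path outputBase) {
        return List.of("tesseract", pageImage.toString(), outputBase.toString());
    }

    // Runs one command. Returns true if it exited within the timeout,
    // false if the timeout expired and the process was killed.
    static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)                       // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                .start();
        boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        if (!finished) {
            process.destroyForcibly(); // do not leave a runaway OCR process behind
        }
        return finished;
    }

    // OCRs each page in its own process, so every page gets a fresh timeout.
    // Returns the pages that timed out instead of failing the whole document.
    static List<Path> ocrPages(List<Path> pageImages, Path outputDir, long perPageTimeoutSeconds)
            throws IOException, InterruptedException {
        List<Path> timedOut = new ArrayList<>();
        int pageNumber = 1;
        for (Path page : pageImages) {
            Path outputBase = outputDir.resolve("page-" + pageNumber++);
            if (!runWithTimeout(tesseractCommand(page, outputBase), perPageTimeoutSeconds)) {
                timedOut.add(page);
            }
        }
        return timedOut;
    }

    // Usage: java PerPageOcr <outputDir> <page1> [<page2> ...]
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: PerPageOcr <outputDir> <page1> [<page2> ...]");
            return;
        }
        List<Path> pages = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            pages.add(Paths.get(args[i]));
        }
        // 120 seconds per page is an arbitrary value chosen for this sketch.
        List<Path> timedOut = ocrPages(pages, Paths.get(args[0]), 120);
        System.out.println("Pages that timed out: " + timedOut);
    }
}
```

With this shape, a single slow page only costs its own timeout; the other pages still get OCRed, and the caller can decide what to do with the pages that were skipped.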