Hi Tim,

I'm sure Tika does that for PDFs, but I couldn't find that logic in the
code base for TIFFs. Could you point me to the class that does that?

Luis


On Wed, Jan 19, 2022 at 15:00, Tim Allison <[email protected]> wrote:

> Yes.  Exactly right.  Tika spawns a process per page/image.
>
> On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <
> [email protected]> wrote:
>
>> I believe that Tika just OCR’s one page at a time.  My guess is that it
>> spawns a process for each page.
>>
>>
>>
>> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>>
>> *C: 703.887.5623 *
>>
>> [image: Torch AI] <http://www.torch.ai/>
>>
>> 5250 W 116th Pl, Suite 200., Leawood, KS 66211
>> WWW.TORCH.AI <http://www.torch.ai/>
>>
>>
>>
>>
>>
>> *From:* Luís Filipe Nassif <[email protected]>
>> *Sent:* Wednesday, January 19, 2022 11:11 AM
>> *To:* [email protected]
>> *Subject:* Re: TesseractOCRParser timeout
>>
>>
>>
>> Just a guess: if you are OCRing multipage TIF files, that may be the
>> reason. I "think" Tika sends the whole TIF to tesseract, and that could take
>> a long time if there are lots of pages, triggering timeouts. In our
>> project, we send the TIF to tesseract one page at a time and restart the
>> timeout counter for each page to avoid this.
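
A minimal sketch of the "restart the timeout counter for each page" idea described above. This is neither Tika's code nor the code from the project mentioned; `PerPageTimeout`, `runEach` and `PageTimeoutException` are hypothetical names, and each page task stands in for whatever actually invokes tesseract on a single page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs one task per page, giving each page a fresh timeout. */
class PerPageTimeout {

    /** Unchecked exception that records which page exceeded its time budget. */
    static class PageTimeoutException extends RuntimeException {
        final int pageIndex;

        PageTimeoutException(int pageIndex) {
            super("OCR timeout on page " + pageIndex);
            this.pageIndex = pageIndex;
        }
    }

    /** Runs the page tasks in order; the timeout applies to each page separately. */
    static <T> List<T> runEach(List<Callable<T>> pageTasks, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        List<T> results = new ArrayList<>();
        try {
            for (int i = 0; i < pageTasks.size(); i++) {
                Future<T> future = executor.submit(pageTasks.get(i));
                try {
                    // The timeout counter effectively restarts here for every page.
                    results.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupt the page task that ran too long
                    throw new PageTimeoutException(i);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("OCR failed on page " + i, e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted on page " + i, e);
                }
            }
            return results;
        } finally {
            executor.shutdownNow(); // stop the worker thread in every case
        }
    }
}
```

With this shape, a file with many pages can take longer in total than the timeout value, as long as no single page does.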
>>
>>
>>
>> Best regards,
>>
>> Luís Nassif
>>
>>
>>
>> On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <
>> [email protected]> wrote:
>>
>> Unrelated to my previous questions.  I’m getting some sort of timeout in
>> Tika in TesseractOCRParser.runOCRProcess.  It’s one of the errors that say
>> ‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it
>> spawn a separate process to do the OCR?  We’re having some performance
>> issues, so in a way, this doesn’t come as a surprise.  I'm just trying to
>> understand a little more about what’s going on.
>>
>>
>>
>> private void runOCRProcess(Process process, int timeout)
>>         throws IOException, TikaException {
>>     // Close the child's stdin; nothing is written to it.
>>     process.getOutputStream().close();
>>     InputStream out = process.getInputStream();
>>     InputStream err = process.getErrorStream();
>>     StringBuilder outBuilder = new StringBuilder();
>>     StringBuilder errBuilder = new StringBuilder();
>>     // logStream returns one thread per stream; start both.
>>     Thread outThread = this.logStream(out, outBuilder);
>>     Thread errThread = this.logStream(err, errBuilder);
>>     outThread.start();
>>     errThread.start();
>>     int exitValue = Integer.MIN_VALUE;
>>
>>     try {
>>         // Wait at most `timeout` seconds for the process to exit.
>>         boolean finished = process.waitFor((long) timeout, TimeUnit.SECONDS);
>>         if (!finished) {
>>             throw new TikaException("TesseractOCRParser timeout");
>>         }
>>
>>         exitValue = process.exitValue();
>>     } catch (InterruptedException var12) {
>>         Thread.currentThread().interrupt();
>>         throw new TikaException("TesseractOCRParser interrupted", var12);
>>     } catch (IllegalThreadStateException var13) {
>>         throw new TikaException("TesseractOCRParser timeout");
>>     }
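
The pattern in the excerpt above (start an external process, wait a bounded time, fail with a timeout otherwise) can be sketched on its own as follows. This is an illustration, not Tika's implementation: `TimedProcess`, `run` and `ProcessTimeoutException` are hypothetical names, and killing the child with `destroyForcibly()` on timeout is an assumption of this sketch, since the excerpt does not show what happens to the process afterwards. Output is discarded here instead of being drained by separate threads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: run an external command and give up after a timeout. */
class TimedProcess {

    /** Unchecked exception thrown when the command does not finish in time. */
    static class ProcessTimeoutException extends RuntimeException {
        ProcessTimeoutException(String message) {
            super(message);
        }
    }

    /** Returns the exit code, or throws ProcessTimeoutException on timeout. */
    static int run(List<String> command, long timeoutMillis) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException("Could not start " + command, e);
        }
        try {
            if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                process.destroyForcibly(); // assumption: do not leave the child running
                throw new ProcessTimeoutException("timeout after " + timeoutMillis + " ms");
            }
            return process.exitValue();
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

A call such as `TimedProcess.run(List.of("tesseract", "--version"), 5000)` would return the exit code if the binary is on the PATH and finishes within five seconds.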
