Thanks for the answer, Tim. I tested the -taskTimeoutMillis parameter, and it 
does kill the tesseract process linked to the child task!

-----Original Message-----
From: Tim Allison <[email protected]> 
Sent: Thursday, February 4, 2021 16:18
To: [email protected]
Subject: Re: Tika server and Tesseract process

Sadly, not cleanly yet...  I'm working on adding an async mode to tika-server 
that will let the process complete after you send the request.

If you run tika-server in -spawnChild mode, it will restart the whole server 
after a parser times out.  This will kill any threads/parses that are currently 
running, and I'm hoping this will kill the tesseract process.  You can set the 
max time for a parse with -taskTimeoutMillis.

-spawnChild will be the new default in 2.x.
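
As a rough illustration of the setup described above, the sketch below builds 
a tika-server command line with -spawnChild and -taskTimeoutMillis, and sends 
one file with a client-side timeout slightly longer than the server-side 
limit. The jar file name, the 2-minute limit, the 30-second margin, and the 
helper names are assumptions made for the example, not values from this 
thread. The URL assumes tika-server is listening on its default port, 9998.

```python
import urllib.request

# Assumed values for illustration only; adjust to the local setup.
TIKA_JAR = "tika-server-1.24.1.jar"      # assumed jar file name
TIKA_URL = "http://localhost:9998/tika"  # tika-server default port
TASK_TIMEOUT_MS = 120_000                # assumed parse limit (2 minutes)


def server_command(jar=TIKA_JAR, task_timeout_ms=TASK_TIMEOUT_MS):
    """Build the command line that starts tika-server in -spawnChild mode."""
    return ["java", "-jar", jar, "-spawnChild",
            "-taskTimeoutMillis", str(task_timeout_ms)]


def client_timeout_seconds(task_timeout_ms=TASK_TIMEOUT_MS, margin_s=30):
    """Return a client timeout a bit longer than the server-side limit.

    The idea (an assumption of this sketch) is that the server should hit
    its own limit and restart the child before the client gives up.
    """
    return task_timeout_ms / 1000 + margin_s


def extract_text(path, url=TIKA_URL, timeout_s=None):
    """PUT one file to the /tika endpoint and return the extracted text."""
    if timeout_s is None:
        timeout_s = client_timeout_seconds()
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(url, data=data, method="PUT",
                                 headers={"Accept": "text/plain"})
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Print the command to run; starting the server is left to the caller.
    print(" ".join(server_command()))
```

Keeping the client timeout above the server limit is a design choice of this 
sketch: the aim is for the child process to have been restarted by the time 
the client moves on to the next file. Because the whole child is restarted, 
the first request after a timeout may fail and need a retry.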

On Thu, Feb 4, 2021 at 7:56 AM <[email protected]> wrote:
>
> Hi,
>
>
>
> I am using Tika server v1.24.1, configured to do OCR on PDF files. I 
> have a process that sends files one by one to the Tika server with a timeout 
> of x minutes configured on the connection. If Tika does not return 
> the extracted text before the timeout, the connection is killed and a new one 
> is opened to process the next file.
>
> The problem is that when a timeout happens, the tesseract process started 
> by the Tika server is not killed. So when my process sends the next file to 
> the Tika server, another tesseract process is started. As a consequence, 
> the chances that a timeout occurs on the new file increase. If the timeout is 
> reached on several consecutive files, it triggers a snowball effect: I 
> end up with many tesseract processes that consume all the CPU resources of 
> the machine, and no more PDF files are processed correctly.
>
>
>
> So my question is: is there a way to force the Tika server to kill the 
> tesseract process it started when the connection that sent the file 
> is closed?
>
>
>
> Regards,
>
> Julien Massiera
>
>
