Hi,

I am using Tika server v1.24.1, configured to run OCR on PDF files. I
have a process that sends files one by one to the Tika server, with a timeout
of x minutes configured on the connection. If Tika does not return the
extracted text before the timeout expires, the connection is closed and a new
one is opened to process the next file.
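
To illustrate the setup, here is a minimal Python sketch of such a client. It
is not the actual code: it assumes the server listens on the default
http://localhost:9998/tika endpoint, and the names extract_text and
process_all are hypothetical.

```python
import socket
import urllib.error
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def extract_text(data, url=TIKA_URL, timeout_s=300.0):
    """PUT one document to Tika; return its text, or None on timeout."""
    req = urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (socket.timeout, TimeoutError):
        # Timed out while waiting for the response: the client gives up here,
        # but nothing on this side tells the server to stop its OCR work.
        return None
    except urllib.error.URLError as exc:
        # A timeout while connecting or sending is wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return None
        raise


def process_all(paths, url=TIKA_URL, timeout_s=300.0):
    """Send files one by one; return (texts by path, paths that timed out)."""
    texts, timed_out = {}, []
    for path in paths:
        with open(path, "rb") as fh:
            text = extract_text(fh.read(), url, timeout_s)
        if text is None:
            timed_out.append(path)
        else:
            texts[path] = text
    return texts, timed_out
```

Each call opens its own connection, so a timeout on one file only closes that
connection on the client side before the next file is sent.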

The problem is that when a timeout happens, the Tesseract process
started by the Tika server is not killed. So when my process sends the
next file to the Tika server, another Tesseract process is started. As a
consequence, the chances of a timeout on the new file increase. If
the timeout is reached on several consecutive files, it triggers a snowball
effect: I end up with many Tesseract processes that consume all the CPU
resources of the machine, and no further PDF files are processed
correctly.

So my question is: is there a way to force the Tika server to kill the
Tesseract process it started when the connection that sent the file
is closed?

Regards,

Julien Massiera