Hi,
I am using Tika server v1.24.1, configured to run OCR on PDF files. I have a process that sends files one by one to the Tika server, with a timeout of x minutes configured on the connection. If Tika does not return the extracted text before the timeout, the connection is killed and a new one is opened to process the next file.

The problem is that when a timeout happens, the Tesseract process started by the Tika server is not killed. So when my process sends the next file, another Tesseract process is started, which increases the chance that the new file times out as well. If the timeout is reached on several files in a row, it snowballs: I end up with many Tesseract processes consuming all the CPU resources of the machine, and no more PDF files are processed correctly.

So my question is: is there a way to force the Tika server to kill the Tesseract process it started when the connection that sent the file is closed?

Regards,
Julien Massiera
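
A minimal sketch of the client-side behaviour described above, for illustration only. It assumes Python, the tika-server default address http://localhost:9998 and its PUT /tika endpoint. The function names and the timeout parameter are hypothetical and are not taken from the actual client.

```python
import socket
import urllib.error
import urllib.request

# Assumption: tika-server listening on its default port with the /tika endpoint.
TIKA_URL = "http://localhost:9998/tika"


def build_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the server for plain-text output."""
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def extract_text(data: bytes, timeout_s: float, url: str = TIKA_URL):
    """Send one document; return its text, or None on timeout or connection failure.

    When the timeout fires, only the client side of the connection is closed.
    Nothing here tells the server to stop the work it already started.
    """
    try:
        with urllib.request.urlopen(build_request(data, url), timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (socket.timeout, urllib.error.URLError):
        return None
```

Calling `extract_text` in a loop over the files reproduces the pattern in question: each timed-out call returns `None` and the loop moves on to the next file, while any OCR work already started on the server side may still be running.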
