Thanks for the answer, Tim. Indeed, I tested the taskTimeoutMillis parameter, and it correctly kills the tesseract process linked to the child task!
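
In case it helps anyone else on the list, here is a minimal sketch of how the server launch could be scripted with the two options discussed in this thread. The helper name tika_server_command, the jar path and the 120000 ms value are placeholders of mine, not anything prescribed by Tika; only -spawnChild and -taskTimeoutMillis come from the thread itself.

```python
from typing import List


def tika_server_command(jar_path: str, task_timeout_millis: int) -> List[str]:
    """Build an argv list that starts tika-server in -spawnChild mode.

    jar_path and task_timeout_millis are illustrative inputs; adjust them
    to your own installation and to how long an OCR parse may reasonably run.
    """
    if task_timeout_millis <= 0:
        raise ValueError("task_timeout_millis must be a positive number of milliseconds")
    return [
        "java", "-jar", jar_path,
        # Run parses in a child process that is restarted after a timeout.
        "-spawnChild",
        # Maximum time allowed for a single parse, in milliseconds.
        "-taskTimeoutMillis", str(task_timeout_millis),
    ]


if __name__ == "__main__":
    # Hypothetical jar name and a 2-minute parse limit.
    cmd = tika_server_command("tika-server-1.24.1.jar", 120000)
    print(" ".join(cmd))
    # The list can be handed to subprocess.Popen(cmd) to actually start the server.
```

It probably makes sense to keep the client-side connection timeout somewhat longer than the -taskTimeoutMillis value, so that the server gives up on the parse before the client drops the connection, but that is my own assumption and not something I have measured.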

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Thursday, February 4, 2021 16:18
To: [email protected]
Subject: Re: Tika server and Tesseract process

Sadly, not cleanly yet... I'm working on adding an async mode to tika-server that will let the process complete after you send the request.

If you run tika-server in -spawnChild mode, it will restart the whole server after a parser times out. This will kill any threads/parses that are currently running, and I'm hoping this will kill the tesseract process. You can set the max time for a parse with -taskTimeoutMillis.

-spawnChild will be the new default in 2.x.

On Thu, Feb 4, 2021 at 7:56 AM <[email protected]> wrote:
> Hi,
>
> I am using a Tika server v1.24.1 and configured it to do OCR on PDF files. I
> have a process that sends files one by one to the Tika server with a timeout
> of x minutes configured on the connection. So if Tika is not able to return
> the extracted text before the timeout, the connection is killed and a new one
> is initialized to process the next file.
>
> The problem is that when a timeout happens, the tesseract process started
> by the Tika server is not killed. So when my process sends the next file to
> the Tika server, another tesseract process is started. As a consequence,
> the chances that a timeout occurs on the new file increase. If the timeout is
> reached consecutively on several files, it triggers a snowball effect and I
> end up with a lot of tesseract processes that take all the CPU resources of
> the machine, and no more PDF files are processed correctly.
>
> So my question is: is there a way to force the Tika server to kill the
> tesseract process it has started when the connection that sent the file
> is closed?
>
> Regards,
> Julien Massiera
