Yay! As I mentioned, -spawnChild will become the default in 2.x.
Your client has to be prepared for the server to be in "restart" mode
(i.e. temporarily unavailable). The other challenge is that if you have
more than one thread submitting tasks to the parser, you can't tell which
file caused the server to crash.
That said, this is far more robust.
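
For what it's worth, here is a minimal client-side sketch of riding out the
"restart" window. It is only an illustration under assumptions: the names
put_file, parse_with_retry and TIKA_URL are hypothetical, the URL assumes a
local server on the default port, and treating HTTP 503 as "server is
restarting" is an assumption you should check against your own server's
behavior.

```python
import time
import urllib.error
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def put_file(path, url=TIKA_URL, timeout=300):
    """PUT one file to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        req = urllib.request.Request(url, data=fh.read(), method="PUT")
    req.add_header("Accept", "text/plain")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_with_retry(path, send=put_file, retries=5, delay=2.0, sleep=time.sleep):
    """Call send(path), retrying while the server looks unavailable."""
    last_error = None
    for attempt in range(retries):
        try:
            return send(path)
        except urllib.error.HTTPError as err:
            # Assumption: 503 means the server is restarting; other
            # HTTP errors are real failures and are re-raised.
            if err.code != 503:
                raise
            last_error = err
        except (ConnectionError, urllib.error.URLError) as err:
            # Connection refused or reset, e.g. the child process is down.
            last_error = err
        # Linear backoff: delay, 2 * delay, 3 * delay, ...
        sleep(delay * (attempt + 1))
    raise RuntimeError(f"server unavailable after {retries} attempts") from last_error
```

Usage would be something like text = parse_with_retry("scan.pdf"). Note that a
client-side read timeout is deliberately not retried here, and that a file
which itself crashes the server will be resent until the retries run out, so
keep the retry count small.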
Please, please let us know how we can improve robustness and usability.
Thank you for getting back to us so quickly!
Cheers,
Tim
On Thu, Feb 4, 2021 at 10:52 AM <[email protected]> wrote:
>
> Thanks for the answer, Tim. Indeed, I tested the taskTimeoutMillis parameter
> and it correctly kills the tesseract process linked to the child task!
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Thursday, February 4, 2021 16:18
> To: [email protected]
> Subject: Re: Tika server and Tesseract process
>
> Sadly, not cleanly yet... I'm working on adding an async mode to tika-server
> that will let the process complete after you send the request.
>
> If you run tika-server in -spawnChild mode, it will restart the whole server
> after a parser times out. This will kill any threads/parses that are
> currently running, and I'm hoping this will kill the tesseract process. You
> can set the max time for a parse with -taskTimeoutMillis.
>
> -spawnChild will be the new default in 2.x.
>
> On Thu, Feb 4, 2021 at 7:56 AM <[email protected]> wrote:
> >
> > Hi,
> >
> >
> >
> > I am using a Tika server v1.24.1 and configured it to do OCR on PDF files.
> > I have a process that sends files one by one to the Tika server with a
> > timeout of x minutes configured on the connection. So if Tika is not able
> > to return the extracted text before the timeout, the connection is killed
> > and a new one is initialized to process the next file.
> >
> > The problem is that when a timeout happens, the tesseract process
> > initialized by the Tika server is not killed. So when my process sends the
> > next file to the Tika Server, another Tesseract process is initialized. As a
> > consequence, the chances that a timeout occurs on the new file increase. If
> > the timeout is reached consecutively on several files, it triggers a
> > snowball effect and I end up with a lot of tesseract processes that take
> > all the CPU resources of the machine, and no more PDF files are
> > processed correctly.
> >
> >
> >
> > So my question is: is there a way to force the Tika Server to kill the
> > Tesseract process it has initialized when the connection that sent the file
> > is closed?
> >
> >
> >
> > Regards,
> >
> > Julien Massiera
> >
> >
>