I didn't know that there was a ForkParser, but it might add significant overhead to the application. It looks like it has a pool, though I don't know whether it lets me, say, kill a long-running parser and restart the pool. I will look into it. One thing I see already is that it catches InterruptedException and wraps it in a TikaException, but does not set the thread's interrupted flag, and it cannot rethrow InterruptedException because the Parser interface does not declare it. It catches the inability to communicate with the forked process, but does it start a new process if I cancel one?
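To illustrate the interrupted-flag point, here is a minimal sketch in plain Java. It is not Tika code: WrappedException is a hypothetical stand-in for TikaException, and blockingWork is a hypothetical stand-in for a parse method whose signature cannot throw InterruptedException. It only shows that the flag is lost unless the catch block re-sets it.

```java
public class InterruptSketch {

    // Hypothetical stand-in for a checked exception such as TikaException.
    static class WrappedException extends Exception {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical stand-in for a method that may not throw InterruptedException.
    // Thread.sleep clears the interrupted flag when it throws, so the flag is
    // lost unless the catch block sets it again.
    static void blockingWork(boolean restoreFlag) throws WrappedException {
        try {
            Thread.sleep(60_000); // stands in for any blocking wait
        } catch (InterruptedException e) {
            if (restoreFlag) {
                Thread.currentThread().interrupt(); // re-set the flag for callers
            }
            throw new WrappedException("interrupted while working", e);
        }
    }

    // Returns true if the interrupted flag is still set after the wrapped
    // exception has been thrown. Thread.interrupted() also clears the flag,
    // so the calling thread is left in a clean state.
    static boolean flagSurvives(boolean restoreFlag) {
        Thread.currentThread().interrupt(); // makes sleep throw immediately
        try {
            blockingWork(restoreFlag);
        } catch (WrappedException e) {
            // the caller only ever sees the wrapper, not InterruptedException
        }
        return Thread.interrupted();
    }

    public static void main(String[] args) {
        System.out.println("flag kept without restore: " + flagSurvives(false));
        System.out.println("flag kept with restore:    " + flagSurvives(true));
    }
}
```

Without the restore, code further up the stack (an executor, or a loop checking Thread.interrupted()) has no way to tell that a cancel was requested, other than by unwrapping the cause of the exception.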
I may have no choice, though, as RecursiveParserWrapper, like any implementation of Parser, does not check Thread.interrupted() or throw InterruptedException, which means that timing out a Future and cancelling it will not actually stop the parse.

Anyway, thanks for the pointer - I will play with it.

Jim

> -----Original Message-----
> From: Nick Burch [mailto:apa...@gagravarr.org]
> Sent: Tuesday, November 21, 2017 17:10
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
>
> On Tue, 21 Nov 2017, Jim Idle wrote:
> > Following up on this, I will try cancelling my thread based tasks
> > after a pre-set time limit. That is only going to work if Tika and the
> > underlying parsers behave correctly with the interrupted exception.
> > Anyone had any success with that? I am mainly looking at Office, PDF
> > and HTML right now. I will try it myself of course, but perhaps
> > someone has already been down this path?
>
> Have you tried with ForkParser? That would also protect you against other
> kinds of failures like OOM too
>
> Nick