I didn't know that there was a ForkParser, but it might add significant 
overhead to the application. It looks like it has a pool, though I don't know 
whether it lets me, say, kill a long-running parser and restart the pool. I 
will look into it. One thing I see already is that it intercepts 
InterruptedException and wraps it in a TikaException, but does not set the 
thread's interrupted flag, and it cannot rethrow InterruptedException because 
the Parser interface does not declare it. It catches the inability to 
communicate, but does it start a new process if I cancel one?
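
For illustration, here is a minimal sketch of wrapping InterruptedException 
in a checked exception while still restoring the thread's interrupted flag. 
It uses only the JDK. ParseFailedException is a hypothetical stand-in for a 
checked exception such as TikaException, and parseStep is a hypothetical 
blocking step, not anything taken from Tika's code.

```java
class InterruptPreservingWrapper {

    /** Hypothetical stand-in for a checked parser exception (assumption, not Tika's class). */
    static class ParseFailedException extends Exception {
        ParseFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical blocking step whose signature cannot declare InterruptedException. */
    static void parseStep(long millis) throws ParseFailedException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Catching InterruptedException clears the flag, so set it again
            // before wrapping; callers can then still see the interruption.
            Thread.currentThread().interrupt();
            throw new ParseFailedException("interrupted while parsing", e);
        }
    }

    /** Returns true if the interrupted flag is still set after the wrapped exception arrives. */
    static boolean flagSurvivesWrapping() {
        Thread.currentThread().interrupt(); // simulate a cancel request
        try {
            parseStep(1000);
            return false; // not expected: sleep should have been interrupted
        } catch (ParseFailedException e) {
            // Thread.interrupted() reads the flag and clears it again.
            return Thread.interrupted();
        }
    }
}
```

With the flag restored, code further up the stack (an executor, for example) 
can still tell that the thread was asked to stop, even though all it caught 
was the wrapper exception.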

I may have no choice though, as RecursiveParserWrapper, like any implementation 
of Parser, does not check Thread.interrupted() or throw InterruptedException, 
which means that I cannot time out a Future and cancel it in any way that 
actually stops the parse.
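
To make the problem concrete, here is a rough JDK-only sketch of why 
Future.cancel(true) only helps when the task cooperates. The busy loop is a 
hypothetical stand-in for a long-running parse, and the giveUp flag exists 
only so that the demo itself can terminate; neither is part of Tika.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class FutureTimeoutDemo {

    /** Returns true if the worker stopped shortly after cancel(true) was called. */
    static boolean stopsAfterCancel(boolean checksInterrupt) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // never keep the JVM alive for the demo
            return t;
        });
        AtomicBoolean giveUp = new AtomicBoolean(false); // demo-only escape hatch
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);

        Future<?> future = pool.submit(() -> {
            started.countDown();
            try {
                // Stand-in for a long parse that never blocks.
                while (!giveUp.get()) {
                    if (checksInterrupt && Thread.currentThread().isInterrupted()) {
                        return; // cooperative task: notices the interrupt and stops
                    }
                }
            } finally {
                finished.countDown();
            }
        });

        try {
            started.await();
            try {
                future.get(100, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // only sets the worker's interrupted flag
            }
            // Did the worker really stop, or is it still spinning?
            return finished.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            giveUp.set(true);
            pool.shutdownNow();
        }
    }
}
```

Under these assumptions, stopsAfterCancel(true) should report that the worker 
stopped, while stopsAfterCancel(false) should report that it kept running 
after the cancel, which is the behaviour described above for a parser that 
never checks the flag.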

Anyway, thanks for the pointer - I will play with it.

Jim

> -----Original Message-----
> From: Nick Burch [mailto:apa...@gagravarr.org]
> Sent: Tuesday, November 21, 2017 17:10
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
> 
> On Tue, 21 Nov 2017, Jim Idle wrote:
> > Following up on this, I will try cancelling my thread based tasks
> > after a pre-set time limit. That is only going to work if Tika and the
> > underlying parsers behave correctly with the interrupted exception.
> > Anyone had any success with that? I am mainly looking at Office, PDF
> > and HTML right now. I will try it myself of course, but perhaps
> > someone has already been down this path?
> 
> Have you tried with ForkParser? That would also protect you against other
> kinds of failures like OOM too
> 
> Nick
