RE: Very slow parsing of a few PDF files

2017-11-30 Thread Allison, Timothy B.
Great. I opened TIKA-2514 to track this. Pull requests are welcomed!  -Original Message- From: Jim Idle [mailto:ji...@proofpoint.com] Sent: Wednesday, November 29, 2017 8:58 PM To: user@tika.apache.org Subject: RE: Very slow parsing of a few PDF files That would be a more practical

RE: Very slow parsing of a few PDF files

2017-11-29 Thread Jim Idle
age- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Wednesday, November 29, 2017 23:52 > To: user@tika.apache.org > Subject: RE: Very slow parsing of a few PDF files > > >I am going to have to write my own application specific solution > > Ugh. I

RE: Very slow parsing of a few PDF files

2017-11-29 Thread Allison, Timothy B.
>I am going to have to write my own application specific solution Ugh. I'm sorry. If there's anything shareable, please do share. > ForkParser tries to serialize every class it thinks will be needed across the > connection and a lot of third party classes are not serializable. I think > that

RE: Very slow parsing of a few PDF files

2017-11-28 Thread Jim Idle
rg > Subject: RE: Very slow parsing of a few PDF files > > > > >As the HTML parser in Tika does not produce SAX events in the correct > order - the parser is great but does not support serialization - etc. > > Oh, please open a ticket with examples, or point me to one I've forgotten > about... ☹ Thank you!

RE: Very slow parsing of a few PDF files

2017-11-28 Thread Allison, Timothy B.
>As the HTML parser in Tika does not produce SAX events in the correct order -
>the parser is great but does not support serialization - etc.
Oh, please open a ticket with examples, or point me to one I've forgotten about... ☹ Thank you!

RE: Very slow parsing of a few PDF files

2017-11-27 Thread Jim Idle
in the correct order - the parser is great but does not support serialization - etc. Jim > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Monday, November 27, 2017 23:05 > To: user@tika.apache.org > Subject: RE: Very slow parsing of

RE: Very slow parsing of a few PDF files

2017-11-27 Thread Allison, Timothy B.
21, 2017 11:13 PM To: user@tika.apache.org Subject: RE: Very slow parsing of a few PDF files I didn't know that there was a ForkParser, but that might possibly be a significant overhead on the application - looks like it has a pool, though I don't know if it gives the ability to say kill a long

RE: Very slow parsing of a few PDF files

2017-11-21 Thread Jim Idle
to:apa...@gagravarr.org] > Sent: Tuesday, November 21, 2017 17:10 > To: user@tika.apache.org > Subject: RE: Very slow parsing of a few PDF files > > On Tue, 21 Nov 2017, Jim Idle wrote: > > Following up on this, I will try cancelling my thread based tasks > > after a pre-s

RE: Very slow parsing of a few PDF files

2017-11-21 Thread Nick Burch
On Tue, 21 Nov 2017, Jim Idle wrote: Following up on this, I will try cancelling my thread based tasks after a pre-set time limit. That is only going to work if Tika and the underlying parsers behave correctly with the interrupted exception. Anyone had any success with that? I am mainly
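A minimal sketch of the time-limit approach described above, using only java.util.concurrent. The class name TimedParse and the method runWithTimeout are hypothetical, and the Callable stands in for whatever call actually invokes the parser. As the message notes, cancelling only helps if the running code responds to interruption; a parser that never checks the interrupt flag will keep its thread busy after the timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    /**
     * Runs the task on the given pool and waits at most timeoutMillis for it.
     * Returns the task's result, or null if the time limit was reached.
     * (Hypothetical helper for illustration, not part of Tika.)
     */
    public static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread. Code that ignores
            // interruption will carry on running despite this call.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Stand-ins for parse calls: one quick, one that outlives its limit.
            String fast = runWithTimeout(pool, () -> "parsed", 1000);
            String slow = runWithTimeout(pool, () -> {
                Thread.sleep(60_000); // sleep is interruptible, so cancel works here
                return "parsed";
            }, 100);
            System.out.println("fast=" + fast + ", slow=" + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The slow task here ends promptly only because Thread.sleep throws InterruptedException when interrupted. A task that loops without checking Thread.interrupted() would not stop, which is the case where a separate process (the route ForkParser takes) becomes the safer option.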

RE: Very slow parsing of a few PDF files

2017-11-20 Thread Jim Idle
dave2w...@comcast.net] > Sent: Tuesday, November 21, 2017 12:06 > To: user@tika.apache.org > Subject: Re: Very slow parsing of a few PDF files > > IIRC - In a Mac version of PowerPoint some seven years ago Microsoft went off > OOXML spec which caused POI Produced files to run away. A POI user

Re: Very slow parsing of a few PDF files

2017-11-20 Thread Dave Fisher
>> Sent: Monday, November 20, 2017 11:54 >> To: user@tika.apache.org >> Subject: RE: Very slow parsing of a few PDF files >> >> Tim, >> >> I am seeing a lot of files that are taking a long time to parse and I am >> currently gathering some samples from o

RE: Very slow parsing of a few PDF files

2017-11-20 Thread Jim Idle
will try it myself of course, but perhaps someone has already been down this path? Jim > -Original Message- > From: Jim Idle [mailto:ji...@proofpoint.com] > Sent: Monday, November 20, 2017 11:54 > To: user@tika.apache.org > Subject: RE: Very slow parsing of a few PDF

RE: Very slow parsing of a few PDF files

2017-11-19 Thread Jim Idle
:talli...@mitre.org] > Sent: Friday, November 17, 2017 00:04 > To: user@tika.apache.org > Subject: RE: Very slow parsing of a few PDF files > > It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe? > Now that I think about it there was a beastly pptx file that so

Re: Very slow parsing of a few PDF files

2017-11-06 Thread Nick Burch
On Tue, 7 Nov 2017, Jim Idle wrote:
> I have a few PDF files that are taking a very long time to parse.
Are you sure it's a PDF? The profiler images you've sent are all for Apache POI and seem to show an XLS file being parsed.
Nick