Great. I opened TIKA-2514 to track this. Pull requests are welcomed!
-Original Message-
From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Wednesday, November 29, 2017 8:58 PM
To: user@tika.apache.org
Subject: RE: Very slow parsing of a few PDF files
That would be a more practical
age-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Wednesday, November 29, 2017 23:52
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
>
> >I am going to have to write my own application specific solution
>
> Ugh. I
>I am going to have to write my own application specific solution
Ugh. I'm sorry. If there's anything shareable, please do share.
> ForkParser tries to serialize every class it thinks will be needed across the
> connection and a lot of third party classes are not serializable. I think
> that
rg
> Subject: RE: Very slow parsing of a few PDF files
>
>
>
> >As the HTML parser in Tika does not produce SAX events in the correct
> order - the parser is great but does not support serialization - etc.
>
> Oh, please open a ticket with examples, or point me to one I've forgotten
> about... ☹ Thank you!
>As the HTML parser in Tika does not produce SAX events in the correct order -
>the parser is great but does not support serialization - etc.
Oh, please open a ticket with examples, or point me to one I've forgotten
about... ☹ Thank you!
in the correct order -
the parser is great but does not support serialization - etc.
Jim
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, November 27, 2017 23:05
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of
21, 2017 11:13 PM
To: user@tika.apache.org
Subject: RE: Very slow parsing of a few PDF files
I didn't know that there was a ForkParser, but that might possibly be a
significant overhead on the application - looks like it has a pool, though I
don't know if it gives the ability to say kill a long
to:apa...@gagravarr.org]
> Sent: Tuesday, November 21, 2017 17:10
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
>
> On Tue, 21 Nov 2017, Jim Idle wrote:
> > Following up on this, I will try cancelling my thread based tasks
> > after a pre-s
On Tue, 21 Nov 2017, Jim Idle wrote:
Following up on this, I will try cancelling my thread based tasks after
a pre-set time limit. That is only going to work if Tika and the
underlying parsers behave correctly with the interrupted exception.
Anyone had any success with that? I am mainly
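
For reference, a minimal sketch of the time-limit approach described above, using only java.util.concurrent. The class name TimeLimitedTask and the method runWithTimeout are hypothetical names chosen for illustration, and the lambdas are stand-ins for a real parse; no Tika API is called here. As noted above, cancel(true) only delivers an interrupt, so a parser that never checks the interrupt flag would keep running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeLimitedTask {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if the time limit was exceeded.
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread. The task stops
            // only if it checks the interrupt flag or is blocked in an
            // interruptible call; otherwise it keeps running.
            future.cancel(true);
            return null;
        } finally {
            // Release the worker thread so the JVM can exit.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes quickly (hypothetical).
        String fast = TimeLimitedTask.<String>runWithTimeout(() -> "parsed", 1000);

        // Stand-in for a runaway parse that does honour interruption.
        String slow = TimeLimitedTask.<String>runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 100);

        System.out.println(fast + " / " + slow);
    }
}
```

One executor per call keeps the sketch short; a real application would more likely share a pool, and would still need a separate process if the parser ignores interrupts.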
dave2w...@comcast.net]
> Sent: Tuesday, November 21, 2017 12:06
> To: user@tika.apache.org
> Subject: Re: Very slow parsing of a few PDF files
>
> IIRC - In a Mac version of PowerPoint some seven years ago, Microsoft went off
> the OOXML spec, which caused POI-produced files to run away. A POI user
>> Sent: Monday, November 20, 2017 11:54
>> To: user@tika.apache.org
>> Subject: RE: Very slow parsing of a few PDF files
>>
>> Tim,
>>
>> I am seeing a lot of files that are taking a long time to parse and I am
>> currently gathering some samples from o
will try it
myself of course, but perhaps someone has already been down this path?
Jim
> -Original Message-
> From: Jim Idle [mailto:ji...@proofpoint.com]
> Sent: Monday, November 20, 2017 11:54
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF
:talli...@mitre.org]
> Sent: Friday, November 17, 2017 00:04
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
>
> It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
> Now that I think about it there was a beastly pptx file that so
On Tue, 7 Nov 2017, Jim Idle wrote:
I have a few PDF files that are taking a very long time to parse.
Are you sure it's a PDF? The profiler images you've sent are all for
Apache POI and seem to show an XLS file being parsed
Nick