Tim,

I am seeing a lot of files that take a long time to parse. I am currently 
gathering samples from our company's servers that I can share publicly; most 
are proprietary to our customers, and a good number are malware, which may 
mean they are deliberately malformed and are causing the underlying parsers 
some issues.

Do you think that Tika could be made to abort a parse after a certain time, or 
is that too complicated given that there are so many underlying parser 
mechanisms?
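
In the meantime, one caller-side workaround would be to run each parse on a 
worker thread and give up after a deadline. A minimal sketch of that idea 
follows. It is an assumption on my part rather than anything Tika provides; 
the names (TimeBoxedParse, runWithTimeout) are made up for illustration, and 
the lambdas in main stand in for a real parse call:

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: time-boxes any blocking task.
public class TimeBoxedParse {

    // Runs the task on a worker thread and returns its result, or
    // Optional.empty() if the task misses the deadline (or returns null).
    public static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-boxed-parse");
            t.setDaemon(true); // a stuck parse should not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a parser that never
            // checks for interruption will keep running in the background.
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for a real parse call (assumption for illustration).
        Optional<String> fast = runWithTimeout(() -> "extracted text", 1000);
        Optional<String> slow = runWithTimeout(() -> {
            Thread.sleep(5000); // simulates a parse that takes far too long
            return "never returned";
        }, 200);
        System.out.println("fast parse finished: " + fast.isPresent());
        System.out.println("slow parse finished: " + slow.isPresent());
    }
}
```

The catch is that cancelling only interrupts the worker thread. A parser stuck 
in a tight loop may never notice the interrupt, so truly killing a runaway 
parse would probably need a separate process.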

Cheers,

Jim

> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Friday, November 17, 2017 00:04
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
> 
> It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?
> Now that I think about it there was a beastly pptx file that someone
> submitted on our JIRA that did take 2 minutes, so, maybe???
> 
> Open an issue in our JIRA to make extraction of charts/diagrams configurable,
> and you'll be able to tell. 😊
> 
> -----Original Message-----
> From: undersp...@gmail.com [mailto:undersp...@gmail.com]
> Sent: Wednesday, November 15, 2017 10:23 AM
> To: user@tika.apache.org
> Subject: Re: Very slow parsing of a few PDF files
> 
> 
> 
> On 2017-11-07 02:52, Jim Idle <ji...@proofpoint.com> wrote:
> > I have a few PDF files that are taking a very long time to parse.
> >
> > For instance I have a file that is 6.89MB that is taking minutes to parse.
> > If I use jvisualvm and take a long sample, I get:
> >
> I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking around 5
> mins to run. While looking at the CHANGES file I noted some new stuff
> around xlsx, namely:
> 
> * Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
> * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
> 
> I then rolled back to version 1.15 and the same file took less than a second.
> 
> Is there a way to be sure whether these changes were responsible for the extra
> processing time? If so, how can I disable them?
> 
> Sorry, I can't share the file, but I can say it has some chart data:
> unzip -l tika-killer.xlsx | grep -c xl/chart
> 564
> 
> 
> José Borges Ferreira
