RE: Very slow parsing of a few PDF files

Allison, Timothy B. Thu, 16 Nov 2017 08:04:46 -0800

It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe?  Now 
that I think about it there was a beastly pptx file that someone submitted on 
our JIRA that did take 2 minutes, so, maybe???

Open an issue in our JIRA to make extraction of charts/diagrams configurable, 
and you'll be able to tell. 😊

-----Original Message-----
From: undersp...@gmail.com [mailto:undersp...@gmail.com] 
Sent: Wednesday, November 15, 2017 10:23 AM
To: user@tika.apache.org
Subject: Re: Very slow parsing of a few PDF files

On 2017-11-07 02:52, Jim Idle <ji...@proofpoint.com> wrote: 
> I have a few PDF files that are taking a very long time to parse.
> 
> For instance I have a file that is 6.89MB that is taking minutes to parse. If 
> I use jvisualvm and take a long sample, I get:
> 
I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking around 5 
minutes to parse. While looking at the CHANGES file I noticed some new stuff 
around xlsx, namely:

* Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
* Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).

I then rolled back to version 1.15 and the same file took less than a second.

Is there a way to be sure that these changes are responsible for the extra 
processing time? If so, how can I disable them?

Sorry, I can't share the file, but I can say it has some chart data:
 unzip  -l  tika-killer.xlsx  | grep -c xl/chart
564
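
For anyone who wants the same count without unzip/grep, here is a rough Python sketch using only the standard library. It assumes the .xlsx is an ordinary ZIP archive; the function names count_parts and parts_by_folder are made up for this illustration and are not part of Tika. Like the grep above, it matches any entry whose name contains "xl/chart", so relationship and style parts under xl/charts/ are counted too.

```python
import zipfile
from collections import Counter


def count_parts(path, needle="xl/chart"):
    """Count ZIP entries whose name contains `needle` (like `grep -c`)."""
    with zipfile.ZipFile(path) as zf:
        return sum(1 for name in zf.namelist() if needle in name)


def parts_by_folder(path):
    """Tally archive entries by parent folder, to see where the bulk is."""
    with zipfile.ZipFile(path) as zf:
        folders = (
            name.rsplit("/", 1)[0] if "/" in name else ""
            for name in zf.namelist()
        )
        return Counter(folders)


if __name__ == "__main__":
    import sys

    # Usage: python count_parts.py tika-killer.xlsx
    target = sys.argv[1]
    print(count_parts(target))
    for folder, n in parts_by_folder(target).most_common(5):
        print(n, folder)
```

The per-folder tally is only there to show whether chart parts dominate the archive. It says nothing by itself about where the parse time goes.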

José Borges Ferreira
