It boggles my mind that SAX parsing would take 5 minutes, but, um, maybe? Now that I think about it, there was a beastly pptx file that someone submitted on our JIRA that did take 2 minutes, so, maybe???
Open an issue in our JIRA to make extraction of charts/diagrams configurable, and you'll be able to tell. 😊

-----Original Message-----
From: undersp...@gmail.com [mailto:undersp...@gmail.com]
Sent: Wednesday, November 15, 2017 10:23 AM
To: user@tika.apache.org
Subject: Re: Very slow parsing of a few PDF files

On 2017-11-07 02:52, Jim Idle <ji...@proofpoint.com> wrote:
> I have a few PDF files that are taking a very long time to parse.
>
> For instance I have a file that is 6.89MB that is taking minutes to parse. If
> I use jvisualvm and take a long sample, I get:
>

I'm having a similar problem (yes, with XLS). A 2MB xlsx is taking around 5 mins to run.
While looking at the CHANGES file I noted some new stuff around xlsx, namely:

* Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
* Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).

I then rolled back to version 1.15 and the same file took less than a second.
Is there a way to be sure if these changes were responsible for the extra processing time? If so, how can I disable them?

Sorry, but I can't share the file, but I can say it has some chart data.

unzip -l tika-killer.xlsx | grep -c xl/chart
564

José Borges Ferreira
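
For anyone without `unzip` and `grep` at hand, the `unzip -l tika-killer.xlsx | grep -c xl/chart` check can be approximated with a few lines of Python. This is only a rough sketch: the function name `count_matching_entries` and its `needle` parameter are made up for illustration, and it counts archive entries whose path contains the substring rather than lines of `unzip` output, so the number may differ slightly in edge cases.

```python
import zipfile


def count_matching_entries(archive_path, needle="xl/chart"):
    """Count entries in a zip-based Office file whose path contains `needle`.

    Hypothetical helper, not part of Tika. An .xlsx file is a zip archive,
    so chart parts show up as entries under paths such as xl/charts/.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return sum(1 for name in archive.namelist() if needle in name)
```

Calling `count_matching_entries("tika-killer.xlsx")` on a file you suspect is chart-heavy gives a quick indication of how many chart-related parts a parser would have to visit.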