Did you check what it was doing by taking a thread dump or using a profiler?
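In case it helps, here is a minimal sketch of taking a thread dump from inside the JVM with the standard `java.lang.management` API. The class name `ThreadDumpSketch` and the `slow-parse` worker thread are hypothetical; the busy loop only stands in for the slow parse call and is not Tika code. Running `jstack <pid>` against the live process should give comparable output without any code changes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Returns the name, state and stack trace of every live thread as text. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip lock details, we only want the stack frames
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the slow parse: spin for about 300 ms.
        Thread worker = new Thread(() -> {
            long end = System.nanoTime() + 300_000_000L;
            while (System.nanoTime() < end) {
                // busy wait
            }
        }, "slow-parse");
        worker.start();

        // Give the worker time to get going, then capture what it is doing.
        Thread.sleep(100);
        System.out.println(dumpAllThreads());
        worker.join();
    }
}
```

Taking a few dumps some seconds apart and comparing the frames of the parsing thread usually shows where the time is going.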

On Sun, Dec 22, 2013 at 3:25 PM, Clemens Wyss DEV <[email protected]> wrote:

> Issued a bug https://issues.apache.org/jira/browse/TIKA-1213, although
> I'm not sure whether it's a bug or me applying the API inappropriately.
>
> Could the newly introduced NonSequentialPDFParser "help"?
>
> -----Original Message-----
> From: Clemens Wyss DEV [mailto:[email protected]]
> Sent: Sunday, December 22, 2013 10:08
> To: [email protected]
> Subject: How can parsing a 5Mb take 3minutes?
>
> I have a 3Mb PDF file (and others) that takes 3 minutes to extract its
> content. In my test I am using AutoDetectParser (and PDFParser).
> I have built Tika from sources, i.e. I am using a 1.5 snapshot.
>
> Can anybody explain why/how this is possible?
>
> Where/how can I send the document in question?
>
> Regards
> Clemens
>

--
Jeroen Reijn
Hippo
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 101 Main Street, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
http://about.me/jeroenreijn
