50 seconds to get the text from a 200-page PDF seems slow to me, though that is based on intuition rather than measurement. It suggests that there may be some inefficiency in the code; I would at least check for that before concluding that this is the best speed possible.
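One way to replace intuition with measurement is to time the extraction page by page and see whether the cost is spread evenly or concentrated in a few pages. The following is only a minimal sketch under stated assumptions: `ExtractionTimer` and its `extractPage` callback are hypothetical names, not part of PDFBox, and the callback here is a stand-in that returns dummy text.

```java
import java.util.function.IntFunction;

/** Hypothetical helper (not part of PDFBox): times a per-page extraction callback. */
final class ExtractionTimer {

    /** Outcome of one timing run. */
    static final class Result {
        final long[] nanosPerPage; // elapsed time for each page, in nanoseconds
        final long totalNanos;     // sum of all per-page times
        final long totalChars;     // characters returned, to confirm text really came back

        Result(long[] nanosPerPage, long totalNanos, long totalChars) {
            this.nanosPerPage = nanosPerPage;
            this.totalNanos = totalNanos;
            this.totalChars = totalChars;
        }

        /** 0-based index of the slowest page, or -1 if no pages were timed. */
        int slowestPage() {
            int slowest = -1;
            for (int i = 0; i < nanosPerPage.length; i++) {
                if (slowest < 0 || nanosPerPage[i] > nanosPerPage[slowest]) {
                    slowest = i;
                }
            }
            return slowest;
        }
    }

    /** Calls extractPage for pages 0..pageCount-1 and records how long each call takes. */
    static Result time(int pageCount, IntFunction<String> extractPage) {
        long[] perPage = new long[pageCount];
        long total = 0;
        long chars = 0;
        for (int page = 0; page < pageCount; page++) {
            long start = System.nanoTime();
            String text = extractPage.apply(page);
            long elapsed = System.nanoTime() - start;
            perPage[page] = elapsed;
            total += elapsed;
            if (text != null) {
                chars += text.length();
            }
        }
        return new Result(perPage, total, chars);
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): replace with a real per-page extraction call.
        Result r = time(200, page -> "text of page " + page);
        System.out.printf("total: %.1f ms, chars: %d, slowest page index: %d%n",
                r.totalNanos / 1e6, r.totalChars, r.slowestPage());
    }
}
```

In real use the callback would wrap whatever call extracts a single page, for example a PDFBox `PDFTextStripper` with `setStartPage` and `setEndPage` both set to the same (1-based) page number. The first few pages will also include class loading and JIT warm-up, so their timings should be read with that in mind.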
Considering that I can render a 200-page document from XML source to PDF in a minute or two using XSL-FO and complex XSLT processing, which is pretty data-processing intensive, it doesn't seem like just extracting the text from the PDF should take a comparable amount of time, although there is some data processing involved there as well, of course (for example, decoding all the encoded strings).

It would be useful to see if different Acrobat-provided PDF optimizations make a difference, like making the PDF streaming-enabled or turning off compression.

Cheers,

Eliot

--
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

On 12/23/13, 9:59 AM, "Peter Murray-Rust" <[email protected]> wrote:

>Your document has 265 pages. What are you comparing with what? Your
>document against another document? Or PDFBox against other code? I have
>run your document and it runs at the same speed as most others: it takes
>50 secs for the first 200 pp on my machine. It will depend at least on the
>speed of your machine and the number of processors that can be
>parallelised.
>
>
>On Mon, Dec 23, 2013 at 3:12 PM, Clemens Wyss DEV
><[email protected]> wrote:
>
>> Opened an issue for this:
>> https://issues.apache.org/jira/browse/PDFBOX-1821
>>
>> -----Original Message-----
>> From: Clemens Wyss - MySign AG [mailto:[email protected]]
>> Sent: Sunday, December 22, 2013 17:37
>> To: '[email protected]'
>> Subject: Parsing a pdf file takes 3minutes
>>
>> I initially posted this question on the Tika mailing list, and I even
>> created an issue for it:
>> https://issues.apache.org/jira/browse/TIKA-1213
>> Hopefully now being on the right list, I will rephrase the problem I am
>> confronted with:
>> I have (several) PDF documents which take up to 3 minutes to be
>> parsed/extracted (for later Lucene indexing).
>> For example, the PDF which is attached to the JIRA issue requires
>> 3 minutes.
>>
>> How/why is this possible? How can I improve on this?
>>
>> Any help appreciated
>> Clemens
>>
>
>
>
>--
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK
>+44-1223-763069

