Re: Parsing a pdf file takes 3minutes

Eliot Kimber Mon, 23 Dec 2013 08:17:04 -0800

50 seconds to get the text from a 200-page PDF seems slow to me, though
that is based on intuition rather than measurement. It suggests that there
may be some inefficiency in the code. I would at least check for that
before concluding that the speed is the best possible.
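
One way to replace intuition with measurement is to time the extraction page by page and see whether the cost is spread evenly or concentrated in a few pages. The following is only a minimal sketch under stated assumptions: the class and method names (PageTimingSketch, timePages, slowest) are hypothetical, and the extractor is a stand-in lambda so that the code runs without any PDF library. With PDFBox, the lambda would presumably wrap a PDFTextStripper restricted to one page via setStartPage/setEndPage before calling getText, with the IOException handled inside the lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PageTimingSketch {

    /** Time spent on one page, plus how much text came back. */
    public static final class PageTiming {
        public final int page;
        public final long millis;
        public final int chars;

        public PageTiming(int page, long millis, int chars) {
            this.page = page;
            this.millis = millis;
            this.chars = chars;
        }
    }

    /** Calls the extractor once per page (1-based) and records the elapsed time. */
    public static List<PageTiming> timePages(int pageCount, IntFunction<String> extractPage) {
        List<PageTiming> timings = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            long start = System.nanoTime();
            String text = extractPage.apply(page);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            timings.add(new PageTiming(page, elapsedMs, text == null ? 0 : text.length()));
        }
        return timings;
    }

    /** Returns the page with the largest elapsed time, or null for an empty list. */
    public static PageTiming slowest(List<PageTiming> timings) {
        PageTiming worst = null;
        for (PageTiming t : timings) {
            if (worst == null || t.millis > worst.millis) {
                worst = t;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): replace the body with a real
        // per-page text extraction call for the library under test.
        IntFunction<String> standIn = page -> "text of page " + page;

        List<PageTiming> timings = timePages(5, standIn);
        long total = 0;
        for (PageTiming t : timings) {
            total += t.millis;
        }
        PageTiming worst = slowest(timings);
        System.out.println("total ms: " + total);
        System.out.println("slowest page: " + worst.page + " (" + worst.millis + " ms)");
    }
}
```

If a handful of pages account for most of the total, that points at specific content (fonts, large streams) rather than at the library as a whole. The first timed call also includes JVM warm-up, so it is worth running the loop twice and reading the second pass.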


Considering that I can render a 200-page document from XML source to PDF
in a minute or two using XSL-FO and complex XSLT processing, which is
pretty data processing intensive, it doesn’t seem like just extracting the
text from the PDF should take a comparable amount of time, although there
is some data processing involved there as well, of course (for example,
decoding all the encoded strings).

It would be useful to see whether different Acrobat-provided PDF
optimizations make a difference, such as enabling streaming for the PDF or
turning off compression.
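
To get a rough feel for how much of the time decompression alone could account for, the sketch below deflates and then inflates a few megabytes of synthetic text with java.util.zip, which implements the same zlib/deflate format that PDF's FlateDecode filter uses. This is an illustration under assumptions, not a measurement of the document in question: the class name FlateCostSketch is hypothetical, and the payload is made-up content-stream-like text rather than data from a real PDF.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateCostSketch {

    /** Compresses the input in zlib format with the default compression level. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses zlib data; malformed or truncated input raises IllegalArgumentException. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Synthetic payload (assumption): text shaped loosely like PDF text operators.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("BT /F1 12 Tf 72 ").append(700 - (i % 600))
              .append(" Td (line ").append(i).append(") Tj ET\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);

        long start = System.nanoTime();
        byte[] unpacked = inflate(packed);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("raw bytes: " + raw.length + ", compressed: " + packed.length);
        System.out.println("inflate took " + elapsedMs + " ms for " + unpacked.length + " bytes");
    }
}
```

If inflating several megabytes takes only a small fraction of the observed extraction time, compression is unlikely to be the main cost, and saving the PDF uncompressed would be expected to help little. Testing with the actual file remains the only reliable check.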

Cheers,

Eliot
-- 
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com




On 12/23/13, 9:59 AM, "Peter Murray-Rust" <[email protected]> wrote:

>Your document has 265 pages. What are you comparing with what? Your
>document against another document? Or PDFBox against other code? I have
>run your document and it runs at the same speed as most others: it takes
>50 secs for the first 200 pp. on my machine. It will depend at least on
>the speed of your machine and the number of processors that can be used
>in parallel.
>
>
>On Mon, Dec 23, 2013 at 3:12 PM, Clemens Wyss DEV
><[email protected]> wrote:
>
>> I opened an issue for this:
>> https://issues.apache.org/jira/browse/PDFBOX-1821
>>
>> -----Original Message-----
>> From: Clemens Wyss - MySign AG [mailto:[email protected]]
>> Sent: Sunday, 22 December 2013 17:37
>> To: '[email protected]'
>> Subject: Parsing a pdf file takes 3minutes
>>
>> I initially posted this question on the Tika mailing list, and I even
>> created an issue for it:
>> https://issues.apache.org/jira/browse/TIKA-1213
>> Hopefully now on the right list, I will rephrase the problem I am
>> facing:
>> I have (several) PDF documents which take up to 3 minutes to be
>> parsed/extracted (for later Lucene indexing).
>> For example, the PDF attached to the Jira issue takes 3 minutes.
>>
>> How/why is this possible? How can I improve on this?
>>
>> Any help appreciated
>> Clemens
>>
>
>
>
>-- 
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK
>+44-1223-763069
