Re: Parsing huge PDF (400Mb, 2700 pages)

Tilman Hausherr Thu, 14 Nov 2019 09:05:18 -0800

The PDF can be much bigger than 3GB when decompressed.
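
To illustrate why the on-disk size says little about the memory needed: PDF streams are commonly Flate-compressed, and Deflate can shrink repetitive data by a very large factor. The sketch below uses only the JDK and an artificial all-zero input, so it is a best-case ratio and not a measurement of any real PDF; the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class FlateRatioDemo {

    // Deflate-compresses rawSize zero bytes and returns the compressed size.
    // All-zero input is an assumption chosen to show the extreme case.
    static int compressedSize(int rawSize) {
        byte[] raw = new byte[rawSize];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // release native zlib resources
        }
        return sink.size();
    }

    public static void main(String[] args) {
        int rawSize = 10 * 1024 * 1024;
        int packed = compressedSize(rawSize);
        System.out.println("raw=" + rawSize + " bytes, compressed=" + packed
                + " bytes, ratio ~" + (rawSize / packed) + ":1");
    }
}
```

Real content streams compress far less than this, but the direction is the same: a 400MB file can expand well beyond the available heap once its streams are decoded.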


What you could try:

1) use a scratch file when opening the document (this will be even slower)
2) use the on-demand parser, see
https://issues.apache.org/jira/browse/PDFBOX-4569

There is a branch for it on the svn server; you have to build it from source.
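
For readers unfamiliar with the scratch-file idea in 1): data is kept in main memory up to a limit and written to a temporary file beyond that, trading speed for heap. In PDFBox 2.0.x this is, as far as I know, selected by passing a `MemoryUsageSetting` (for example `MemoryUsageSetting.setupTempFileOnly()`) to `PDDocument.load`. The sketch below is not PDFBox code; it is a JDK-only illustration of the general technique, and `SpillBuffer` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: buffers in memory up to a limit, then spills to a temp file. */
public class SpillBuffer implements AutoCloseable {
    private final int maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path scratchFile;          // stays null until the limit is exceeded
    private OutputStream scratchOut;
    private long size;

    public SpillBuffer(int maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    public void write(byte[] chunk) {
        try {
            if (scratchOut == null && memory.size() + chunk.length > maxMemoryBytes) {
                // Limit exceeded: move the buffered bytes to disk and drop the heap copy.
                scratchFile = Files.createTempFile("scratch", ".bin");
                scratchOut = Files.newOutputStream(scratchFile);
                memory.writeTo(scratchOut);
                memory = null;
            }
            if (scratchOut != null) {
                scratchOut.write(chunk);
            } else {
                memory.write(chunk, 0, chunk.length);
            }
            size += chunk.length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public boolean isOnDisk() { return scratchFile != null; }

    public long size() { return size; }

    @Override
    public void close() {
        try {
            if (scratchOut != null) {
                scratchOut.close();
                Files.deleteIfExists(scratchFile); // scratch data is temporary
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (SpillBuffer buffer = new SpillBuffer(1024)) {
            buffer.write(new byte[512]);
            buffer.write(new byte[4096]);
            System.out.println("bytes=" + buffer.size() + ", onDisk=" + buffer.isOnDisk());
        }
    }
}
```

The disk round trip is why this option is slower: it caps heap usage but does not reduce the amount of data that has to be processed.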

Tilman

On 14.11.2019 at 17:15, Ribeaud, Christian (Ext) wrote:


Good evening,

No, I am NOT using tika-server. And I am a bit surprised to hear (read) that PDFBox does NOT stream the PDF.


So let's wait for the PDFBox colleagues' feedback. Thanks anyway for yours.

christian

*From:*Tim Allison <talli...@apache.org>
*Sent:* Thursday, 14 November 2019 15:07
*To:* u...@tika.apache.org
*Cc:* users@pdfbox.apache.org
*Subject:* Re: Parsing huge PDF (400Mb, 2700 pages)

CC'ing colleagues on PDFBox...any recommendations?

Sergey's recommendation is great for documents that can be parsed via streaming. However, PDFBox does not currently parse PDFs in a streaming mode. It builds the full document tree -- PDFBox colleagues, let me know if I'm wrong.

On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin <sberyoz...@gmail.com<mailto:sberyoz...@gmail.com>> wrote:


    Hi,

    Are you using tika-server? If so, and if you can submit the data
    as a multipart/form-data payload, that may help: CXF (used by
    tika-server) should make a best effort to save the multipart
    payloads to temp locations on disk, and thus minimize the
    memory requirements.

    Cheers, Sergey

    On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext)
    <christian.ribe...@novartis.com
    <mailto:christian.ribe...@novartis.com>> wrote:

        Hi,

        My application handles all kinds of documents (mainly PDFs). In
        a very few cases, huge PDFs (up to 500MB) can be expected.

        At around 400MB I am hitting the wall: parsing takes ages
        (although it is quite fast at the beginning). I've tried several
        ideas but none of them brought the desired improvement.

        I have the impression that memory plays a role. I have no more
        than 3GB (and I think this should be enough, as we are
        streaming the document and using an event-based XML parser).

        Are there things I should be aware of?

        Any hint would be very welcome. Thanks and have a nice day,

        christian
