Re: Stream parsing huge PDF document in order to prevent memory issues

Stefan Magnus Landrø Thu, 06 Mar 2014 14:54:07 -0800

Hi Maruan,

So I created a small Maven project containing a PDF file I just generated
on my Mac, and pushed it to https://github.com/landro/pdfboxbug
I could create a zip and upload it to your bug tracker, but that feels kinda
awkward.
What do you prefer?


Stefan



2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:

> Yes please, file a bug report together with a sample PDF and sample code
> to reproduce the issue. Which PDFBox version are you using?
>
> BR
> Maruan Sahyoun
>
> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
>
> > Hi there,
> >
> > So I tried using the NonSequentialParser, setting the
> > org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
> > to true.
> >
> > The memory footprint looks much better. However, I can't get the
> > individual pages due to an NPE in the getPage code.
> >
> > It turns out the resDict below is mostly null, which in turn causes an NPE
> > in parseDictObjects.
> >
> > Should I file a bug?
> >
> > Stefan
> >
> >
> >    public PDPage getPage(int pageNr) throws IOException
> >    {
> >        getPagesObject();
> >
> >        // ---- get list of top level pages
> >        COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
> >
> >        if (kids == null)
> >        {
> >            throw new IOException("Missing 'Kids' entry in pages dictionary.");
> >        }
> >
> >        // ---- get page we are looking for (possibly going recursively
> >        // into subpages)
> >        COSObject pageObj = getPageObject(pageNr, kids, 0);
> >
> >        if (pageObj == null)
> >        {
> >            throw new IOException("Page " + pageNr + " not found.");
> >        }
> >
> >        // ---- parse all objects necessary to load page.
> >        COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> >
> >        if (parseMinimalCatalog && (!allPagesParsed))
> >        {
> >            // parse page resources since we did not do this on start
> >            COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
> >            parseDictObjects(resDict);
> >        }
> >
> >        return new PDPage(pageDict);
> >    }
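
One thing that might explain the null, though I have not verified it against
the sample file: in PDF, a page's Resources entry is inheritable, so it may
sit on an ancestor Pages node rather than on the page dictionary itself. Below
is a minimal sketch of that lookup. It uses plain maps as stand-ins for
COSDictionary, and the class and method names are hypothetical, not PDFBox API.

```java
import java.util.Map;

public class InheritedResources {
    // Upper bound on the walk so a malformed file with a Parent cycle cannot loop forever.
    private static final int MAX_DEPTH = 64;

    /**
     * Returns the first "Resources" dictionary found on the node itself or on
     * one of its ancestors (following "Parent"), or null if none carries one.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> findResources(Map<String, Object> node) {
        for (int depth = 0; node != null && depth < MAX_DEPTH; depth++) {
            Object resources = node.get("Resources");
            if (resources instanceof Map) {
                return (Map<String, Object>) resources;
            }
            // No resources here: move up to the parent node, if there is one.
            Object parent = node.get("Parent");
            node = (parent instanceof Map) ? (Map<String, Object>) parent : null;
        }
        return null;
    }
}
```

Either way, the caller would still need a null check before handing the
result to something like parseDictObjects, since a page may legitimately have
no resources at all.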
> >
> >
> >
> > 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
> >
> >> Hi,
> >>
> >> PDF is a random access format with key information (the cross-reference
> >> table, which says where to find the objects) at the end of the file and
> >> the PDF objects spread around the file.
> >>
> >> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> >> instead of PDDocument.load and setting the system property
> >> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which
> >> does a minimal parsing of the PDF. That could reduce the memory
> >> consumption a little bit. Unfortunately, once an object has been parsed
> >> its content stays in memory, so you would need to do low-level parsing
> >> yourself with the information available from the initial parsing stage.
> >>
> >> Maruan Sahyoun
> >>
> >> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
> >>
> >>> Hi there,
> >>>
> >>> I'm trying to validate random PDFs (potentially huge, 100s of MBs)
> >>> according to the following rule set:
> >>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> >>> - There should be no content within a certain rectangular area of a
> >>>   page (left margin where the print shop inserts a bar code)
> >>> - Number of pages should be less than N
> >>> - PDF version used
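
As a rough sketch of the A4 rule above: PDF page boxes are expressed in points
(1/72 inch), so the check comes down to a unit conversion plus a tolerance.
The class name, method names and the tolerance parameter are my own
assumptions, and the sketch ignores page rotation and any UserUnit scaling.

```java
public class A4Check {
    // One PDF point is 1/72 inch; one inch is 25.4 mm.
    static final double MM_PER_POINT = 25.4 / 72.0;

    static final double A4_SHORT_MM = 210.0;
    static final double A4_LONG_MM = 297.0;

    static double pointsToMm(double points) {
        return points * MM_PER_POINT;
    }

    /**
     * Returns true if a page of the given size (in points) is A4 within
     * toleranceMm on each side. Portrait and landscape are both accepted.
     */
    static boolean isA4(double widthPt, double heightPt, double toleranceMm) {
        double w = pointsToMm(widthPt);
        double h = pointsToMm(heightPt);
        double shortSide = Math.min(w, h);
        double longSide = Math.max(w, h);
        return Math.abs(shortSide - A4_SHORT_MM) <= toleranceMm
                && Math.abs(longSide - A4_LONG_MM) <= toleranceMm;
    }
}
```

The width and height would come from each page's media box, however it is
read. Accepting landscape is a choice made for the sketch; if the print shop
requires portrait, compare width and height directly instead of min and max.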
> >>>
> >>> So far we've been using PDDocument.load with a scratch file, but with
> >>> huge documents (e.g. product catalogues), memory usage explodes.
> >>> Is there a way to stream parse a PDF similar to stream parsing an XML
> >>> document (e.g. using StAX) and validate one page at a time?
> >>>
> >>> Cheers
> >>>
> >>> Stefan
> >>
> >>
> >
> >
> > --
> > BEKK Open
> > http://open.bekk.no
> >
> > TesTcl - a unit test framework for iRules
> > http://testcl.com
>
>


-- 
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com
