Re: Stream parsing huge PDF document in order to prevent memory issues

Maruan Sahyoun Thu, 06 Mar 2014 06:48:49 -0800

Yes please, file a bug report together with a sample PDF and sample code to 
reproduce the issue. Which PDFBox version are you using?


BR
Maruan Sahyoun

On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:

> Hi there,
> 
> So I tried using the NonSequentialPDFParser with the
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property set
> to true.
> 
> The memory footprint looks much better; however, I can't get the individual
> pages due to an NPE in the getPage code.
> 
> It turns out that resDict below is null for most pages, which in turn causes
> an NPE in parseDictObjects.
> 
> Should I file a bug?
> 
> Stefan
> 
> 
>    public PDPage getPage(int pageNr) throws IOException
>    {
>        getPagesObject();
> 
>        // ---- get list of top level pages
>        COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
> 
>        if (kids == null)
>        {
>            throw new IOException("Missing 'Kids' entry in pages dictionary.");
>        }
> 
>        // ---- get page we are looking for (possibly going recursively into
>        // subpages)
>        COSObject pageObj = getPageObject(pageNr, kids, 0);
> 
>        if (pageObj == null)
>        {
>            throw new IOException("Page " + pageNr + " not found.");
>        }
> 
>        // ---- parse all objects necessary to load page.
>        COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> 
>        if (parseMinimalCatalog && (!allPagesParsed))
>        {
>            // parse page resources since we did not do this on start
>            COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
>            parseDictObjects(resDict);
>        }
> 
>        return new PDPage(pageDict);
>    }
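
One possible explanation for the null resDict, offered as an assumption rather than a diagnosis: in PDF the Resources entry is inheritable, so a page dictionary may omit it and rely on an ancestor Pages node. A minimal, library-free sketch of such a lookup follows. Plain Maps are hypothetical stand-ins for COSDictionary, and the class and method names are invented for illustration; none of this is PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a PDF page tree; plain Maps model dictionaries. */
final class InheritedResources {

    /** Upper bound on tree depth so a malformed Parent cycle cannot loop forever. */
    private static final int MAX_DEPTH = 64;

    /**
     * Returns the nearest Resources dictionary for a page tree node by walking
     * the Parent chain, or null if neither the node nor any ancestor has one.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> findResources(Map<String, Object> node) {
        Map<String, Object> current = node;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            Object resources = current.get("Resources");
            if (resources instanceof Map) {
                return (Map<String, Object>) resources;
            }
            Object parent = current.get("Parent");
            current = (parent instanceof Map) ? (Map<String, Object>) parent : null;
        }
        return null; // callers must handle the "no resources" case explicitly
    }

    public static void main(String[] args) {
        Map<String, Object> resources = new HashMap<>();
        resources.put("Font", "F1");

        Map<String, Object> pages = new HashMap<>();
        pages.put("Resources", resources);

        // The page itself has no Resources entry, only a Parent link.
        Map<String, Object> page = new HashMap<>();
        page.put("Parent", pages);

        Map<String, Object> found = findResources(page);
        System.out.println(found == null ? "no resources" : "resources: " + found.keySet());
    }
}
```

Whatever the cause turns out to be, the point of the sketch is that a missing Resources entry on the page dictionary is a legal state, so code consuming it needs a null check before parsing.
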
> 
> 
> 
> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
> 
>> Hi,
>> 
>> PDF is a random access format: the key information (the cross-reference
>> table, which records where to find each object) sits at the end of the
>> file, and the PDF objects themselves are spread throughout the file.
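
To illustrate why the end of the file matters: a PDF normally ends with the keyword startxref, the byte offset of the cross-reference section, and %%EOF. The sketch below reads only the tail of a file and extracts that offset. It is a simplified illustration with invented names (StartXrefLocator, TAIL_BYTES), not how PDFBox does it, and it ignores incremental updates and damaged files.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative helper: locates the cross-reference offset from a PDF's tail. */
final class StartXrefLocator {

    /** How many trailing bytes to inspect; an arbitrary choice for this sketch. */
    private static final int TAIL_BYTES = 1024;

    /** Parses the offset following the last "startxref" keyword, or -1 if absent. */
    static long parseStartXref(byte[] tail) {
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + "startxref".length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
            pos++;
        }
        return pos > start ? Long.parseLong(text.substring(start, pos)) : -1;
    }

    /** Reads at most TAIL_BYTES from the end of the file; never loads the whole file. */
    static long findStartXref(Path pdf) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(pdf.toFile(), "r")) {
            long length = file.length();
            int tailSize = (int) Math.min(TAIL_BYTES, length);
            byte[] tail = new byte[tailSize];
            file.seek(length - tailSize);
            file.readFully(tail);
            return parseStartXref(tail);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println("xref offset: " + findStartXref(Paths.get(args[0])));
        } else {
            byte[] demo = "trailer\n<< /Size 5 >>\nstartxref\n1234\n%%EOF\n"
                    .getBytes(StandardCharsets.ISO_8859_1);
            System.out.println("xref offset: " + parseStartXref(demo));
        }
    }
}
```

Because the reader has to jump to that offset first and then to each object, a forward-only, StAX-style pass over the bytes does not map naturally onto the format.
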
>> 
>> You can use the NonSequentialPDFParser by calling PDDocument.loadNonSeq
>> instead of PDDocument.load and setting the system property
>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal to true,
>> which does a minimal parsing of the PDF. That could reduce the memory
>> consumption a little. Unfortunately, once an object has been parsed its
>> content stays in memory, so you would need to do the low-level parsing
>> yourself with the information available from the initial parsing stage.
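
A minimal sketch of setting that property from code, assuming it has to be in place before the document is loaded. Only the JDK is used so the snippet runs on its own; the class name ParseMinimalSetup is invented, and the PDFBox call is shown as a comment because it needs PDFBox 1.8.x on the classpath.

```java
/** Illustrative setup helper; only the property name comes from the thread. */
final class ParseMinimalSetup {

    static final String PARSE_MINIMAL_PROPERTY =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal";

    /** Enables minimal parsing for parsers created after this call (assumption). */
    static void enableParseMinimal() {
        System.setProperty(PARSE_MINIMAL_PROPERTY, "true");
    }

    public static void main(String[] args) {
        enableParseMinimal();

        // With PDFBox 1.8.x available, the document would then be opened with
        //   PDDocument.loadNonSeq(file, scratchFile)
        // instead of PDDocument.load(...).

        System.out.println(PARSE_MINIMAL_PROPERTY + "="
                + System.getProperty(PARSE_MINIMAL_PROPERTY));
    }
}
```

Passing the same name as a -D option on the java command line has the same effect without touching the code.
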
>> 
>> Maruan Sahyoun
>> 
>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
>> 
>>> Hi there,
>>> 
>>> I'm trying to validate arbitrary PDFs (potentially huge, hundreds of MB)
>>> according to the following rule set:
>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>> - There should be no content within a certain rectangular area of a page
>>> (left margin where the print shop inserts a bar code)
>>> - Number of pages should be less than N
>>> - PDF version used
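
The first rule can be sketched independently of any PDF library. Page boxes are expressed in points (1/72 inch) in default user space, so the check converts millimetres to points and compares with a tolerance. The class name, the tolerance parameter, and the choice to accept landscape pages are assumptions made for this illustration.

```java
/** Illustrative A4 size check on page dimensions given in points. */
final class A4Check {

    /** 1 inch = 25.4 mm = 72 points. */
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /**
     * Returns true if the page is 210 mm x 297 mm within the given tolerance.
     * Orientation is ignored, so landscape A4 also passes (an assumption).
     */
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double shortSide = Math.min(widthPt, heightPt);
        double longSide = Math.max(widthPt, heightPt);
        return Math.abs(shortSide - mmToPoints(210)) <= tolerancePt
                && Math.abs(longSide - mmToPoints(297)) <= tolerancePt;
    }

    public static void main(String[] args) {
        System.out.println("A4 portrait accepted: " + isA4(595.28, 841.89, 1.0));
        System.out.println("US Letter accepted:   " + isA4(612.0, 792.0, 1.0));
    }
}
```

The width and height would come from each page's media box (or crop box, depending on what the print shop uses), which is the part that needs the PDF parser.
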
>>> 
>>> So far we've been using PDDocument.load with a scratch file, but with
>>> huge documents (e.g. product catalogues), memory usage explodes.
>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>> document (e.g. using StAX) and validate one page at a time?
>>> 
>>> Cheers
>>> 
>>> Stefan
>> 
>> 
> 
> 
> -- 
> BEKK Open
> http://open.bekk.no
> 
> TesTcl - a unit test framework for iRules
> http://testcl.com
