Re: Stream parsing huge PDF document in order to prevent memory issues

Maruan Sahyoun Fri, 07 Mar 2014 03:49:26 -0800

Hi Stefan,

unfortunately, this seems to be a bug. When the parseMinimal property is set
to true, indirect objects are not followed when the PDF is parsed. May I ask you
to file an issue in Jira [https://issues.apache.org/jira/browse/PDFBOX/] and
attach the PDF file in question?
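
To illustrate what goes wrong, here is a toy model in plain Java. It is not
PDFBox code, and every name in it (IndirectRefSketch, Ref, loaded, resolve,
parseResources) is made up for the example. A dictionary entry is either a
direct value or a reference to a numbered object; if the referenced object was
never loaded, the lookup returns null, which is presumably what happens to
resDict here. The guard at the end only shows how such a null could be
tolerated; it is not the actual fix.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not PDFBox classes.
final class IndirectRefSketch {

    /** Stand-in for an indirect reference such as "12 0 R". */
    static final class Ref {
        final int objectNumber;

        Ref(int objectNumber) {
            this.objectNumber = objectNumber;
        }
    }

    /** Objects that have been loaded so far, keyed by object number. */
    final Map<Integer, Object> loaded = new HashMap<>();

    /**
     * Returns the direct value for the key, or the object a reference
     * points to. Returns null if the key is absent or if the referenced
     * object was never loaded.
     */
    Object resolve(Map<String, Object> dict, String key) {
        Object value = dict.get(key);
        if (value instanceof Ref) {
            return loaded.get(((Ref) value).objectNumber);
        }
        return value;
    }

    /**
     * Guarded version of a "parse page resources" step: reports false
     * instead of failing when the resources cannot be resolved.
     */
    boolean parseResources(Map<String, Object> pageDict) {
        Object resources = resolve(pageDict, "Resources");
        if (!(resources instanceof Map)) {
            return false; // missing or unresolved, nothing to walk
        }
        // A real parser would walk the entries of the dictionary here.
        return true;
    }
}
```

With an empty `loaded` map, a page whose "Resources" entry is a `Ref` resolves
to null; once the referenced object is put into `loaded`, the same lookup
succeeds.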


BR
Maruan Sahyoun

On 07.03.2014 at 07:11, Maruan Sahyoun <[email protected]> wrote:

> Hi Stefan,
> 
> that is just fine. If I need more information, I’ll let you know.
> 
> BR
> Maruan Sahyoun
> 
> On 06.03.2014 at 23:53, Stefan Magnus Landrø <[email protected]> wrote:
> 
>> Hi Maruan,
>> 
>> So I created a small Maven project containing a PDF file I just generated
>> on my Mac, and pushed it to https://github.com/landro/pdfboxbug.
>> I could create a zip and upload it to your bug tracker, but that feels kinda
>> awkward.
>> What do you prefer?
>> 
>> Stefan
>> 
>> 
>> 
>> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:
>> 
>>> Yes please, file a bug report together with a sample PDF and sample code
>>> to reproduce the issue. Which PDFBox version are you using?
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
>>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
>>> 
>>>> Hi there,
>>>> 
>>>> So I tried using the NonSequentialParser, setting the
>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
>>>> to true.
>>>> 
>>>> The memory footprint looks much better; however, I can't get the
>>>> individual pages due to an NPE in the getPage code.
>>>> 
>>>> It turns out the resDict below is mostly null, which in turn causes an NPE
>>>> in parseDictObjects.
>>>> 
>>>> Should I file a bug?
>>>> 
>>>> Stefan
>>>> 
>>>> 
>>>>  public PDPage getPage(int pageNr) throws IOException
>>>>  {
>>>>      getPagesObject();
>>>> 
>>>>      // ---- get list of top level pages
>>>>      COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
>>>> 
>>>>      if (kids == null)
>>>>      {
>>>>          throw new IOException("Missing 'Kids' entry in pages dictionary.");
>>>>      }
>>>> 
>>>>      // ---- get page we are looking for (possibly going recursively into
>>>>      // subpages)
>>>>      COSObject pageObj = getPageObject(pageNr, kids, 0);
>>>> 
>>>>      if (pageObj == null)
>>>>      {
>>>>          throw new IOException("Page " + pageNr + " not found.");
>>>>      }
>>>> 
>>>>      // ---- parse all objects necessary to load page.
>>>>      COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>>>> 
>>>>      if (parseMinimalCatalog && (!allPagesParsed))
>>>>      {
>>>>          // parse page resources since we did not do this on start
>>>>          COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
>>>>          parseDictObjects(resDict);
>>>>      }
>>>> 
>>>>      return new PDPage(pageDict);
>>>>  }
>>>> 
>>>> 
>>>> 
>>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> PDF is a random access format: the key information (the cross reference,
>>>>> which records where to find the objects) sits at the end of the file, and
>>>>> the PDF objects are spread around the file.
>>>>> 
>>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>>>>> instead of PDDocument.load and setting the system property
>>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which
>>>>> does a minimal parsing of the PDF. That could reduce the memory consumption
>>>>> a little bit. Unfortunately, once an object has been parsed its content
>>>>> stays in memory, so you would need to do low-level parsing yourself with
>>>>> the information available from the initial parsing stage.
>>>>> 
>>>>> Maruan Sahyoun
>>>>> 
>>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
>>>>> 
>>>>>> Hi there,
>>>>>> 
>>>>>> I'm trying to validate random PDFs (potentially huge - 100s of MBs)
>>>>>> according to the following rule set:
>>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>>>>> - There should be no content within a certain rectangular area of a page
>>>>>> (left margin where the print shop inserts a bar code)
>>>>>> - Number of pages should be less than N
>>>>>> - PDF version used
>>>>>> 
>>>>>> So far we've been using PDDocument.load with a scratch file, but with huge
>>>>>> documents (e.g. product catalogues), things explode.
>>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>>>>> document (e.g. using StAX) and validate one page at a time?
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> Stefan
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> BEKK Open
>>>> http://open.bekk.no
>>>> 
>>>> TesTcl - a unit test framework for iRules
>>>> http://testcl.com
>>> 
>>> 
>> 
>> 
>
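
As a side note on the first rule in the original question (all pages A4,
297 mm * 210 mm): the size check itself does not depend on how the page
dimensions are obtained, so it can be sketched in plain Java. This is only a
sketch under a few assumptions: the class and method names (A4Check,
mmToPoints, isA4) and the tolerance parameter are made up for the example, the
page size is given in default PDF units of 1/72 inch, and either orientation is
accepted, which may or may not be what the print shop wants.

```java
// Sketch only: hypothetical helper, independent of any PDF library.
final class A4Check {

    /** Default PDF units are 1/72 inch, and one inch is 25.4 mm. */
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static final double A4_SHORT_MM = 210.0;
    static final double A4_LONG_MM = 297.0;

    private A4Check() {
    }

    /** Converts millimetres to PDF points. */
    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /**
     * Returns true if the given page size (in points) matches A4 within
     * the given tolerance (also in points). Portrait and landscape are
     * both accepted because the sides are compared after sorting.
     */
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double shortSide = Math.min(widthPt, heightPt);
        double longSide = Math.max(widthPt, heightPt);
        return Math.abs(shortSide - mmToPoints(A4_SHORT_MM)) <= tolerancePt
                && Math.abs(longSide - mmToPoints(A4_LONG_MM)) <= tolerancePt;
    }
}
```

For example, A4Check.isA4(595.28, 841.89, 1.0) holds for a typical A4 page,
while a US Letter page (612 x 792 points) fails the check. A page rotation
entry is not taken into account here.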