Hi Stefan,

unfortunately this seems to be a bug. When the parseMinimal property is set to true, indirect objects are not followed when the PDF is parsed. May I ask you to file an issue in Jira [https://issues.apache.org/jira/browse/PDFBOX/] and attach the PDF file in question?

BR
Maruan Sahyoun

On 07.03.2014 at 07:11, Maruan Sahyoun <[email protected]> wrote:

> Hi Stefan,
>
> just fine. If I need more information I’ll let you know.
>
> BR
> Maruan Sahyoun
>
> On 06.03.2014 at 23:53, Stefan Magnus Landrø <[email protected]> wrote:
>
>> Hi Maruan,
>>
>> So I created a small maven project containing a PDF-file I just generated
>> on my mac, and pushed it to https://github.com/landro/pdfboxbug
>> I could create a zip and upload to your bugtracker, but that feels kinda
>> awkward.
>> What do you prefer?
>>
>> Stefan
>>
>>
>>
>> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:
>>
>>> Yes please, file a bug report together with a sample PDF and sample code
>>> to reproduce the issue. Which PDFBox version are you using?
>>>
>>> BR
>>> Maruan Sahyoun
>>>
>>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> So I tried using the NonSequentialParser setting the
>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
>>>> to true.
>>>>
>>>> The memory footprint looks much better, however, I can't get the
>>>> individual pages due to an NPE in the getPage code.
>>>>
>>>> It turns out the resDict below is mostly null - which again causes an NPE
>>>> in parseDictObjects.
>>>>
>>>> Should I file a bug?
>>>>
>>>> Stefan
>>>>
>>>>
>>>> public PDPage getPage(int pageNr) throws IOException
>>>> {
>>>>     getPagesObject();
>>>>
>>>>     // ---- get list of top level pages
>>>>     COSArray kids = (COSArray)
>>>>         pagesDictionary.getDictionaryObject(COSName.KIDS);
>>>>
>>>>     if (kids == null)
>>>>     {
>>>>         throw new IOException("Missing 'Kids' entry in pages dictionary.");
>>>>     }
>>>>
>>>>     // ---- get page we are looking for (possibly going recursively into
>>>>     // subpages)
>>>>     COSObject pageObj = getPageObject(pageNr, kids, 0);
>>>>
>>>>     if (pageObj == null)
>>>>     {
>>>>         throw new IOException("Page " + pageNr + " not found.");
>>>>     }
>>>>
>>>>     // ---- parse all objects necessary to load page.
>>>>     COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>>>>
>>>>     if (parseMinimalCatalog && (!allPagesParsed))
>>>>     {
>>>>         // parse page resources since we did not do this on start
>>>>         COSDictionary resDict = (COSDictionary)
>>>>             pageDict.getDictionaryObject(COSName.RESOURCES);
>>>>         parseDictObjects(resDict);
>>>>     }
>>>>
>>>>     return new PDPage(pageDict);
>>>> }
>>>>
>>>>
>>>>
>>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> PDF is a random access format with key information (the Cross Reference
>>>>> where to find the objects) being at the end of the file and the PDF
>>>>> objects spread around the file.
>>>>>
>>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>>>>> instead of PDDocument.load and set the system property
>>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which
>>>>> does a minimal parsing of the PDF. That could reduce the memory
>>>>> consumption a little bit. Unfortunately once an object has been parsed
>>>>> its content stays in memory so you would need to do a low level parsing
>>>>> yourself with the information available from the initial parsing stage.
>>>>>
>>>>> Maruan Sahyoun
>>>>>
>>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
>>>>>> according to the following rule set:
>>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>>>>> - There should be no content within a certain rectangular area of a
>>>>>>   page (left margin where the print shop inserts a bar code)
>>>>>> - Number of pages should be less than N
>>>>>> - PDF version used
>>>>>>
>>>>>> So far we've been using
>>>>>>
>>>>>> PDDocument.load with a scratch file, but with huge documents (e.g.
>>>>>> product catalogues), things explode.
>>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>>>>> document (e.g. using StAX) and validate one page at a time?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Stefan
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> BEKK Open
>>>> http://open.bekk.no
>>>>
>>>> TesTcl - a unit test framework for iRules
>>>> http://testcl.com
>>>
>>>
>>
>>
>> --
>> BEKK Open
>> http://open.bekk.no
>>
>> TesTcl - a unit test framework for iRules
>> http://testcl.com
>
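
For the A4 rule in the original question (297 mm * 210 mm), here is a minimal sketch of the size check on its own. It assumes the page width and height have already been read from the page's media box in PDF points (1 point = 1/72 inch) and uses no PDFBox calls. The class name A4Check, the method names, the 1 mm tolerance and the choice to accept landscape as well as portrait are all assumptions for illustration, not something from this thread.

```java
final class A4Check {
    // 1 PDF point = 1/72 inch and 1 inch = 25.4 mm
    private static final double MM_PER_POINT = 25.4 / 72.0;
    private static final double A4_SHORT_MM = 210.0;
    private static final double A4_LONG_MM = 297.0;

    private A4Check() {
    }

    /** Converts a length in PDF points to millimetres. */
    public static double pointsToMm(double points) {
        return points * MM_PER_POINT;
    }

    /**
     * Returns true if the given page size (in points) matches A4 within
     * toleranceMm. Orientation is ignored (assumption): the shorter side is
     * compared against 210 mm and the longer side against 297 mm.
     */
    public static boolean isA4(double widthPt, double heightPt, double toleranceMm) {
        double widthMm = pointsToMm(widthPt);
        double heightMm = pointsToMm(heightPt);
        double shortSide = Math.min(widthMm, heightMm);
        double longSide = Math.max(widthMm, heightMm);
        return Math.abs(shortSide - A4_SHORT_MM) <= toleranceMm
                && Math.abs(longSide - A4_LONG_MM) <= toleranceMm;
    }

    public static void main(String[] args) {
        // 595 x 842 pt is the usual rounded A4 size; 612 x 792 pt is US Letter.
        System.out.println("595 x 842 pt is A4: " + isA4(595.0, 842.0, 1.0));
        System.out.println("612 x 792 pt is A4: " + isA4(612.0, 792.0, 1.0));
    }
}
```

A tolerance is used because producers typically round A4 to whole points, so an exact comparison against 210 and 297 mm would reject most real files.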

