Here it is: https://issues.apache.org/jira/browse/PDFBOX-1965
Thanks

Stefan

2014-03-07 12:47 GMT+01:00 Maruan Sahyoun <[email protected]>:

> Hi Stefan,
>
> unfortunately this seems to be a bug. When the parseMinimal property is
> set to true, indirect objects are not followed when the PDF is parsed. May I
> ask you to file an issue in Jira [https://issues.apache.org/jira/browse/PDFBOX/]
> and attach the PDF file in question?
>
> BR
> Maruan Sahyoun
>
> On 07.03.2014 at 07:11, Maruan Sahyoun <[email protected]> wrote:
>
> > Hi Stefan,
> >
> > just fine. If I need more information I’ll let you know.
> >
> > BR
> > Maruan Sahyoun
> >
> > On 06.03.2014 at 23:53, Stefan Magnus Landrø <[email protected]> wrote:
> >
> >> Hi Maruan,
> >>
> >> So I created a small Maven project containing a PDF file I just generated
> >> on my Mac, and pushed it to https://github.com/landro/pdfboxbug
> >> I could create a zip and upload it to your bug tracker, but that feels kinda
> >> awkward.
> >> What do you prefer?
> >>
> >> Stefan
> >>
> >>
> >> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:
> >>
> >>> Yes please, file a bug report together with a sample PDF and sample code
> >>> to reproduce the issue. Which PDFBox version are you using?
> >>>
> >>> BR
> >>> Maruan Sahyoun
> >>>
> >>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
> >>>
> >>>> Hi there,
> >>>>
> >>>> So I tried using the NonSequentialParser, setting the
> >>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
> >>>> to true.
> >>>>
> >>>> The memory footprint looks much better; however, I can't get the individual
> >>>> pages due to an NPE in the getPage code.
> >>>>
> >>>> It turns out the resDict below is mostly null, which in turn causes an NPE
> >>>> in parseDictObjects.
> >>>>
> >>>> Should I file a bug?
> >>>>
> >>>> Stefan
> >>>>
> >>>>
> >>>> public PDPage getPage(int pageNr) throws IOException
> >>>> {
> >>>>     getPagesObject();
> >>>>
> >>>>     // ---- get list of top level pages
> >>>>     COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
> >>>>
> >>>>     if (kids == null)
> >>>>     {
> >>>>         throw new IOException("Missing 'Kids' entry in pages dictionary.");
> >>>>     }
> >>>>
> >>>>     // ---- get page we are looking for (possibly going recursively into
> >>>>     // subpages)
> >>>>     COSObject pageObj = getPageObject(pageNr, kids, 0);
> >>>>
> >>>>     if (pageObj == null)
> >>>>     {
> >>>>         throw new IOException("Page " + pageNr + " not found.");
> >>>>     }
> >>>>
> >>>>     // ---- parse all objects necessary to load page.
> >>>>     COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> >>>>
> >>>>     if (parseMinimalCatalog && (!allPagesParsed))
> >>>>     {
> >>>>         // parse page resources since we did not do this on start
> >>>>         COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
> >>>>         parseDictObjects(resDict);
> >>>>     }
> >>>>
> >>>>     return new PDPage(pageDict);
> >>>> }
> >>>>
> >>>>
> >>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> PDF is a random access format with key information (the cross-reference,
> >>>>> which tells where to find the objects) at the end of the file and the PDF
> >>>>> objects spread around the file.
> >>>>>
> >>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> >>>>> instead of PDDocument.load and set the system property
> >>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which does
> >>>>> a minimal parsing of the PDF. That could reduce the memory consumption a
> >>>>> little bit. Unfortunately, once an object has been parsed its content
> >>>>> stays in memory, so you would need to do low-level parsing yourself with
> >>>>> the information available from the initial parsing stage.
> >>>>>
> >>>>> Maruan Sahyoun
> >>>>>
> >>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
> >>>>>
> >>>>>> Hi there,
> >>>>>>
> >>>>>> I'm trying to validate random PDFs (potentially huge - 100s of MBs)
> >>>>>> according to the following rule set:
> >>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> >>>>>> - There should be no content within a certain rectangular area of a page
> >>>>>>   (left margin where the print shop inserts a bar code)
> >>>>>> - Number of pages should be less than N
> >>>>>> - PDF version used
> >>>>>>
> >>>>>> So far we've been using PDDocument.load with a scratch file, but with huge
> >>>>>> documents (e.g. product catalogues), things explode.
> >>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
> >>>>>> document (e.g. using StAX) and validate one page at a time?
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> Stefan
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> BEKK Open
> >>>> http://open.bekk.no
> >>>>
> >>>> TesTcl - a unit test framework for iRules
> >>>> http://testcl.com
> >>>
> >>>
> >>
> >>
> >> --
> >> BEKK Open
> >> http://open.bekk.no
> >>
> >> TesTcl - a unit test framework for iRules
> >> http://testcl.com
> >
>

--
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com
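
For the first rule in the original question (all pages should be A4, 297 mm * 210 mm), the size comparison itself might look roughly like the sketch below. It is not PDFBox code: the class name A4Check, the method names and the tolerance parameter are made up for illustration. It assumes the page width and height have already been read from the page's media box in default PDF user-space units (1/72 inch), that landscape pages should also count as A4, and that page rotation and any UserUnit scaling are ignored.

```java
// Hypothetical helper, not part of PDFBox: checks page dimensions against A4.
final class A4Check {
    // Default PDF user space has 72 units per inch; one inch is 25.4 mm.
    static final double POINTS_PER_MM = 72.0 / 25.4;
    static final double A4_SHORT_MM = 210.0;
    static final double A4_LONG_MM = 297.0;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    // Returns true if the given width/height (in points) match A4 within
    // tolerancePt. Orientation is ignored (assumption): the shorter side is
    // compared with 210 mm and the longer side with 297 mm.
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double shortSide = Math.min(widthPt, heightPt);
        double longSide = Math.max(widthPt, heightPt);
        return Math.abs(shortSide - mmToPoints(A4_SHORT_MM)) <= tolerancePt
            && Math.abs(longSide - mmToPoints(A4_LONG_MM)) <= tolerancePt;
    }

    public static void main(String[] args) {
        System.out.println(isA4(595.28, 841.89, 1.0)); // A4 portrait in points
        System.out.println(isA4(612.0, 792.0, 1.0));   // US Letter in points
    }
}
```

A tolerance of about one point is only a guess at what a print shop would accept; producers often round A4 to 595 x 842, so an exact comparison would probably reject valid files.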

