Hi Maruan,

So I created a small Maven project containing a PDF file I just generated on my Mac, and pushed it to https://github.com/landro/pdfboxbug.

I could create a zip and upload it to your bug tracker, but that feels kinda awkward. What do you prefer?

Stefan

2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:

> Yes please, file a bug report together with a sample PDF and sample code
> to reproduce the issue. Which PDFBox version are you using?
>
> BR
> Maruan Sahyoun
>
> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
>
> > Hi there,
> >
> > So I tried using the NonSequentialParser, setting the
> > org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
> > to true.
> >
> > The memory footprint looks much better, however, I can't get the
> > individual pages due to an NPE in the getPage code.
> >
> > It turns out the resDict below is mostly null - which again causes an NPE
> > in parseDictObjects.
> >
> > Should I file a bug?
> >
> > Stefan
> >
> >
> > public PDPage getPage(int pageNr) throws IOException
> > {
> >     getPagesObject();
> >
> >     // ---- get list of top level pages
> >     COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
> >
> >     if (kids == null)
> >     {
> >         throw new IOException("Missing 'Kids' entry in pages dictionary.");
> >     }
> >
> >     // ---- get page we are looking for (possibly going recursively into
> >     // subpages)
> >     COSObject pageObj = getPageObject(pageNr, kids, 0);
> >
> >     if (pageObj == null)
> >     {
> >         throw new IOException("Page " + pageNr + " not found.");
> >     }
> >
> >     // ---- parse all objects necessary to load page.
> >     COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> >
> >     if (parseMinimalCatalog && (!allPagesParsed))
> >     {
> >         // parse page resources since we did not do this on start
> >         COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
> >         parseDictObjects(resDict);
> >     }
> >
> >     return new PDPage(pageDict);
> > }
> >
> >
> >
> > 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
> >
> >> Hi,
> >>
> >> PDF is a random access format with key information (the Cross Reference
> >> where to find the objects) being at the end of the file and the PDF
> >> objects spread around the file.
> >>
> >> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> >> instead of PDDocument.load and set the system property
> >> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which
> >> does a minimal parsing of the PDF. That could reduce the memory
> >> consumption a little bit. Unfortunately once an object has been parsed
> >> its content stays in memory so you would need to do a low level parsing
> >> yourself with the information available from the initial parsing stage.
> >>
> >> Maruan Sahyoun
> >>
> >> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
> >>
> >>> Hi there,
> >>>
> >>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> >>> according to the following rule set:
> >>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> >>> - There should be no content within a certain rectangular area of a
> >>>   page (left margin where the print shop inserts a bar code)
> >>> - Number of pages should be less than N
> >>> - PDF version used
> >>>
> >>> So far we've been using PDDocument.load with a scratch file, but with
> >>> huge documents (e.g. product catalogues), things explode.
> >>> Is there a way to stream parse a PDF similar to stream parsing an XML
> >>> document (e.g. using StAX) and validate one page at a time?
> >>>
> >>> Cheers
> >>>
> >>> Stefan
> >>
> >>
> >
> >
> > --
> > BEKK Open
> > http://open.bekk.no
> >
> > TesTcl - a unit test framework for iRules
> > http://testcl.com
>

--
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com
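
For the A4 rule in the original question (297 mm * 210 mm), the size comparison itself needs no parser: PDF page boxes are expressed in points (1/72 inch), so the check reduces to a unit conversion plus a tolerance. The following is only a rough sketch under assumptions that are not from the thread: the class name A4Check, the 1 pt tolerance, and accepting both portrait and landscape are all hypothetical choices, and page rotation is ignored. The width and height would presumably come from each page's media box.

```java
/** Hypothetical helper: checks a page size given in PDF points (1/72 inch). */
final class A4Check {
    // 1 inch = 25.4 mm = 72 pt, so 1 mm = 72 / 25.4 pt.
    static final double POINTS_PER_MM = 72.0 / 25.4;
    static final double A4_SHORT_PT = 210.0 * POINTS_PER_MM; // about 595.28 pt
    static final double A4_LONG_PT = 297.0 * POINTS_PER_MM;  // about 841.89 pt

    /**
     * Returns true if the given size matches A4 within tolerancePt.
     * Assumption: portrait and landscape are both accepted.
     */
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double shortSide = Math.min(widthPt, heightPt);
        double longSide = Math.max(widthPt, heightPt);
        return Math.abs(shortSide - A4_SHORT_PT) <= tolerancePt
            && Math.abs(longSide - A4_LONG_PT) <= tolerancePt;
    }

    public static void main(String[] args) {
        System.out.println(isA4(595.0, 842.0, 1.0)); // rounded A4 box, prints true
        System.out.println(isA4(612.0, 792.0, 1.0)); // US Letter, prints false
    }
}
```

A tolerance is used because many producers write rounded values such as 595 x 842 instead of the exact conversion; how strict it should be depends on what the print shop accepts.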

