Yes please, file a bug report together with a sample PDF and sample code to reproduce the issue. Which PDFBox version are you using?
BR
Maruan Sahyoun

On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:

> Hi there,
>
> So I tried using the NonSequentialParser, setting the
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property to
> true.
>
> The memory footprint looks much better; however, I can't get the individual
> pages due to an NPE in the getPage code.
>
> It turns out the resDict below is mostly null, which in turn causes an NPE in
> parseDictObjects.
>
> Should I file a bug?
>
> Stefan
>
>
> public PDPage getPage(int pageNr) throws IOException
> {
>     getPagesObject();
>
>     // ---- get list of top level pages
>     COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
>
>     if (kids == null)
>     {
>         throw new IOException("Missing 'Kids' entry in pages dictionary.");
>     }
>
>     // ---- get page we are looking for (possibly going recursively into
>     // subpages)
>     COSObject pageObj = getPageObject(pageNr, kids, 0);
>
>     if (pageObj == null)
>     {
>         throw new IOException("Page " + pageNr + " not found.");
>     }
>
>     // ---- parse all objects necessary to load page.
>     COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>
>     if (parseMinimalCatalog && (!allPagesParsed))
>     {
>         // parse page resources since we did not do this on start
>         COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
>         parseDictObjects(resDict);
>     }
>
>     return new PDPage(pageDict);
> }
>
>
> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
>
>> Hi,
>>
>> PDF is a random access format: the key information (the cross-reference
>> table, which says where to find the objects) sits at the end of the file,
>> and the PDF objects are spread around the file.
>>
>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>> instead of PDDocument.load and set the system property
>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which does
>> a minimal parsing of the PDF. That could reduce the memory consumption a
>> little bit. Unfortunately, once an object has been parsed its content
>> stays in memory, so you would need to do a low level parsing yourself with
>> the information available from the initial parsing stage.
>>
>> Maruan Sahyoun
>>
>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
>>
>>> Hi there,
>>>
>>> I'm trying to validate random PDFs (potentially huge - 100s of MBs)
>>> according to the following rule set:
>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>> - There should be no content within a certain rectangular area of a page
>>> (left margin where the print shop inserts a bar code)
>>> - Number of pages should be less than N
>>> - PDF version used
>>>
>>> So far we've been using
>>>
>>> PDDocument.load with a scratch file, but with huge documents (e.g. product
>>> catalogues), things explode.
>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>> document (e.g. using StAX) and validate one page at a time?
>>>
>>> Cheers
>>>
>>> Stefan
>>
>>
>
>
> --
> BEKK Open
> http://open.bekk.no
>
> TesTcl - a unit test framework for iRules
> http://testcl.com
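
For the A4 rule in the original question, the check itself does not depend on how the document is parsed. Below is a minimal sketch, assuming the page width and height are already available in PDF points (1/72 inch), which is the unit a page's media box is normally expressed in. The class name A4Check, the method names, and the tolerance parameter are made up for illustration and are not part of PDFBox. Accepting landscape pages is also an assumption; the original rule does not say either way.

```java
// Hypothetical helper, not part of PDFBox: checks page dimensions against A4.
final class A4Check {
    // 1 inch = 25.4 mm and 1 PDF point = 1/72 inch.
    private static final double MM_PER_POINT = 25.4 / 72.0;
    private static final double A4_SHORT_MM = 210.0;
    private static final double A4_LONG_MM = 297.0;

    static double pointsToMm(double points) {
        return points * MM_PER_POINT;
    }

    // Returns true if the page is A4 within toleranceMm on both sides.
    // Orientation is ignored (assumption): portrait and landscape both pass.
    static boolean isA4(double widthPt, double heightPt, double toleranceMm) {
        double w = pointsToMm(widthPt);
        double h = pointsToMm(heightPt);
        double shortSide = Math.min(w, h);
        double longSide = Math.max(w, h);
        return Math.abs(shortSide - A4_SHORT_MM) <= toleranceMm
                && Math.abs(longSide - A4_LONG_MM) <= toleranceMm;
    }

    public static void main(String[] args) {
        System.out.println(isA4(595.276, 841.89, 1.0)); // A4 portrait in points
        System.out.println(isA4(612.0, 792.0, 1.0));    // US Letter in points
    }
}
```

The tolerance is there because producers round A4 differently (595 x 842 points is common), so an exact comparison would likely reject valid pages.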

