Hi Stefan, just fine. If I need more information I’ll let you know.
BR
Maruan Sahyoun

On 06.03.2014 at 23:53, Stefan Magnus Landrø <[email protected]> wrote:

> Hi Maruan,
>
> So I created a small Maven project containing a PDF file I just generated
> on my Mac, and pushed it to https://github.com/landro/pdfboxbug
> I could create a zip and upload it to your bugtracker, but that feels kinda
> awkward.
> What do you prefer?
>
> Stefan
>
>
> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <[email protected]>:
>
>> Yes please, file a bug report together with a sample PDF and sample code
>> to reproduce the issue. Which PDFBox version are you using?
>>
>> BR
>> Maruan Sahyoun
>>
>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <[email protected]> wrote:
>>
>>> Hi there,
>>>
>>> So I tried using the NonSequentialParser, setting the
>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
>>> to true.
>>>
>>> The memory footprint looks much better; however, I can't get the
>>> individual pages due to an NPE in the getPage code.
>>>
>>> It turns out the resDict below is mostly null - which again causes an NPE
>>> in parseDictObjects.
>>>
>>> Should I file a bug?
>>>
>>> Stefan
>>>
>>>
>>> public PDPage getPage(int pageNr) throws IOException
>>> {
>>>     getPagesObject();
>>>
>>>     // ---- get list of top level pages
>>>     COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
>>>
>>>     if (kids == null)
>>>     {
>>>         throw new IOException("Missing 'Kids' entry in pages dictionary.");
>>>     }
>>>
>>>     // ---- get page we are looking for (possibly going recursively into
>>>     // subpages)
>>>     COSObject pageObj = getPageObject(pageNr, kids, 0);
>>>
>>>     if (pageObj == null)
>>>     {
>>>         throw new IOException("Page " + pageNr + " not found.");
>>>     }
>>>
>>>     // ---- parse all objects necessary to load page.
>>>     COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>>>
>>>     if (parseMinimalCatalog && (!allPagesParsed))
>>>     {
>>>         // parse page resources since we did not do this on start
>>>         COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
>>>         parseDictObjects(resDict);
>>>     }
>>>
>>>     return new PDPage(pageDict);
>>> }
>>>
>>>
>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> PDF is a random access format with key information (the Cross Reference
>>>> where to find the objects) being at the end of the file and the PDF
>>>> objects spread around the file.
>>>>
>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>>>> instead of PDDocument.load and set the system property
>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which
>>>> does a minimal parsing of the PDF. That could reduce the memory
>>>> consumption a little bit. Unfortunately, once an object has been parsed
>>>> its content stays in memory, so you would need to do a low-level parsing
>>>> yourself with the information available from the initial parsing stage.
>>>>
>>>> Maruan Sahyoun
>>>>
>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm trying to validate random PDFs (potentially huge - 100s of MBs)
>>>>> according to the following rule set:
>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>>>> - There should be no content within a certain rectangular area of a
>>>>>   page (left margin where the print shop inserts a bar code)
>>>>> - Number of pages should be less than N
>>>>> - PDF version used
>>>>>
>>>>> So far we've been using PDDocument.load with a scratch file, but with
>>>>> huge documents (e.g. product catalogues), things explode.
>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>>>> document (e.g. using StAX) and validate one page at a time?
>>>>>
>>>>> Cheers
>>>>>
>>>>> Stefan
>>>
>>> --
>>> BEKK Open
>>> http://open.bekk.no
>>>
>>> TesTcl - a unit test framework for iRules
>>> http://testcl.com
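
Regarding the first rule in the original question (all pages should be A4, 297 mm * 210 mm): the size comparison itself does not depend on how the page is loaded. Below is a minimal sketch of that check, not anything taken from PDFBox. It assumes the page width and height are already available in PDF user-space units at the default of 1/72 inch per unit, however the media box was obtained. The class name A4Check, the helper names, the tolerance parameter, and the decision to accept both portrait and landscape are all assumptions made for illustration.

```java
// Hypothetical helper, not part of PDFBox: checks whether a page size given
// in PDF points (1/72 inch) matches A4 within a caller-chosen tolerance.
final class A4Check {
    // 1 inch = 25.4 mm and 1 inch = 72 points at the default user unit.
    static final double POINTS_PER_MM = 72.0 / 25.4;
    static final double A4_SHORT_MM = 210.0;
    static final double A4_LONG_MM = 297.0;

    private A4Check() {
    }

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    // Assumption: orientation is ignored, so landscape A4 also passes.
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double shortSide = Math.min(widthPt, heightPt);
        double longSide = Math.max(widthPt, heightPt);
        return Math.abs(shortSide - mmToPoints(A4_SHORT_MM)) <= tolerancePt
                && Math.abs(longSide - mmToPoints(A4_LONG_MM)) <= tolerancePt;
    }
}
```

A call such as A4Check.isA4(width, height, 1.0) would accept small rounding differences from the producing application; how much tolerance is acceptable is a decision for the print shop, not something the thread specifies.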

