Hi,

PDF is a random-access format: the key information (the cross-reference table, which says where to find each object) is at the end of the file, and the PDF objects themselves are spread around the file.
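To illustrate the point about the end of the file, here is a minimal sketch in plain Java (no PDFBox) that reads the tail of a file and extracts the byte offset written after the last startxref keyword. A parser has to jump to that offset before it can locate any object. The class name StartXrefPeek, the method findStartXref and the 1024-byte tail window are my own assumptions for illustration, not something PDFBox provides.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class StartXrefPeek {

    /**
     * Returns the byte offset that follows the last "startxref" keyword,
     * or -1 if no such keyword (or no number after it) is found.
     */
    public static long findStartXref(RandomAccessFile file) throws IOException {
        // Assumption: the keyword sits within the last 1024 bytes of the file.
        int tailLength = (int) Math.min(1024, file.length());
        byte[] tail = new byte[tailLength];
        file.seek(file.length() - tailLength);
        file.readFully(tail);

        // ISO-8859-1 maps each byte to exactly one char, so indices stay aligned.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }

        // Skip the whitespace between the keyword and the number.
        int pos = keyword + "startxref".length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }

        // Collect the ASCII digits of the offset.
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        return pos > start ? Long.parseLong(text.substring(start, pos)) : -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java StartXrefPeek <file.pdf>");
            return;
        }
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            System.out.println("cross-reference data starts at byte " + findStartXref(file));
        }
    }
}
```

Run it as java StartXrefPeek followed by the path of any PDF. Nothing before that offset can be resolved reliably until the cross-reference data has been read, which is why a StAX-style forward-only pass does not map onto PDF.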
You can use the NonSequentialPDFParser by calling PDDocument.loadNonSeq instead of PDDocument.load, and set the system property org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which makes it do only a minimal parse of the PDF. That could reduce memory consumption a little.

Unfortunately, once an object has been parsed its content stays in memory, so you would need to do the low-level parsing yourself, using the information available from the initial parsing stage.

Maruan Sahyoun

On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:

> Hi there,
>
> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> according to the following rule set:
> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> - There should be no content within a certain rectangular area of a page
>   (left margin where the print shop inserts a bar code)
> - Number of pages should be less than N
> - PDF version used
>
> So far we've been using
>
> PDDocument.load with a scratch file, but with huge documents (e.g. product
> catalogues), things explode.
> Is there a way to stream parse a PDF similar to stream parsing an XML
> document (e.g. using StAX) and validate one page at a time?
>
> Cheers
>
> Stefan

