Re: Stream parsing huge PDF document in order to prevent memory issues

Maruan Sahyoun Fri, 14 Feb 2014 01:36:08 -0800

Hi,

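PDF is a random-access format: the key information (the cross-reference table, 
which records where to find each object) sits at the end of the file, and the 
PDF objects themselves are spread throughout the file. That is why a PDF cannot 
simply be stream parsed from front to back the way an XML document can.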
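
To illustrate why a reader has to start at the end of the file, here is a 
minimal sketch in plain Java (no PDFBox) that reads only the tail of a PDF and 
extracts the byte offset following the "startxref" keyword, which points at the 
cross-reference section. The class name StartXrefFinder, the method name 
findStartXref and the 1024-byte tail window are assumptions made for this 
example; this is not how PDFBox itself is implemented.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class StartXrefFinder {

    /**
     * Returns the byte offset written after the last "startxref" keyword,
     * or -1 if the keyword or its number is not found in the file tail.
     */
    static long findStartXref(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // Read only the end of the file; 1024 bytes is an assumed window size.
            int tailLength = (int) Math.min(1024, file.length());
            byte[] tail = new byte[tailLength];
            file.seek(file.length() - tailLength);
            file.readFully(tail);

            // ISO-8859-1 maps every byte to exactly one char.
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int keyword = text.lastIndexOf("startxref");
            if (keyword < 0) {
                return -1;
            }

            // Skip whitespace after the keyword, then collect the digits.
            int i = keyword + "startxref".length();
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
                i++;
            }
            return i > start ? Long.parseLong(text.substring(start, i)) : -1;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartXrefFinder <file.pdf>");
            return;
        }
        System.out.println(findStartXref(args[0]));
    }
}
```

Only the last kilobyte is read no matter how large the document is, but every 
later step (reading the cross-reference section, then the objects it points to) 
requires seeking to arbitrary positions, which a forward-only stream cannot do.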


You can use the NonSequentialPDFParser by calling PDDocument.loadNonSeq instead 
of PDDocument.load and setting the system property 
org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which makes 
the parser do only a minimal parse of the PDF. That could reduce memory 
consumption a little. Unfortunately, once an object has been parsed its content 
stays in memory, so to go further you would need to do the low-level parsing 
yourself, using the information available from the initial parsing stage.

Maruan Sahyoun

On 14.02.2014 at 09:50, Stefan Magnus Landrø <[email protected]> wrote:

> Hi there,
> 
> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> according to the following rule set:
> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> - There should be no content within a certain rectangular area of a page
> (left margin where the print shop inserts a bar code)
> - Number of pages should be less than N
> - PDF version used
> 
> So far we've been using PDDocument.load with a scratch file, but with huge
> documents (e.g. product catalogues), things explode.
> Is there a way to stream parse a PDF similar to stream parsing an XML
> document (e.g. using StAX) and validate one page at a time?
> 
> Cheers
> 
> Stefan
