Hi there,
I tried the NonSequentialParser with the system property
org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal set to
true.
The memory footprint looks much better; however, I can't get the individual
pages because of an NPE in the getPage code.
It turns out that resDict (below) is null in most cases, which in turn
causes an NPE in parseDictObjects.
Should I file a bug?
Stefan
public PDPage getPage(int pageNr) throws IOException
{
    getPagesObject();
    // ---- get list of top level pages
    COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
    if (kids == null)
    {
        throw new IOException("Missing 'Kids' entry in pages dictionary.");
    }
    // ---- get page we are looking for (possibly going recursively into
    // subpages)
    COSObject pageObj = getPageObject(pageNr, kids, 0);
    if (pageObj == null)
    {
        throw new IOException("Page " + pageNr + " not found.");
    }
    // ---- parse all objects necessary to load page.
    COSDictionary pageDict = (COSDictionary) pageObj.getObject();
    if (parseMinimalCatalog && (!allPagesParsed))
    {
        // parse page resources since we did not do this on start
        COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
        parseDictObjects(resDict);
    }
    return new PDPage(pageDict);
}
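
One possible explanation, offered as a guess rather than a diagnosis: in PDF
the Resources entry is inheritable, so a page dictionary may carry no
Resources of its own and rely on an ancestor Pages node instead. The sketch
below illustrates a null-safe lookup that walks up the Parent chain. It
deliberately uses plain java.util.Map objects as stand-ins for COSDictionary,
and the names ResourceLookup, findResources and MAX_DEPTH are hypothetical;
none of this is PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

class ResourceLookup {
    static final String RESOURCES = "Resources";
    static final String PARENT = "Parent";
    // Hypothetical limit so a malformed, cyclic Parent chain cannot loop forever.
    static final int MAX_DEPTH = 256;

    /**
     * Returns the first Resources dictionary found on the node or one of its
     * ancestors, or null if there is none. Maps stand in for COSDictionary.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> findResources(Map<String, Object> node) {
        for (int depth = 0; node != null && depth < MAX_DEPTH; depth++) {
            Object res = node.get(RESOURCES);
            if (res instanceof Map) {
                return (Map<String, Object>) res;
            }
            Object parent = node.get(PARENT);
            node = (parent instanceof Map) ? (Map<String, Object>) parent : null;
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> inherited = new HashMap<>();
        inherited.put("Font", "F1");

        Map<String, Object> pages = new HashMap<>();
        pages.put(RESOURCES, inherited);

        // The page itself has no Resources entry, only a Parent.
        Map<String, Object> page = new HashMap<>();
        page.put(PARENT, pages);

        Map<String, Object> res = findResources(page);
        if (res != null) {
            // Only at this point would something like parseDictObjects be called.
            System.out.println("would parse: " + res);
        } else {
            System.out.println("no resources, nothing to parse");
        }
    }
}
```

The point of the sketch is only the guard: the parsing step is skipped when
no resource dictionary is found, instead of being handed a null.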
2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <[email protected]>:
> Hi,
>
> PDF is a random access format: the key information (the cross reference,
> which says where to find the objects) is at the end of the file, and the
> PDF objects are spread around the file.
>
> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> instead of PDDocument.load and set the system property
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which does
> a minimal parsing of the PDF. That could reduce the memory consumption a
> little bit. Unfortunately, once an object has been parsed, its content
> stays in memory so you would need to do a low level parsing yourself with
> the information available from the initial parsing stage.
>
> Maruan Sahyoun
>
> On 14.02.2014 at 09:50, Stefan Magnus Landrø <
> [email protected]> wrote:
>
> > Hi there,
> >
> > I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> > according to the following rule set:
> > - Dimensions of all pages should be A4 (297 mm * 210 mm)
> > - There should be no content within a certain rectangular area of a page
> > (left margin where the print shop inserts a bar code)
> > - Number of pages should be less than N
> > - PDF version used
> >
> > So far we've been using
> >
> > PDDocument.load with a scratch file, but with huge documents (e.g. product
> > catalogues), things explode.
> > Is there a way to stream parse a PDF similar to stream parsing an XML
> > document (e.g. using StAX) and validate one page at a time?
> >
> > Cheers
> >
> > Stefan
>
>
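
For the A4 rule quoted above, the size check itself is simple arithmetic once
a page's width and height are known. A minimal sketch, assuming the
dimensions come from the page's MediaBox in default user space units
(1/72 inch), and assuming that either orientation and a small tolerance are
acceptable. The names A4Check, mmToPoints and isA4 are made up for
illustration.

```java
class A4Check {
    // Default PDF user space: 72 units per inch, 25.4 mm per inch.
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /**
     * True if the given size (in points) matches 210 mm x 297 mm within
     * tolerancePt, in either portrait or landscape orientation.
     */
    static boolean isA4(double widthPt, double heightPt, double tolerancePt) {
        double shortSide = Math.min(widthPt, heightPt);
        double longSide = Math.max(widthPt, heightPt);
        return Math.abs(shortSide - mmToPoints(210)) <= tolerancePt
            && Math.abs(longSide - mmToPoints(297)) <= tolerancePt;
    }

    public static void main(String[] args) {
        System.out.println(isA4(595, 842, 1.0)); // common rounded A4 size
        System.out.println(isA4(612, 792, 1.0)); // US Letter
    }
}
```

If portrait orientation is mandatory, compare width and height directly
instead of sorting the two sides first.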
--
BEKK Open
http://open.bekk.no
TesTcl - a unit test framework for iRules
http://testcl.com