Re: Memory use for large PDFs?

Adam Retter Sat, 03 Oct 2015 08:44:04 -0700

Sorry for the delay, I was away for a week.

> a) use a scratch file PDDocument.load(File file, boolean useScratchFiles)


I could not find a load method that takes a boolean parameter to
indicate whether to use scratch files. However, if I use
PDDocument#load(File file, RandomAccess scratchFile) and specify a
scratch file, then an exception is logged for every page I process.
The exception itself doesn't seem to cause any problem, as the
resulting PDF looks correct, but it is disconcerting. The stack trace
for the exception looks like:

[error] Oct 03, 2015 11:10:50 AM org.apache.pdfbox.pdmodel.font.PDFont parseCmap
[error] SEVERE: An error occurs while reading a CMap
[error] java.io.IOException: Error: expected the end of a dictionary.
[error] at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:432)
[error] at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:119)
[error] at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:626)
[error] at org.apache.pdfbox.pdmodel.font.PDSimpleFont.extractToUnicodeEncoding(PDSimpleFont.java:457)
[error] at org.apache.pdfbox.pdmodel.font.PDSimpleFont.determineEncoding(PDSimpleFont.java:411)
[error] at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:214)
[error] at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:89)
[error] at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:67)
[error] at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
[error] at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:213)
[error] at org.apache.pdfbox.pdmodel.PDResources.addFont(PDResources.java:586)
[error] at org.apache.pdfbox.pdmodel.edit.PDPageContentStream.setFont(PDPageContentStream.java:321)



> b) don't use doc.getDocumentCatalog.getAllPages() as this fetches all pages 
> from the document but use PDDocumentCatalog.getPages() which only gives you 
> the root into the page tree (drawback is that you need to do the iteration 
> yourself). That has been enhanced in PDFBox 2.0.0 which also has an improved 
> resource handling.
>

I am just wondering how I do the iteration? Are there any examples?

If I use PDDocumentCatalog#getPages() then I get a PDPageNode, but
from there it looks like I have to call PDPageNode#getKids(), which
then just gives me a list of all pages, so I cannot see how this
would be any more efficient. Can someone explain?
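
To make the question concrete, this is roughly what I imagine "doing
the iteration yourself" means. It is only a sketch with stand-in
classes: PageTreeSketch, Node, page(), branch() and pages() are names
I made up and are not PDFBox types. It assumes that a node's kids are
only its direct children, each of which is either a page or another
intermediate node. Whether PDPageNode#getKids() really behaves like
that is exactly what I am unsure about.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class PageTreeSketch {

    /** Hypothetical stand-in for a page-tree entry; not a PDFBox class. */
    public static class Node {
        final String label;     // page label, null for intermediate nodes
        final List<Node> kids;  // direct children only, empty for pages

        Node(String label, List<Node> kids) {
            this.label = label;
            this.kids = kids;
        }

        boolean isPage() {
            return label != null;
        }
    }

    public static Node page(String label) {
        return new Node(label, Collections.<Node>emptyList());
    }

    public static Node branch(Node... kids) {
        return new Node(null, Arrays.asList(kids));
    }

    /** Walks the tree depth-first without building a flat list of all pages. */
    public static Iterable<String> pages(final Node root) {
        return new Iterable<String>() {
            public Iterator<String> iterator() {
                return new PageIterator(root);
            }
        };
    }

    static class PageIterator implements Iterator<String> {
        // One child iterator per tree level currently being visited.
        private final Deque<Iterator<Node>> stack = new ArrayDeque<Iterator<Node>>();
        private String pending;

        PageIterator(Node root) {
            stack.push(Collections.singletonList(root).iterator());
            advance();
        }

        // Moves to the next page, descending into intermediate nodes as needed.
        private void advance() {
            pending = null;
            while (pending == null && !stack.isEmpty()) {
                Iterator<Node> top = stack.peek();
                if (!top.hasNext()) {
                    stack.pop();
                    continue;
                }
                Node n = top.next();
                if (n.isPage()) {
                    pending = n.label;
                } else {
                    stack.push(n.kids.iterator());
                }
            }
        }

        public boolean hasNext() {
            return pending != null;
        }

        public String next() {
            if (pending == null) {
                throw new NoSuchElementException();
            }
            String result = pending;
            advance();
            return result;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    public static void main(String[] args) {
        Node root = branch(page("p1"), branch(page("p2"), page("p3")), page("p4"));
        for (String label : pages(root)) {
            System.out.println(label); // prints p1, p2, p3, p4 on separate lines
        }
    }
}
```

If the real tree works like this, then only one child iterator per
tree level is held at a time, which I guess is where the saving would
come from.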

Also I see that PDFBox 2.0.0 is not yet released but does have an
iterator interface on PDPageTree. Is it already stable/reliable enough
to use?


-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
