RE: PDFBox 1.8.4 and pdf's generated by MS Word

Tim Costermans Mon, 31 Mar 2014 03:58:26 -0700

Hello,

I've written a test case to reproduce the issue. (see patch)


Could someone have a look at it and give me some pointers on how to solve this 
issue? I applied this patch on the 1.8.4 tag I checked out locally.
The issue is that I don't know the pdf spec, so I don't know how to fix this 
issue in the PDFBOX source code.

Word2010.pdf is the input pdf. I open the document with PDFBOX, add a string to 
the pdf (in this case 'Hello world!'), and then save it.

If I look at the content of the pdf before and after I modified it (using 
Notepad++) I see this:

Word2010.pdf:
Line 647: <</Size 18/Root 1 0 R/Info 7 0 
R/ID[<AE9AF29D5A22AE47B47C4DA29170BE64><AE9AF29D5A22AE47B47C4DA29170BE64>] 
/Prev 81972/XRefStm 81702>>

modified_Word2010.pdf:
Line 791: /XRefStm 81702

XRefStm is not updated although the original pdf had multiple revisions that 
were merged into a new pdf document.
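
The manual Notepad++ check can be approximated in code. The following is only
a rough sketch using the plain JDK, not PDFBox: the class XRefStmCheck and its
method staleOffsets are hypothetical names made up for illustration. It scans
the raw bytes for every /XRefStm entry and reports the offsets that do not
point at an indirect object header of the form "N G obj". It assumes that a
plain text scan is good enough, so a chance match inside a binary stream
would confuse it, and it does not verify that the object found is really a
cross-reference stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: checks /XRefStm offsets in raw PDF bytes. */
final class XRefStmCheck {

    // Matches "/XRefStm <offset>" as it appears in a trailer dictionary.
    private static final Pattern XREF_STM = Pattern.compile("/XRefStm\\s+(\\d+)");
    // An indirect object header looks like "12 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    /** Returns every /XRefStm offset that does NOT point at an "N G obj" header. */
    static List<Long> staleOffsets(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Long> stale = new ArrayList<>();
        Matcher m = XREF_STM.matcher(text);
        while (m.find()) {
            long offset = Long.parseLong(m.group(1));
            boolean pointsAtObject = offset < text.length()
                    && OBJ_HEADER.matcher(text).region((int) offset, text.length()).lookingAt();
            if (!pointsAtObject) {
                stale.add(offset);
            }
        }
        return stale;
    }
}
```

Calling XRefStmCheck.staleOffsets(Files.readAllBytes(Paths.get("modified_Word2010.pdf")))
would be one way to run it. A non-empty result suggests that the file still
carries an /XRefStm offset from before the save.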

A third party library we use depends on this XRefStm value and cannot open the 
pdf after it was modified. (For the stack trace, see my previous message.)
Any help would be much appreciated.

Kind regards,

Tim Costermans

From: Tim Costermans
Sent: Wednesday 26 March 2014 14:31
To: '[email protected]'
Subject: PDFBox 1.8.4 and pdf's generated by MS Word

Hello,

It seems that pdf's generated by MS Word 2010 or 2013 are a recipe for trouble 
in combination with PDFBOX version 1.8.0 or 1.8.4.
I upgraded to PDFBOX 1.8.4 and one issue remains:
Caused by: **thirdparty.pdf.exceptions.PDFParsingException: [offset=91308]Expected numeric object for object number
        at **thirdparty.pdf.exceptions.PDFParsingException.newInstance(PDFParsingException.java:58)
        at **thirdparty.pdf.io.PDFParser.throwEx(PDFParser.java:1215)
        at **thirdparty.pdf.io.PDFParser.readCompressedCrossRefTable(PDFParser.java:805)
        at **thirdparty.pdf.io.PDFParser.readCrossRefTable(PDFParser.java:1175)
        at **thirdparty.pdf.PDFDocument.open(PDFDocument.java:154)
        at **thirdparty.PDFDocument.open(PDFDocument.java:124)
        at com.*****.sign.pdf.PDFPresigner.presign(PDFPresigner.java:24)
        ... 26 more

How to reproduce:
1) Fire up MS Word 2010, type some text, save as PDF.
2) Open this pdf file with Notepad++, you will notice the following at the 
bottom of the file:
...
trailer
<</Size 18/Root 1 0 R/Info 7 0 
R/ID[<7AE435CBC968B94F8B050F40F6D5CE5F><7AE435CBC968B94F8B050F40F6D5CE5F>] >>
startxref
82089
%%EOF
xref
0 0
trailer
<</Size 18/Root 1 0 R/Info 7 0 
R/ID[<7AE435CBC968B94F8B050F40F6D5CE5F><7AE435CBC968B94F8B050F40F6D5CE5F>] 
/Prev 82089/XRefStm 81819>>
startxref
82605
%%EOF
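
The two startxref/%%EOF pairs in the dump above can also be listed without a
text editor. This is a minimal sketch using only the JDK: RevisionLister and
startXrefOffsets are hypothetical names for illustration, not PDFBox API. It
assumes that each revision ends with "startxref", an offset and "%%EOF"
separated by whitespace only, as in the dump.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: lists PDF revisions by a raw text scan. */
final class RevisionLister {

    // Each revision is assumed to end with "startxref <offset> %%EOF".
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d+)\\s+%%EOF");

    /** Returns the startxref offset of each revision, in file order. */
    static List<Long> startXrefOffsets(byte[] pdf) {
        // ISO-8859-1 decodes every byte to one char, so binary content cannot break decoding.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Long> offsets = new ArrayList<>();
        Matcher m = START_XREF.matcher(text);
        while (m.find()) {
            offsets.add(Long.parseLong(m.group(1)));
        }
        return offsets;
    }
}
```

For the file shown above this should report two entries, 82089 and 82605. The
same scan can be run on the saved file to see how many revisions are left
after the save.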

Our application is trying to add an image to this pdf using PDFBox. When 
calling PDDocument.save(), the "revisions" are merged and a new pdf is 
created.
The newly created pdf is passed to a third party library that tries to open it, 
but it fails because XRefStm is not correctly updated during save.
Probably related to https://issues.apache.org/jira/browse/PDFBOX-1822

I also tried PDDocument.saveIncremental(), but then I get a 
NullPointerException caused by PDFXRefStream: List<Integer> indexEntry = 
getIndexEntry(); contains two null objects (first and last are still null 
when they are added to the list).
How do I solve this issue?
What's the real issue here?
I'm not in control of the pdf's that the application can receive.

I also ran into the following bug, but worked around it: 
https://issues.apache.org/jira/browse/PDFBOX-1838
