RE: PDFBox 1.8.4 and pdf's generated by MS Word

Tim Costermans Mon, 31 Mar 2014 08:01:39 -0700

Hi Maruan,

Thanks for pointing out that the attachments didn't get through.
Two PDF files and one patch file (containing a test case to reproduce the
issue) are available here: https://www.dropbox.com/sh/291b24dstixowgt/aQTZl5j_pP


Kind regards,
Tim

-----Original Message-----
From: Maruan Sahyoun [mailto:[email protected]] 
Sent: Monday, 31 March 2014 16:47
To: [email protected]
Subject: Re: PDFBox 1.8.4 and pdf's generated by MS Word

Hi Tim,

the attachment didn't make it through - could you upload it to a public 
location?

BR

Maruan

On 31.03.2014 at 12:56, Tim Costermans <[email protected]> wrote:

> Hello,
>  
> I've written a test case to reproduce the issue (see patch).
> 
> Could someone have a look at it and give me some pointers on how to solve
> this issue? I applied this patch to the 1.8.4 tag I checked out locally.
> The problem is that I don't know the PDF spec, so I don't know how to fix
> this in the PDFBox source code.
>  
> Word2010.pdf is the input PDF. I open the document with PDFBox and add a
> string to it, in this case 'Hello world!'.
> Afterwards I save the PDF.
>  
> If I compare the content of the PDF before and after the modification
> (using Notepad++), I see this:
>  
> Word2010.pdf:
> Line 647: <</Size 18/Root 1 0 R/Info 7 0 R/ID[<AE9AF29D5A22AE47B47C4DA29170BE64><AE9AF29D5A22AE47B47C4DA29170BE64>] /Prev 81972/XRefStm 81702>>
>  
> modified_Word2010.pdf:
> Line 791: /XRefStm 81702
>  
> XRefStm is not updated, although the original PDF had multiple revisions
> that were merged into a new PDF document.
>  
> A third-party library we use depends on this XRefStm value and cannot
> open the PDF after it was modified (for the stack trace, see my previous
> message). Any help would be much appreciated.
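
One rough way to check whether a saved file still carries a stale /XRefStm value like the one described above is to test whether each offset points at the start of an indirect object ("<number> <generation> obj"), since a cross-reference stream is itself an indirect object. A parser that finds something else there could plausibly fail with an error such as "Expected numeric object for object number". The sketch below is plain Java and does not use PDFBox; the class name XRefStmCheck and the method staleOffsets are hypothetical, and it assumes the file fits in memory and that the text "/XRefStm" does not also occur inside binary stream data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic helper, not part of PDFBox. */
final class XRefStmCheck {
    private static final Pattern XREF_STM = Pattern.compile("/XRefStm\\s+(\\d+)");
    // An indirect object starts with "<object number> <generation> obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    /** Returns every /XRefStm offset that does not point at an object header. */
    static List<Long> staleOffsets(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so a char index
        // in the string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Long> stale = new ArrayList<>();
        Matcher m = XREF_STM.matcher(text);
        while (m.find()) {
            long offset = Long.parseLong(m.group(1));
            boolean pointsAtObject = offset < text.length()
                    && OBJ_HEADER.matcher(text)
                                 .region((int) offset, text.length())
                                 .lookingAt();
            if (!pointsAtObject) {
                stale.add(offset);
            }
        }
        return stale;
    }

    public static void main(String[] args) throws java.io.IOException {
        if (args.length != 1) {
            System.err.println("usage: java XRefStmCheck <file.pdf>");
            return;
        }
        byte[] pdf = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        System.out.println("Stale /XRefStm offsets: " + staleOffsets(pdf));
    }
}
```

Running it on both Word2010.pdf and modified_Word2010.pdf would show whether the value 81702 is valid in the original and dangling in the saved copy. It only diagnoses the symptom; it does not repair the file.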
>  
> Kind regards,
>  
> Tim Costermans
>  
> From: Tim Costermans
> Sent: Wednesday, 26 March 2014 14:31
> To: '[email protected]'
> Subject: PDFBox 1.8.4 and pdf's generated by MS Word
>  
> Hello,
>  
> It seems that PDFs generated by MS Word 2010 or 2013 are a recipe for
> trouble in combination with PDFBox version 1.8.0 or 1.8.4.
> I upgraded to PDFBox 1.8.4 and one issue remains:
> 
> Caused by: **thirdparty.pdf.exceptions.PDFParsingException: [offset=91308]Expected numeric object for object number
>     at **thirdparty.pdf.exceptions.PDFParsingException.newInstance(PDFParsingException.java:58)
>     at **thirdparty.pdf.io.PDFParser.throwEx(PDFParser.java:1215)
>     at **thirdparty.pdf.io.PDFParser.readCompressedCrossRefTable(PDFParser.java:805)
>     at **thirdparty.pdf.io.PDFParser.readCrossRefTable(PDFParser.java:1175)
>     at **thirdparty.pdf.PDFDocument.open(PDFDocument.java:154)
>     at **thirdparty.PDFDocument.open(PDFDocument.java:124)
>     at com.*****.sign.pdf.PDFPresigner.presign(PDFPresigner.java:24)
>     ... 26 more
> 
> How to reproduce:
> 1) Fire up MS Word 2010, type some text, and save as PDF.
> 2) Open this PDF file with Notepad++; you will notice the following at the
> bottom of the file:
> ...
> trailer
> <</Size 18/Root 1 0 R/Info 7 0 R/ID[<7AE435CBC968B94F8B050F40F6D5CE5F><7AE435CBC968B94F8B050F40F6D5CE5F>] >>
> startxref
> 82089
> %%EOF
> xref
> 0 0
> trailer
> <</Size 18/Root 1 0 R/Info 7 0 R/ID[<7AE435CBC968B94F8B050F40F6D5CE5F><7AE435CBC968B94F8B050F40F6D5CE5F>] /Prev 82089/XRefStm 81819>>
> startxref
> 82605
> %%EOF
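
The tail of the file shown above contains two startxref entries, one per revision. To get the same overview without scrolling through Notepad++, a small scan can list every startxref offset and report what sits at that offset: a classic "xref" table, an indirect object (as a cross-reference stream would be), or something unrecognised. This is a minimal sketch in plain Java rather than PDFBox code; StartXrefLister and describe are hypothetical names, and the classification is a heuristic based only on the first bytes at each offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic helper, not part of PDFBox. */
final class StartXrefLister {
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d+)");
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    /** Returns one "offset -> kind" line per startxref entry, in file order. */
    static List<String> describe(byte[] pdf) {
        // ISO-8859-1 keeps char indices identical to byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> lines = new ArrayList<>();
        Matcher m = START_XREF.matcher(text);
        while (m.find()) {
            long offset = Long.parseLong(m.group(1));
            String kind;
            if (offset >= text.length()) {
                kind = "out of range";
            } else if (text.startsWith("xref", (int) offset)) {
                kind = "table";   // classic cross-reference table
            } else if (OBJ_HEADER.matcher(text)
                                 .region((int) offset, text.length())
                                 .lookingAt()) {
                kind = "stream";  // indirect object, e.g. a cross-reference stream
            } else {
                kind = "unknown";
            }
            lines.add(offset + " -> " + kind);
        }
        return lines;
    }
}
```

Calling StartXrefLister.describe(bytes) on the original and on the saved file gives one line per revision, which makes it easier to see how many revisions survive a save and whether each startxref still points at a cross-reference section.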
>  
> Our application is trying to add an image to this PDF using PDFBox. When
> calling PDFDocument.save(), the "revisions" are merged and a new PDF is
> created.
> The newly created PDF is passed to a third party that tries to open it,
> but this fails because XRefStm is not correctly updated during save.
> Probably related to https://issues.apache.org/jira/browse/PDFBOX-1822
>  
> I also tried PDFDocument.incrementalSave(), but then I run into a
> NullPointerException caused by PDFXRefStream: List<Integer> indexEntry =
> getIndexEntry(); contains two null objects (first and last are still
> null when they are added to the list).
> How do I solve this issue?
> What's the real issue here?
> I'm not in control of the PDFs that the application can receive.
>  
> I also ran into the following bug but worked around it:
> https://issues.apache.org/jira/browse/PDFBOX-1838
