Hi,

You are using an ancient version of PDFBox: 1.8.12 was released in 2016. You should update to at least a recent 2.0 version, but the preferred version is 3.0.5.

Andreas

On 14.08.25 at 16:24, Poisson, David (DGRI) wrote:
Hi everyone! Sorry for the lengthy description, but I didn't want to leave out 
crucial details.

In one of our legacy systems, we use PDFBox 1.8.12 (on Corretto JRE 11) to load 
a PDF (converted before reaching this system) and strip the text for processing.
The PDFs received by the legacy system are generated by a separate system that 
uses a 3rd party library (version 3 is in production).

We are in the process of migrating to a newer version of this library (version 
4) and have started doing regression testing.
We have found that about 450 PDF/A documents out of about 11500 test documents 
fail to be opened by PDFBox 1.8.12.
One such PDF generates the following exception when read:
[multiple stream length is wrong message]
2025-07-25 15:45:40,763 [main] WARN  org.apache.pdfbox.pdfparser.BaseParser - 
Specified stream length 3686 is wrong. Fall back to reading stream until 
'endstream'.
[multiple stream length is wrong message]
java.lang.Throwable: java.io.IOException: expected='endstream' actual='' at 
offset 93525
         [company internal class stack redacted]
Caused by: java.io.IOException: expected='endstream' actual='' at offset 93525
         at 
org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609) 
~[pdfbox-1.8.12.jar:1.8.12]
         at 
org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:650) 
~[pdfbox-1.8.12.jar:1.8.12]
         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203) 
~[pdfbox-1.8.12.jar:1.8.12]
         [company internal class stack redacted]
         ... 9 more
I left the message "Specified stream length 3686 is wrong" in the above output 
because offset 93525 seems to fall inside the "/Filter /FlateDecode" stream object 
that has a declared /Length of 3686.
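
For anyone who wants to check this kind of length mismatch without PDFBox, here 
is a minimal sketch in plain Java. The class and method names are made up for 
illustration, and this is not how PDFBox itself parses streams. Given the byte 
offset of a "stream" keyword and the declared /Length, it tests whether 
"endstream" really follows the declared number of data bytes:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
class StreamLengthCheck {

    private static final byte[] STREAM = "stream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    /** True if the bytes starting at pos equal token. */
    static boolean matchesAt(byte[] data, int pos, byte[] token) {
        if (pos < 0 || pos + token.length > data.length) {
            return false;
        }
        for (int i = 0; i < token.length; i++) {
            if (data[pos + i] != token[i]) {
                return false;
            }
        }
        return true;
    }

    /** Offset of the first data byte after the "stream" keyword, or -1 if the keyword is not there. */
    static int dataStart(byte[] pdf, int keywordOffset) {
        if (!matchesAt(pdf, keywordOffset, STREAM)) {
            return -1;
        }
        int pos = keywordOffset + STREAM.length;
        // The keyword is normally followed by CRLF or LF; a lone CR is tolerated here too.
        if (pos < pdf.length && pdf[pos] == '\r') pos++;
        if (pos < pdf.length && pdf[pos] == '\n') pos++;
        return pos;
    }

    /** True if "endstream" follows declaredLength data bytes (plus an optional end-of-line). */
    static boolean declaredLengthMatches(byte[] pdf, int keywordOffset, int declaredLength) {
        int start = dataStart(pdf, keywordOffset);
        if (start < 0 || declaredLength < 0 || declaredLength > pdf.length - start) {
            return false;
        }
        int pos = start + declaredLength;
        if (pos < pdf.length && pdf[pos] == '\r') pos++;
        if (pos < pdf.length && pdf[pos] == '\n') pos++;
        return matchesAt(pdf, pos, ENDSTREAM);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.out.println("usage: StreamLengthCheck <file> <streamKeywordOffset> <declaredLength>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        boolean ok = declaredLengthMatches(pdf, Integer.parseInt(args[1]), Integer.parseInt(args[2]));
        System.out.println(ok ? "declared length matches" : "declared length does NOT match");
    }
}
```

Run against the failing file with the offset of the relevant "stream" keyword 
and 3686 as the length, a negative result would agree with the parser's warning 
that the declared length is wrong.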

I have found the following existing issue that closely resembles my situation: 
https://issues.apache.org/jira/browse/PDFBOX-4704

As per issue PDFBOX-4704, I have tried to update to the latest 1.8 version 
(1.8.17) and replace:
document = PDDocument.load(iStream);

By the following:
RandomAccessBuffer buffer = new RandomAccessBuffer();
document = PDDocument.loadNonSeq(iStream, buffer);

When I try to process the PDF, I get a different exception:
2025-07-29 13:54:05,987 [main] ERROR 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser - The end of the stream is 
out of range, using workaround to read the stream
2025-07-29 13:54:05,987 [main] ERROR 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser - Stream start offset: 92988
2025-07-29 13:54:05,987 [main] ERROR 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser - Expected endofstream 
offset: 93714
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - 
FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] INFO  [internal class] - Exception getMessage(): 
null
2025-07-29 13:54:05,992 [main] INFO  [internal class]- Exception getCause(): 
java.util.zip.DataFormatException: too many length or distance symbols
java.lang.Throwable: java.io.IOException
         [internal stack class redacted]
Caused by: java.io.IOException
         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:108) 
~[pdfbox-1.8.17.jar:1.8.17]
         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379) 
~[pdfbox-1.8.17.jar:1.8.17]
         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291) 
~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225) 
~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
 ~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:976) 
~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:667)
 ~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXref(NonSequentialPDFParser.java:621)
 ~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:351)
 ~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:928)
 ~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1332) 
~[pdfbox-1.8.17.jar:1.8.17]
         at 
org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1315) 
~[pdfbox-1.8.17.jar:1.8.17]
         [internal class stack redacted]
         ... 11 more
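
The DataFormatException above comes from java.util.zip, so one way to tell a 
corrupt stream from a parser that is reading the wrong byte range is to feed the 
raw stream bytes to an Inflater directly. The following is only a sketch under 
that assumption: the class and method names are hypothetical, and the bytes 
between "stream" and "endstream" have to be extracted separately.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
class FlateProbe {

    /**
     * Tries to inflate raw zlib data.
     * Returns the number of bytes produced, or -1 if the data is not valid zlib.
     * Truncated input yields the count of bytes decoded before the input ran out.
     */
    static int tryInflate(byte[] raw) {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        byte[] buf = new byte[4096];
        int total = 0;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // more input or a preset dictionary would be needed
                }
                total += n;
            }
            return total;
        } catch (DataFormatException e) {
            return -1;
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: FlateProbe <fileWithRawStreamBytes>");
            return;
        }
        int n = tryInflate(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(n < 0 ? "not valid zlib data" : "inflated " + n + " bytes");
    }
}
```

If the bytes taken from the correct range inflate cleanly, the stream data 
itself is probably fine and the problem is more likely in how the stream 
boundaries are determined.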

If I open the PDF in Acrobat and save it as a PDF/A, then the resulting PDF 
file can be opened with PDFBox 1.8.12 without a hitch.

We ran the PDF through a validator and got the following report for both the 
original file and the file exported from Acrobat (the reports are identical):
Checking against conformance level PDF/A-1a
Category: Format Message: The file contains cross reference streams. Context: 
file PageNo: N/A
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid 
value '2'. Required is '1'. Context: document metadata PageNo: N/A
Category: Metadata Message: The dictionary must not contain the key 'Filter'. 
Context: metadata of font file of font 'ABCDEE+CG Times' PageNo: 1
Category: Font Message: The key CIDSet is required but missing. Context: font 
descriptor of font 'ABCDEE+CG Times' PageNo: 1
False

Checking against conformance level PDF/A-1b
Category: Format Message: The file contains cross reference streams. Context: 
file PageNo: N/A
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid 
value '2'. Required is '1'. Context: document metadata PageNo: N/A
Category: Metadata Message: The dictionary must not contain the key 'Filter'. 
Context: metadata of font file of font 'ABCDEE+CG Times' PageNo: 1
Category: Font Message: The key CIDSet is required but missing. Context: font 
descriptor of font 'ABCDEE+CG Times' PageNo: 1
False

Checking against conformance level PDF/A-2a
True

Checking against conformance level PDF/A-2b
True

Checking against conformance level PDF/A-2u
True

Checking against conformance level PDF/A-3a
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid 
value '2'. Required is '3'. Context: document metadata PageNo: N/A
False

Checking against conformance level PDF/A-3b
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid 
value '2'. Required is '3'. Context: document metadata PageNo: N/A
False

Checking against conformance level PDF/A-3u
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid 
value '2'. Required is '3'. Context: document metadata PageNo: N/A
False

Of particular interest are the messages about containing cross reference 
streams.

Here are the PDFs in question (I didn't want to attach 3 PDFs to the email, so 
here's a link to a Google Drive folder that has all 3 PDFs):
https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
v3.PDF: conversion result using version 3 of our conversion library, works well 
in PDFBox 1.8.12
v4.PDF: conversion result using version 4 of our conversion library, gives 
errors in PDFBox
v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works well in 
PDFBox 1.8.12

I'm running out of ideas about where to look for the problem or the solution: 
is the generated PDF corrupt, or is this a PDFBox bug?

David Poisson




