Hi,
you are using an ancient version of PDFBox. 1.8.12 was released in 2016.
You should update at least to a recent 2.0 version, but the preferred
version is 3.0.5
Andreas
On 14.08.25 at 16:24, Poisson, David (DGRI) wrote:
Hi everyone! Sorry for the lengthy description, but I didn't want to leave out
crucial details.
In one of our legacy systems, we use PDFBox 1.8.12 (on Corretto JRE 11) to load
a PDF (converted before reaching this system) and strip the text for processing.
The PDFs received by the legacy system are generated by a separate system that
uses a 3rd party library (version 3 is in production).
We are in the process of migrating to a newer version of this library (version
4) and have started doing regression testing.
We have found that about 450 PDF/A documents out of about 11500 test documents
fail to be opened by PDFBox 1.8.12.
One such PDF generates the following exception when read:
[multiple stream length is wrong message]
2025-07-25 15:45:40,763 [main] WARN org.apache.pdfbox.pdfparser.BaseParser - Specified stream length 3686 is wrong. Fall back to reading stream until 'endstream'.
[multiple stream length is wrong message]
java.lang.Throwable: java.io.IOException: expected='endstream' actual='' at offset 93525
[company internal class stack redacted]
Caused by: java.io.IOException: expected='endstream' actual='' at offset 93525
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609) ~[pdfbox-1.8.12.jar:1.8.12]
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:650) ~[pdfbox-1.8.12.jar:1.8.12]
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203) ~[pdfbox-1.8.12.jar:1.8.12]
[company internal class stack redacted]
... 9 more
I left the message "Specified stream length 3686 is wrong" in the above output
because offset 93525 seems to fall inside the "/Filter /FlateDecode" stream
object that has a declared /Length of 3686.
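A declared /Length can also be checked against the raw bytes without going through PDFBox. The following is only a minimal sketch using the JDK alone: the class and method names (StreamLengthCheck, lengthMatches) are hypothetical, and dataStart (the offset of the first byte after the end-of-line that follows the "stream" keyword) has to be found by hand, for example in a hex editor. It only tests whether "endstream" sits where the declared length says it should.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class StreamLengthCheck {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    /** True if prefix occurs in data exactly at index at. */
    static boolean startsWith(byte[] data, byte[] prefix, int at) {
        if (at < 0 || at + prefix.length > data.length) {
            return false;
        }
        for (int j = 0; j < prefix.length; j++) {
            if (data[at + j] != prefix[j]) {
                return false;
            }
        }
        return true;
    }

    /**
     * True if "endstream" begins at dataStart + declaredLength,
     * allowing for an optional CR and/or LF before the keyword.
     */
    static boolean lengthMatches(byte[] pdf, int dataStart, int declaredLength) {
        long end = (long) dataStart + declaredLength;
        if (dataStart < 0 || declaredLength < 0 || end > pdf.length) {
            return false; // the declared length runs past the end of the file
        }
        int pos = (int) end;
        if (pos < pdf.length && pdf[pos] == '\r') {
            pos++;
        }
        if (pos < pdf.length && pdf[pos] == '\n') {
            pos++;
        }
        return startsWith(pdf, ENDSTREAM, pos);
    }

    // Usage: java StreamLengthCheck <file> <dataStart> <declaredLength>
    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        int dataStart = Integer.parseInt(args[1]);
        int declaredLength = Integer.parseInt(args[2]);
        System.out.println(lengthMatches(pdf, dataStart, declaredLength)
                ? "endstream found where /Length says it should be"
                : "endstream NOT found at dataStart + /Length");
    }
}
```

If this check fails for the stream with /Length 3686, that would point at the file (or at a wrongly chosen dataStart) rather than at the parser.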
I have found the following existing issue that closely resembles my situation:
https://issues.apache.org/jira/browse/PDFBOX-4704
As per issue PDFBOX-4704, I have tried to update to the latest 1.8 version
(1.8.17) and replace:
document = PDDocument.load(iStream);
with the following:
RandomAccessBuffer buffer = new RandomAccessBuffer();
document = PDDocument.loadNonSeq(iStream, buffer);
When I try to process the PDF, I get a different exception:
2025-07-29 13:54:05,987 [main] ERROR org.apache.pdfbox.pdfparser.NonSequentialPDFParser - The end of the stream is out of range, using workaround to read the stream
2025-07-29 13:54:05,987 [main] ERROR org.apache.pdfbox.pdfparser.NonSequentialPDFParser - Stream start offset: 92988
2025-07-29 13:54:05,987 [main] ERROR org.apache.pdfbox.pdfparser.NonSequentialPDFParser - Expected endofstream offset: 93714
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] INFO [internal class] - Exception getMessage(): null
2025-07-29 13:54:05,992 [main] INFO [internal class] - Exception getCause(): java.util.zip.DataFormatException: too many length or distance symbols
java.lang.Throwable: java.io.IOException
[internal stack class redacted]
Caused by: java.io.IOException
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:108) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:976) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:667) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXref(NonSequentialPDFParser.java:621) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:351) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:928) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1332) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1315) ~[pdfbox-1.8.17.jar:1.8.17]
[internal class stack redacted]
... 11 more
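The DataFormatException above comes from the JDK's own java.util.zip inflater, so the same byte range can be probed directly to see whether it is a valid zlib stream at all. Below is a rough sketch under stated assumptions: the class and method names (FlateProbe, inflatedSize) are invented for illustration, and the offset and length passed in are whatever range one believes the stream data occupies (the log above reports a start offset of 92988 and an expected end at 93714, but whether those are the true data bounds is exactly what is in question).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
final class FlateProbe {
    /**
     * Inflates data[offset .. offset+length) as a zlib stream.
     * Returns the number of decompressed bytes, or -1 if the range is
     * corrupt, truncated, or needs a preset dictionary.
     */
    static long inflatedSize(byte[] data, int offset, int length) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data, offset, length);
            byte[] buffer = new byte[8192];
            long total = 0;
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return -1; // input ended before the zlib stream did
                }
                total += n;
            }
            return total;
        } catch (DataFormatException e) {
            return -1; // same failure class that FlateFilter logs
        } finally {
            inflater.end();
        }
    }

    // Usage: java FlateProbe <file> <offset> <length>
    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long size = inflatedSize(pdf, Integer.parseInt(args[1]), Integer.parseInt(args[2]));
        System.out.println(size < 0 ? "not a valid zlib stream" : "inflated to " + size + " bytes");
    }
}
```

If the range inflates cleanly here but PDFBox 1.8 fails on it, the offsets PDFBox derived are the more likely suspect than the compressed data itself.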
If I open the PDF in Acrobat and save it as a PDF/A, then the resulting PDF
file can be opened with PDFBox 1.8.12 without a hitch.
We ran the PDF through a validator and got the following report for both the
original file and the file exported from Acrobat (exact same report):
Checking against conformance level PDF/A-1a
Category: Format Message: The file contains cross reference streams. Context:
file PageNo: N/A
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid
value '2'. Required is '1'. Context: document metadata PageNo: N/A
Category: Metadata Message: The dictionary must not contain the key 'Filter'.
Context: metadata of font file of font 'ABCDEE+CG Times' PageNo: 1
Category: Font Message: The key CIDSet is required but missing. Context: font
descriptor of font 'ABCDEE+CG Times' PageNo: 1
False
Checking against conformance level PDF/A-1b
Category: Format Message: The file contains cross reference streams. Context:
file PageNo: N/A
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid
value '2'. Required is '1'. Context: document metadata PageNo: N/A
Category: Metadata Message: The dictionary must not contain the key 'Filter'.
Context: metadata of font file of font 'ABCDEE+CG Times' PageNo: 1
Category: Font Message: The key CIDSet is required but missing. Context: font
descriptor of font 'ABCDEE+CG Times' PageNo: 1
False
Checking against conformance level PDF/A-2a
True
Checking against conformance level PDF/A-2b
True
Checking against conformance level PDF/A-2u
True
Checking against conformance level PDF/A-3a
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid
value '2'. Required is '3'. Context: document metadata PageNo: N/A
False
Checking against conformance level PDF/A-3b
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid
value '2'. Required is '3'. Context: document metadata PageNo: N/A
False
Checking against conformance level PDF/A-3u
Category: Metadata Message: The XMP property 'pdfaid:part' has the invalid
value '2'. Required is '3'. Context: document metadata PageNo: N/A
False
Of particular interest are the messages about containing cross reference
streams.
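Whether a given file uses a classic cross-reference table or a cross-reference stream can be checked from the raw bytes. The sketch below is only an approximation using the JDK: the names (XrefKind, detect) are hypothetical, it follows only the last "startxref" entry, and it ignores hybrid files and incremental updates. It reads the offset after "startxref" and looks at what starts there: the keyword "xref" means a table, an indirect object header ("N G obj") is taken to mean a stream.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class XrefKind {
    enum Kind { TABLE, STREAM, UNKNOWN }

    static Kind detect(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so string indexes equal file offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int marker = text.lastIndexOf("startxref");
        if (marker < 0) {
            return Kind.UNKNOWN;
        }
        int pos = marker + "startxref".length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int digitsStart = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        // Require 1 to 9 digits so the value always fits in an int.
        if (pos == digitsStart || pos - digitsStart > 9) {
            return Kind.UNKNOWN;
        }
        int offset = Integer.parseInt(text.substring(digitsStart, pos));
        while (offset < text.length() && Character.isWhitespace(text.charAt(offset))) {
            offset++;
        }
        if (offset >= text.length()) {
            return Kind.UNKNOWN;
        }
        if (text.startsWith("xref", offset)) {
            return Kind.TABLE;
        }
        // A cross-reference stream is stored as an indirect object: "<num> <gen> obj".
        String head = text.substring(offset, Math.min(offset + 32, text.length()));
        return head.matches("(?s)\\d+\\s+\\d+\\s+obj.*") ? Kind.STREAM : Kind.UNKNOWN;
    }

    // Usage: java XrefKind <file>
    public static void main(String[] args) throws IOException {
        System.out.println(detect(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Running this on v3.PDF, v4.PDF and v4-fixedByAcrobat.pdf would show whether the cross-reference layout is what differs between the working and failing files.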
Here are the PDFs in question (I didn't want to attach 3 PDFs to the email, so
here's a link to my Google Drive folder that has all 3 PDFs):
https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
v3.PDF: conversion result using version 3 of our conversion library, works well
in PDFBox 1.8.12
v4.PDF: conversion result using version 4 of our conversion library, gives
errors in PDFBox
v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works well in
PDFBox 1.8.12
I'm running out of ideas about where to look for the problem/solution: Is the
generated PDF corrupt or is it a PDFBox bug?
David Poisson