Hi everyone! Sorry for the lengthy description, but I didn't want to leave out crucial details.
In one of our legacy systems, we use PDFBox 1.8.12 (on Corretto JRE 11) to load a PDF (converted before reaching this system) and extract the text for processing. The PDFs received by the legacy system are generated by a separate system that uses a 3rd party library (version 3 is in production). We are in the process of migrating to a newer version of this library (version 4) and have started regression testing. We have found that about 450 PDF/A documents out of about 11500 test documents fail to be opened by PDFBox 1.8.12.

One such PDF generates the following exception when read:

[multiple stream length is wrong message]
2025-07-25 15:45:40,763 [main] WARN org.apache.pdfbox.pdfparser.BaseParser - Specified stream length 3686 is wrong. Fall back to reading stream until 'endstream'.
[multiple stream length is wrong message]
java.lang.Throwable: java.io.IOException: expected='endstream' actual='' at offset 93525
[company internal class stack redacted]
Caused by: java.io.IOException: expected='endstream' actual='' at offset 93525
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609) ~[pdfbox-1.8.12.jar:1.8.12]
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:650) ~[pdfbox-1.8.12.jar:1.8.12]
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203) ~[pdfbox-1.8.12.jar:1.8.12]
[company internal class stack redacted]
    ... 9 more

I left the message "Specified stream length 3686 is wrong" in the above output because the offset 93525 seems to fall in the "/Filter /FlateDecode" object that has a declared /Length of 3686.
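To double-check that kind of observation outside of PDFBox, a small stdlib-only sketch like the one below can compare a stream's declared /Length against the place where 'endstream' really sits. This is only an illustration under assumptions: the class and method names (StreamLengthCheck, findDataStart, lengthMatches) are made up, it only handles a direct integer /Length passed in by hand, it assumes the word "stream" does not appear earlier in the dictionary, and it is not how PDFBox itself parses streams.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: checks a declared stream length by hand.
final class StreamLengthCheck {

    private static final byte[] STREAM = "stream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** True if needle occurs in data exactly at position pos. */
    static boolean matchesAt(byte[] data, int pos, byte[] needle) {
        if (pos < 0 || pos + needle.length > data.length) {
            return false;
        }
        for (int i = 0; i < needle.length; i++) {
            if (data[pos + i] != needle[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Finds the next "stream" keyword at or after 'from' and returns the offset
     * of the first data byte after it (skipping one CRLF or LF), or -1 if none.
     */
    static int findDataStart(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + STREAM.length <= data.length; i++) {
            if (matchesAt(data, i, STREAM)) {
                int pos = i + STREAM.length;
                if (pos < data.length && data[pos] == '\r') {
                    pos++;
                }
                if (pos < data.length && data[pos] == '\n') {
                    pos++;
                }
                return pos;
            }
        }
        return -1;
    }

    /**
     * True if "endstream" is found right after dataStart + declaredLength,
     * tolerating end-of-line bytes in between.
     */
    static boolean lengthMatches(byte[] data, int dataStart, int declaredLength) {
        if (dataStart < 0 || declaredLength < 0) {
            return false;
        }
        long end = (long) dataStart + declaredLength;
        if (end > data.length) {
            return false;
        }
        int pos = (int) end;
        while (pos < data.length && (data[pos] == '\r' || data[pos] == '\n')) {
            pos++;
        }
        return matchesAt(data, pos, ENDSTREAM);
    }

    public static void main(String[] args) {
        // Synthetic sample, not taken from the real PDFs.
        byte[] sample = "<< /Length 5 >>\nstream\nhello\nendstream\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        int start = findDataStart(sample, 0);
        System.out.println("declared 5 matches: " + lengthMatches(sample, start, 5));
        System.out.println("declared 3 matches: " + lengthMatches(sample, start, 3));
    }
}
```

On a real file, the bytes would come from something like Files.readAllBytes, 'from' would be the offset where the object's dictionary begins, and the declared length would be the /Length value read from that dictionary (3686 in my case).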
I have found the following existing issue that closely resembles my situation: https://issues.apache.org/jira/browse/PDFBOX-4704

As per issue PDFBOX-4704, I have tried to update to the latest 1.8 version (1.8.17) and replace:

    document = PDDocument.load(iStream);

by the following:

    RandomAccessBuffer buffer = new RandomAccessBuffer();
    document = PDDocument.loadNonSeq(iStream, buffer);

When I try to process the PDF, I get a different exception:

2025-07-29 13:54:05,987 [main] ERROR org.apache.pdfbox.pdfparser.NonSequentialPDFParser - The end of the stream is out of range, using workaround to read the stream
2025-07-29 13:54:05,987 [main] ERROR org.apache.pdfbox.pdfparser.NonSequentialPDFParser - Stream start offset: 92988
2025-07-29 13:54:05,987 [main] ERROR org.apache.pdfbox.pdfparser.NonSequentialPDFParser - Expected endofstream offset: 93714
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,991 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] ERROR org.apache.pdfbox.filter.FlateFilter - FlateFilter: stop reading corrupt stream due to a DataFormatException
2025-07-29 13:54:05,992 [main] INFO [internal class] - Exception getMessage(): null
2025-07-29 13:54:05,992 [main] INFO [internal class] - Exception getCause(): java.util.zip.DataFormatException: too many length or distance symbols
java.lang.Throwable: java.io.IOException
[internal stack class redacted]
Caused by: java.io.IOException
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:108) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:976) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:667) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXref(NonSequentialPDFParser.java:621) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:351) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:928) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1332) ~[pdfbox-1.8.17.jar:1.8.17]
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1315) ~[pdfbox-1.8.17.jar:1.8.17]
[internal class stack redacted]
    ... 11 more

If I open the PDF in Acrobat and save it as a PDF/A, then the resulting PDF file can be opened with PDFBox 1.8.12 without a hitch.

We ran the PDF through a validator and got the exact same report for both the original file and the file exported from Acrobat:

Checking against conformance level PDF/A-1a
    Category: Format
    Message: The file contains cross reference streams.
    Context: file
    PageNo: N/A

    Category: Metadata
    Message: The XMP property 'pdfaid:part' has the invalid value '2'. Required is '1'.
    Context: document metadata
    PageNo: N/A

    Category: Metadata
    Message: The dictionary must not contain the key 'Filter'.
    Context: metadata of font file of font 'ABCDEE+CG Times'
    PageNo: 1

    Category: Font
    Message: The key CIDSet is required but missing.
    Context: font descriptor of font 'ABCDEE+CG Times'
    PageNo: 1

    False

Checking against conformance level PDF/A-1b
    Category: Format
    Message: The file contains cross reference streams.
    Context: file
    PageNo: N/A

    Category: Metadata
    Message: The XMP property 'pdfaid:part' has the invalid value '2'. Required is '1'.
    Context: document metadata
    PageNo: N/A

    Category: Metadata
    Message: The dictionary must not contain the key 'Filter'.
    Context: metadata of font file of font 'ABCDEE+CG Times'
    PageNo: 1

    Category: Font
    Message: The key CIDSet is required but missing.
    Context: font descriptor of font 'ABCDEE+CG Times'
    PageNo: 1

    False

Checking against conformance level PDF/A-2a
    True

Checking against conformance level PDF/A-2b
    True

Checking against conformance level PDF/A-2u
    True

Checking against conformance level PDF/A-3a
    Category: Metadata
    Message: The XMP property 'pdfaid:part' has the invalid value '2'. Required is '3'.
    Context: document metadata
    PageNo: N/A

    False

Checking against conformance level PDF/A-3b
    Category: Metadata
    Message: The XMP property 'pdfaid:part' has the invalid value '2'. Required is '3'.
    Context: document metadata
    PageNo: N/A

    False

Checking against conformance level PDF/A-3u
    Category: Metadata
    Message: The XMP property 'pdfaid:part' has the invalid value '2'. Required is '3'.
    Context: document metadata
    PageNo: N/A

    False

Of particular interest are the messages about the file containing cross reference streams.

Here are the PDFs in question (I didn't want to attach 3 PDFs to the email, so here's a link to a Google Drive folder that has all 3): https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing

v3.PDF: conversion result using version 3 of our conversion library, works well in PDFBox 1.8.12
v4.PDF: conversion result using version 4 of our conversion library, gives errors in PDFBox
v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat, works well in PDFBox 1.8.12

I'm running out of ideas of where to look for the problem/solution: is the generated PDF corrupt, or is it a PDFBox bug?

David Poisson