Am 15.08.2025 um 15:57 schrieb Poisson, David (DGRI):

Hi Tilman, appreciate your reply!

I'm not that familiar with the internal structure of PDF files, so I appreciate 
the fact that you confirm that they are valid.
When I use the "file" command in Linux, it reports all the files as 
application/pdf.

I was also able to write a sample program which loads the PDF as per the PDFBox 
documentation. It works well.

I pulled out all the statements that work with PDFBox objects from our legacy 
system's code and put it in a self-contained project and I think I can 
reproduce the problem.
The input stream is used for 2 passes through the document.
The first pass goes through all pages and determines the text location on each 
page.
The second pass extracts all the text, which is then cleaned up (removing 
whatever is in the margins and at the top/bottom of pages).

The error occurs when we re-use the input stream in the second pass to create 
the Parser.

I've added an inputStream.reset() in between the two passes in my 
self-contained project and the error goes away.
I'm in the process of making the modification in our legacy system and will 
test it to see if that helps us with the PDF files that cannot be opened.
What I can't explain though, is why some PDF files are going through without 
this reset()?
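
The effect of reading a stream twice can be sketched with plain java.io, 
without PDFBox. This is only an illustration under assumptions: the class name 
StreamReuseDemo, the helper drain() and the placeholder bytes are made up for 
the example, and a ByteArrayInputStream stands in for whatever stream the 
legacy system actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamReuseDemo {

    // Hypothetical helper: reads whatever is left in the stream
    // and returns the number of bytes it saw.
    static int drain(InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content, not a real PDF.
        byte[] data = "%PDF-1.4 placeholder bytes".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(data);

        int firstPass = drain(in);   // consumes the whole stream
        int secondPass = drain(in);  // stream is already at its end
        in.reset();                  // rewinds to the start (no mark was set)
        int afterReset = drain(in);  // sees all bytes again

        System.out.println("first pass:  " + firstPass);
        System.out.println("second pass: " + secondPass);
        System.out.println("after reset: " + afterReset);
    }
}
```

Note that reset() is only usable on streams whose markSupported() returns 
true; a plain FileInputStream, for instance, does not support it.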

I don't know, and I'm not curious enough to find out why incorrect code sometimes works.

You should be able to work with the same PDDocument object for both passes. That second pass is a bit weird. You don't need to call "new PDFParser("; this is very "old style". If you have a file in your production code, use that one. If you have a byte array, use that one directly (in PDFBox 2.0 and 3.0). 3.0 should be faster with a file because it parses on demand.
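
If the production code only receives an InputStream, one way to get to a byte 
array is to read the stream once and keep the bytes. The sketch below uses 
only java.io; the class name BufferedSource and its methods are hypothetical, 
and the PDFBox loading call itself is deliberately left out. The manual copy 
loop is used on the assumption that the legacy system may run on an older Java 
version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedSource {

    private final byte[] bytes;

    // Reads the stream to its end exactly once and keeps the content in memory.
    public BufferedSource(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        this.bytes = out.toByteArray();
    }

    // The buffered content, for APIs that accept a byte array directly.
    public byte[] toByteArray() {
        return bytes.clone();
    }

    // A fresh, independent stream positioned at the first byte.
    public InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    public int size() {
        return bytes.length;
    }
}
```

With this, each pass (or a single document load shared by both passes) starts 
from the full content, and no pass depends on where an earlier one left the 
stream. The cost is that the whole file is held in memory.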

Tilman



Not sure what the best way to share this project is. I've put it up on my 
Google Drive (where the PDFs were).
It's a Java Maven project called PDFLoader.zip (for convenience, I have the 3 
PDF files at the root of the project)
https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
When you run the main(), you'll get an exception with v4.PDF. Simply uncomment 
line 62 and it should work.

David Poisson


-----Original Message-----
From: Tilman Hausherr <thaush...@t-online.de>
Sent: 14 August 2025 11:59
To: users@pdfbox.apache.org
Subject: Re: Getting IOException: expected: 'endstream' actual: '' at offset X

Am 14.08.2025 um 16:24 schrieb Poisson, David (DGRI):
Here are the PDFs in question (didn't want to add 3 PDFs to the email, so 
here's a link to my Google Drive folder that has all 3 PDFs):
https://can01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdriv
e.google.com%2Fdrive%2Ffolders%2F1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z%3Fu
sp%3Dsharing&data=05%7C02%7CDavid.Poisson%40mrnf.gouv.qc.ca%7C5f7e76cb
23414628c8f808dddb4b7433%7C8705e97737814f4790e1c84c8b884da1%7C0%7C0%7C
638907839441012283%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlY
iOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%
7C%7C%7C&sdata=%2Fjl5Z3feolFod9uH7AhacRPhhmPTJIYpbeMw55POnB8%3D&reserv
ed=0
v3.PDF: conversion result using version 3 of our conversion library,
works well in PDFBox 1.8.12
v4.PDF: conversion result using version 4 of our conversion library,
gives errors in PDFBox
v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works
well in PDFBox 1.8.12
I had no trouble doing a text extraction with 1.8.12, 1.8.16 and 1.8.17 on v3 
and v4 using pdfbox-app. Makes me wonder if there's either a problem with 
PDFBox when using an input stream, or if something goes wrong when you read the 
file (maybe wrong mime type so it's passed as text)

Re the PDF/A problems:

Your file is a (correct) PDF/A-2a, and you checked it to be PDF/A-1b, which it 
isn't.

     Checking against conformance level PDF/A-2a
     True

     Checking against conformance level PDF/A-2b
     True

     Checking against conformance level PDF/A-2u
     True

That's all you need!

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org


