
Hi Tilman, appreciate your reply!

I'm not that familiar with the internal structure of PDF files, so I appreciate 
you confirming that they are valid.
When I run the "file" command on Linux, it reports all of the files as 
application/pdf.

I was also able to write a sample program which loads the PDF as per the PDFBox 
documentation. It works well.

I pulled all the statements that work with PDFBox objects out of our legacy 
system's code and put them in a self-contained project, and I think I can 
reproduce the problem.
The same input stream is used for two passes through the document.
The first pass goes through all pages and determines the text location on each 
page.
The second pass extracts all the text, which is then cleaned up (removing 
whatever is in the margins and at the top/bottom of pages).

The error occurs when we reuse the input stream in the second pass to create 
the Parser.

I've added an inputStream.reset() between the two passes in my 
self-contained project and the error goes away.
I'm in the process of making the same modification in our legacy system and 
will test whether it helps with the PDF files that cannot be opened.
What I can't explain, though, is why some PDF files go through without 
this reset().
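
For reference, the general stream behaviour behind this can be shown without PDFBox. The following is only a minimal sketch and not the code from PDFLoader.zip: the class `TwoPassStreamDemo` and the helpers `consume` and `readAll` are made-up names, and `consume` merely stands in for a pass that reads the stream to its end. It shows that a second pass over an already-consumed stream sees no data, that `reset()` rewinds a `ByteArrayInputStream`, and that buffering the bytes once lets each pass open its own fresh stream. It does not explain why some files pass without the reset().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TwoPassStreamDemo {

    // Hypothetical stand-in for one "pass": reads the stream to its end
    // and returns how many bytes were available to this pass.
    static int consume(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    // Copies the whole stream into memory so that each pass can be given
    // its own independent ByteArrayInputStream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes, not a real PDF.
        byte[] data = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);

        InputStream in = new ByteArrayInputStream(data);
        System.out.println("pass 1 saw " + consume(in) + " bytes");
        System.out.println("pass 2 without reset saw " + consume(in) + " bytes");

        // A ByteArrayInputStream with no explicit mark() rewinds to its start.
        // Other streams only support this if markSupported() returns true
        // and mark() was called beforehand.
        in.reset();
        System.out.println("pass 2 after reset saw " + consume(in) + " bytes");

        // Alternative: buffer once, then open a fresh stream per pass.
        byte[] copy = readAll(new ByteArrayInputStream(data));
        System.out.println("fresh stream saw " + consume(new ByteArrayInputStream(copy)) + " bytes");
    }
}
```

Note that reset() is not available on every stream type; a plain FileInputStream, for example, reports markSupported() as false, so buffering the bytes or reopening the file is the safer option there.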

I'm not sure what the best way to share this project is. I've put it on my 
Google Drive (in the same folder as the PDFs).
It's a Java Maven project called PDFLoader.zip (for convenience, the 3 
PDF files are at the root of the project):
https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
When you run main(), you'll get an exception with v4.PDF. Simply uncomment 
line 62 and it should work.

David Poisson


-----Original Message-----
From: Tilman Hausherr <thaush...@t-online.de>
Sent: August 14, 2025 11:59
To: users@pdfbox.apache.org
Subject: Re: Getting IOException: expected: 'endstream' actual: '' at offset X

On 14.08.2025 at 16:24, Poisson, David (DGRI) wrote:
> Here are the PDF's in question (didn't want to add 3 PDF's to the email, so 
> here's a link to my google drive's folder that has all 3 PDF's):
> https://can01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdriv
> e.google.com%2Fdrive%2Ffolders%2F1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z%3Fu
> sp%3Dsharing&data=05%7C02%7CDavid.Poisson%40mrnf.gouv.qc.ca%7C5f7e76cb
> 23414628c8f808dddb4b7433%7C8705e97737814f4790e1c84c8b884da1%7C0%7C0%7C
> 638907839441012283%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlY
> iOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%
> 7C%7C%7C&sdata=%2Fjl5Z3feolFod9uH7AhacRPhhmPTJIYpbeMw55POnB8%3D&reserv
> ed=0
> v3.PDF: conversion result using version 3 of our conversion library,
> works well in PDFBox 1.8.12
> v4.PDF: conversion result using version 4 of our conversion library,
> gives errors in PDFBox
> v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works
> well in PDFBox 1.8.12

I had no trouble doing a text extraction with 1.8.12, 1.8.16 and 1.8.17 on v3 
and v4 using pdfbox-app. That makes me wonder whether there's a problem with 
PDFBox when using an input stream, or whether something goes wrong when you 
read the file (maybe a wrong MIME type, so it's passed as text).

Re the PDF/A problems:

Your file is a (correct) PDF/A-2a, and you checked it against PDF/A-1b, which 
it isn't.

    Checking against conformance level PDF/A-2a
    True

    Checking against conformance level PDF/A-2b
    True

    Checking against conformance level PDF/A-2u
    True

That's all you need!

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
