Hi Tilman, appreciate your reply!
I'm not that familiar with the internal structure of PDF files, so I appreciate you confirming that they are valid. When I run the "file" command on Linux, it reports all of the files as application/pdf. I was also able to write a sample program that loads the PDF as described in the PDFBox documentation, and it works well.

I pulled all the statements that work with PDFBox objects out of our legacy system's code, put them in a self-contained project, and I think I can reproduce the problem. The input stream is used for 2 passes through the document. The first pass goes through all pages and determines the text location on each page. The second pass extracts all the text, which is then cleaned up (removing what is in the margins and at the top/bottom of pages). The error occurs when we re-use the input stream in the second pass to create the Parser.

I've added an inputStream.reset() between the two passes in my self-contained project and the error goes away. I'm in the process of making the same modification in our legacy system and will test it to see if it helps with the PDF files that cannot be opened. What I can't explain, though, is why some PDF files go through without this reset().

I'm not sure what the best way to share this project is, so I've put it up on my Google Drive (in the same folder as the PDFs). It's a Java Maven project called PDFLoader.zip (for convenience, the 3 PDF files are at the root of the project):
https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing

When you run the main(), you'll get an exception with v4.PDF. Simply uncomment line 62 and it should work.
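To illustrate the two-pass behaviour described above, here is a minimal sketch that uses only java.io, not the code from PDFLoader.zip. The class name TwoPassStreamSketch and the helper consume() are hypothetical, and consume() is only a stand-in for a parsing pass that reads the stream to its end; no PDFBox calls are shown. It assumes the stream supports reset(), which is true for ByteArrayInputStream but not for every InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TwoPassStreamSketch {

    // Hypothetical stand-in for one parsing pass: reads the stream
    // to the end and returns how many bytes were available to it.
    static int consume(InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes, not a real PDF; only the stream position matters here.
        byte[] data = "%PDF placeholder content".getBytes(StandardCharsets.US_ASCII);
        InputStream shared = new ByteArrayInputStream(data);

        // Pass 1 reads everything and leaves the stream at its end.
        int firstPass = consume(shared);

        // Pass 2 on the same stream, without reset(), has nothing left to read.
        int secondPassNoReset = consume(shared);

        // A ByteArrayInputStream with no mark set rewinds to the start on reset().
        // Other stream types may need mark() first, or may not support reset() at all.
        shared.reset();
        int secondPassAfterReset = consume(shared);

        System.out.println("pass 1: " + firstPass + " bytes");
        System.out.println("pass 2 without reset: " + secondPassNoReset + " bytes");
        System.out.println("pass 2 after reset: " + secondPassAfterReset + " bytes");
    }
}
```

If the real stream does not support reset(), one alternative is to read the file into a byte array once and create a fresh ByteArrayInputStream for each pass.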
David Poisson

-----Original Message-----
From: Tilman Hausherr <thaush...@t-online.de>
Sent: August 14, 2025 11:59
To: users@pdfbox.apache.org
Subject: Re: Getting IOException: expected: 'endstream' actual: '' at offset X

On 14.08.2025 at 16:24, Poisson, David (DGRI) wrote:
> Here are the PDFs in question (didn't want to add 3 PDFs to the email, so
> here's a link to my Google Drive folder that has all 3 PDFs):
> https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
> v3.PDF: conversion result using version 3 of our conversion library,
> works well in PDFBox 1.8.12
> v4.PDF: conversion result using version 4 of our conversion library,
> gives errors in PDFBox
> v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works
> well in PDFBox 1.8.12

I had no trouble doing a text extraction with 1.8.12, 1.8.16 and 1.8.17 on v3 and v4 using pdfbox-app. Makes me wonder if there's either a problem with PDFBox when using an input stream, or if something goes wrong when you read the file (maybe wrong mime type so it's passed as text).

Re the PDF/A problems: Your file is a (correct) PDF/A-2a, and you checked it to be PDF/A-1b, which it isn't.

Checking against conformance level PDF/A-2a
True
Checking against conformance level PDF/A-2b
True
Checking against conformance level PDF/A-2u
True

That's all you need!

Tilman