On 14.08.2025 at 16:24, Poisson, David (DGRI) wrote:
Here are the PDFs in question (I didn't want to attach 3 PDFs to the email, so
here's a link to a Google Drive folder that has all 3 PDFs):
https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
v3.PDF: conversion result using version 3 of our conversion library, works well
in PDFBox 1.8.12
v4.PDF: conversion result using version 4 of our conversion library, gives
errors in PDFBox
v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works well in
PDFBox 1.8.12
I had no trouble extracting text from v3 and v4 with 1.8.12, 1.8.16 and
1.8.17 using pdfbox-app. That makes me wonder whether there's a problem
with PDFBox when using an input stream, or whether something goes wrong
when you read the file (maybe a wrong MIME type, so it's passed as text).
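One way to test the "passed as text" suspicion is to look at the raw bytes your application actually hands to PDFBox. The following is only a minimal sketch using the JDK alone; the class and method names (PdfStreamCheck, startsWithPdfHeader, countReplacementChars) are hypothetical and not part of PDFBox. It assumes that binary data which was decoded as UTF-8 text and written out again will contain the byte sequence EF BF BD (the UTF-8 encoding of U+FFFD) wherever an invalid byte was replaced.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of PDFBox.
final class PdfStreamCheck {

    /** Reads the whole stream into memory (acceptable for a one-off diagnostic). */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** True if the data begins with the PDF header marker "%PDF-". */
    static boolean startsWithPdfHeader(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    /** Counts EF BF BD sequences, i.e. UTF-8 encoded U+FFFD replacement characters. */
    static int countReplacementChars(byte[] data) {
        int count = 0;
        for (int i = 0; i + 2 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xEF
                    && (data[i + 1] & 0xFF) == 0xBF
                    && (data[i + 2] & 0xFF) == 0xBD) {
                count++;
                i += 2; // skip the rest of this sequence
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfStreamCheck <file.pdf>");
            return;
        }
        byte[] data;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            data = readAll(in);
        }
        System.out.println("starts with %PDF-: " + startsWithPdfHeader(data));
        System.out.println("EF BF BD sequences: " + countReplacementChars(data));
    }
}
```

In your application you would pass the same input stream you give to PDFBox through readAll and compare the result with the file on disk. A large number of EF BF BD sequences, or a byte count that differs from the original file, would point to a text conversion somewhere on the way; a handful of hits is not conclusive, since compressed streams can contain that sequence by chance.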
Re the PDF/A problems:
Your file is a (correct) PDF/A-2a, and you validated it against PDF/A-1b,
which it isn't.
Checking against conformance level PDF/A-2a
True
Checking against conformance level PDF/A-2b
True
Checking against conformance level PDF/A-2u
True
That's all you need!
Tilman