Hi,
Maybe these files will work, but others won't. I just tested it: the
file from
https://issues.apache.org/jira/browse/PDFBOX-2163 fails.
hasNoFollowingBinData() tries to find out whether "EI" is followed by
typical PDF operators (then it means the "EI" is "real"), or by binary
data (then it means that this "EI" is itself binary data). You can't
share the documents, but you could try to extract the content stream
with PDFDebugger, then go to the offset mentioned (on Windows, use Notepad++) and copy a
few bytes and post them here (make sure not to miss "invisible"
characters, look at the hex codes). Since you said rendering works better
without those 3 lines, noBinData stays true in that case, so the "EI"
really was the end of the inline image. It would be interesting to see the next 10 bytes
after EI. If you can't share these bytes, then consider suggesting a
code change.
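
If copying the bytes by hand is awkward, a small throwaway helper along these lines could collect them. This is only a sketch and not PDFBox code: the class name EiProbe and both method names are made up for illustration. It assumes you already have the extracted content stream as a byte array and that eiOffset points at the 'E' of the suspicious "EI". The second method reports the length of the first non-whitespace token after "EI", on the assumption that this is roughly what the "endOpIdx - startOpIdx > 3" comparison looks at.

```java
import java.nio.charset.StandardCharsets;

class EiProbe {
    // PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE
    static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Hex codes of up to `count` bytes that follow the two-byte "EI" at eiOffset.
    static String hexAfterEI(byte[] stream, int eiOffset, int count) {
        StringBuilder sb = new StringBuilder();
        int start = eiOffset + 2;
        int end = Math.min(stream.length, start + count);
        for (int i = start; i < end; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", stream[i] & 0xFF));
        }
        return sb.toString();
    }

    // Length of the first non-whitespace token after "EI", or -1 if there is none.
    static int firstTokenLengthAfterEI(byte[] stream, int eiOffset) {
        int i = eiOffset + 2;
        while (i < stream.length && isWhitespace(stream[i])) {
            i++;
        }
        if (i >= stream.length) {
            return -1;
        }
        int tokenStart = i;
        while (i < stream.length && !isWhitespace(stream[i])) {
            i++;
        }
        return i - tokenStart;
    }

    public static void main(String[] args) {
        // Made-up demo data, not taken from any real document.
        byte[] demo = "ID xx EI\nQ\nBT /F1 12 Tf".getBytes(StandardCharsets.US_ASCII);
        int eiOffset = 6; // position of the 'E' in "EI"
        System.out.println(hexAfterEI(demo, eiOffset, 10)); // 0A 51 0A 42 54 20 2F 46 31 20
        System.out.println(firstTokenLengthAfterEI(demo, eiOffset)); // 1 (the "Q" operator)
    }
}
```

A hex dump like the first line of output contains no document text beyond those few bytes, so it may be easier to share than the PDF itself. A token length above 3 right after "EI" would be consistent with the quoted check setting noBinData to false.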
Tilman
On 14.02.2025 17:00, Petersam, John Contractor wrote:
Hello,
My system, which is running PDFBox 3.0.3, processes about 325k PDF documents every day,
so we see a lot of odd things. Of those, about 600 will have "ignoring 'EI' assumed
to be in the middle of inline image at stream offset" in their log files. Most of
the time, this warning simply results in a random line being displayed and does not
affect our user experience. However, sometimes it results in the entirety of our text
not being displayed.
The documents do display properly in Acrobat, but have the same issue with
pdf.js as well as the Python libraries I have tried. In an effort to learn
more about the issue, I downloaded the source code and determined that the PDFs
are failing in the hasNoFollowingBinData function in PDFStreamParser.
    if (endOpIdx != -1 && startOpIdx != -1 && endOpIdx - startOpIdx > 3)
    {
        noBinData = false;
    }
What's interesting is that if I comment those lines out, I am able to render
the PDF properly in PDFBox, which in turn allows me to create a document which
can then be rendered by the other libraries.
I would like to know what this code is doing, and what the ramifications are
for removing it. After all, it was written for a reason and the other
libraries appear to have some similar check.
Any help would be greatly appreciated. Unfortunately, due to PII reasons I
cannot share the original documents.
Thanks,
John Petersam