On 14.02.2025 19:06, Kevin Day wrote:
I'm sure this isn't helpful, but I can't help myself. Inline image streams
are one of the worst aspects of the PDF spec. Just absolutely awful. You
basically have to do a full decode on the embedded stream to figure out
where it ends, because the stream itself could contain the letters EI. So
they took a token oriented state machine parser and shoved a binary stream
operation in the middle of it that breaks the token processing.
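To make the problem concrete, here is a minimal sketch of a scanner that looks for EI delimited by white space. It is not PDFBox code: NaiveEiScan and findEiCandidates are made-up names, and the sample bytes are invented. For one inline image it reports two candidates, and the scan alone cannot tell which one is the real end.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveEiScan
{
    // White-space bytes of PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isPdfWhitespace(int b)
    {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Returns every offset where "EI" has white space before it and white space
    // (or end of data) after it, i.e. every spot a tokenizer could take for the
    // end of an inline image.
    static List<Integer> findEiCandidates(byte[] data)
    {
        List<Integer> hits = new ArrayList<>();
        for (int i = 1; i + 1 < data.length; i++)
        {
            boolean isEi = data[i] == 'E' && data[i + 1] == 'I';
            boolean wsBefore = isPdfWhitespace(data[i - 1] & 0xFF);
            boolean wsAfter = i + 2 == data.length || isPdfWhitespace(data[i + 2] & 0xFF);
            if (isEi && wsBefore && wsAfter)
            {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args)
    {
        // Invented stream: the image data happens to contain " EI " (offset 8),
        // and the genuine EI follows at offset 14.
        byte[] stream = {
            'B', 'I', ' ', 'I', 'D', ' ',
            (byte) 0x9C, ' ', 'E', 'I', ' ', (byte) 0xF3, 0x01, '\n',
            'E', 'I', '\n', 'Q', '\n'
        };
        System.out.println(findEiCandidates(stream)); // prints [8, 14]
    }
}
```

Telling offset 8 from offset 14 needs either a real decode of the image data or a look-ahead heuristic on the bytes that follow each candidate.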

/End Rant

I will be interested in whether that is what is going on here (image stream
happens to contain EI).

Tilman - back when I was writing PDF content stream parsers, actually
doing the decoding was the only way I could find to handle BI/EI reliably.
I'm happy to be a resource if you want someone to bounce ideas around with.

Yes, it's a terrible mess, and the code to survive this isn't nice to look at. It's in hasNoFollowingBinData() in PDFStreamParser.

Tilman







On Fri, Feb 14, 2025, 9:53 AM Petersam, John Contractor
<john.peter...@ssa.gov.invalid> wrote:

Hi Tilman,

First, thanks for the quick response.  I follow several Apache group email
threads and you are by far the fastest responder in any of them.  Actually,
that alone doesn’t do you justice.  You’re a total ace and I (along with
everyone) really appreciate you and what you do for this community.



I followed your instructions and here is the EI and what immediately
follows it:



EI

0.00374532 0 0 0.0243902 0 -79.4878 cm

0 3124 25 45  re

f

0 0 0 rg

0.5 3150.5 2482 63  re

f*

2483 0 0 64 0 3150 cm

1 1 1 rg

/Im0 Do

0.000402739 0 0 0.015625 0 -49.2188 cm

0 3062 25 45  re

f

0 0 0 rg

0.5 3088.5 2482 63  re

f*

2483 0 0 64 0 3088 cm

1 1 1 rg

/Im1 Do
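One possible reading of these bytes against the condition quoted in the original message (endOpIdx - startOpIdx > 3) is sketched below. It rests on two assumptions the thread does not confirm: that startOpIdx and endOpIdx bracket the first whitespace-delimited token after EI, and that a single line feed separates EI from the first number. FirstTokenCheck and firstTokenLength are hypothetical names, not PDFBox API.

```java
import java.nio.charset.StandardCharsets;

final class FirstTokenCheck
{
    // Separator bytes assumed for this sketch: NUL, TAB, LF, CR, SPACE.
    static boolean isSeparator(byte b)
    {
        return b == 0 || b == 9 || b == 0x0A || b == 0x0D || b == 0x20;
    }

    // Length of the first whitespace-delimited token in buf, or -1 if buf
    // holds no token. A token that runs to the end of buf is counted up to
    // the end of buf.
    static int firstTokenLength(byte[] buf)
    {
        int start = -1;
        for (int i = 0; i < buf.length; i++)
        {
            boolean sep = isSeparator(buf[i]);
            if (start == -1 && !sep)
            {
                start = i;
            }
            else if (start != -1 && sep)
            {
                return i - start;
            }
        }
        return start == -1 ? -1 : buf.length - start;
    }

    public static void main(String[] args)
    {
        // First line after EI as posted above; the leading "\n" is an assumption.
        byte[] afterEi = "\n0.00374532 0 0 0.0243902 0 -79.4878 cm"
                .getBytes(StandardCharsets.US_ASCII);
        int len = firstTokenLength(afterEi);
        System.out.println(len + " > 3 ? " + (len > 3)); // prints 10 > 3 ? true
    }
}
```

If that reading is right, an EI followed directly by a long numeric operand such as 0.00374532, rather than by a short operator, would trip the length check even though the EI is genuine.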



Please let me know if there is anything else I can do to help.



Thanks again,

John Petersam



*From:* Tilman Hausherr <thaush...@t-online.de>
*Sent:* Friday, February 14, 2025 11:34 AM
*To:* users@pdfbox.apache.org
*Subject:* [EXTERNAL] Re: ignoring 'EI' warning



Hi,

Maybe these files will work, but others won't. I just tested it, the file
from

https://issues.apache.org/jira/browse/PDFBOX-2163 fails.



hasNoFollowingBinData() tries to find out whether "EI" is followed by
typical PDF operators (then it means the "EI" is "real"), or by binary data
(then it means that this "EI" is itself binary data). You can't share the
documents, but you could try to extract the content stream with PDFDebugger,
then go to the offset mentioned (on Windows use Notepad++) and copy a
few bytes and post them here (make sure not to miss "invisible" characters,
look at the hex codes). Because you said it is better without the 3 lines,
this means that noBinData is true, thus the "EI" was really the end. It
would be interesting to see the next 10 bytes after EI. If you can't share
these bytes, then consider suggesting a code change.
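For anyone who would rather inspect the bytes programmatically than in an editor, a small helper along these lines may do. It is only a sketch: EiLookahead, hexAfter and looksBinary are hypothetical names, and the rule used for "binary" (a byte above 0x7F, or a control byte other than NUL, TAB, LF, CR) is an assumption for illustration, not the exact test in hasNoFollowingBinData().

```java
final class EiLookahead
{
    // Formats up to count bytes of data, starting at offset, as two-digit
    // hex codes separated by spaces, so "invisible" characters show up.
    static String hexAfter(byte[] data, int offset, int count)
    {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, offset + count);
        for (int i = Math.max(0, offset); i < end; i++)
        {
            if (sb.length() > 0)
            {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Assumed rule: a byte above 0x7F, or a control byte other than
    // NUL, TAB, LF or CR, marks the window as binary data.
    static boolean looksBinary(byte[] data, int offset, int count)
    {
        int end = Math.min(data.length, offset + count);
        for (int i = Math.max(0, offset); i < end; i++)
        {
            int b = data[i] & 0xFF;
            boolean allowedControl = b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D;
            if (b > 0x7F || (b < 0x20 && !allowedControl))
            {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args)
    {
        // Invented sample: EI, CR LF, then ordinary content stream text.
        byte[] sample = { 'E', 'I', '\r', '\n', '0', ' ', '0', ' ', 'c', 'm' };
        System.out.println(hexAfter(sample, 2, 10));
        System.out.println(looksBinary(sample, 2, 10));
    }
}
```

For a content stream saved to disk, java.nio.file.Files.readAllBytes could supply the data array, with offset set to the stream offset from the warning.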

Tilman





On 14.02.2025 17:00, Petersam, John Contractor wrote:

Hello,

My system, which is running PDFBox 3.0.3, processes about 325k PDF documents every day, 
so we see a lot of odd things.  Of those, about 600 will have "ignoring 'EI' assumed 
to be in the middle of inline image at stream offset" in their log files.  Most of 
the time, this warning simply results in a random line being displayed and does not 
affect our user experience.  However, sometimes it results in the entirety of our text 
not being displayed.



The documents do display properly in Acrobat, but have the same issue with 
pdf.js as well as the Python libraries I have tried.  In an effort to learn 
more about the issue, I downloaded the source code and determined that the PDFs 
are failing in the hasNoFollowingBinData function in PDFStreamParser.



                 if (endOpIdx != -1 && startOpIdx != -1 && endOpIdx - startOpIdx > 3)
                 {
                     noBinData = false;
                 }



What's interesting is that if I comment those lines out, I am able to render 
the PDF properly in PDFBox, which in turn allows me to create a document which 
can then be rendered by the other libraries.



I would like to know what this code is doing, and what the ramifications are 
for removing it.  After all, it was written for a reason and the other 
libraries appear to have some similar check.



Any help would be greatly appreciated.   Unfortunately, due to PII reasons I 
cannot share the original documents.



Thanks,

John Petersam







