I'm sure this isn't helpful, but I can't help myself. Inline image streams
are one of the worst aspects of the PDF spec. Just absolutely awful. You
basically have to do a full decode on the embedded stream to figure out
where it ends, because the stream itself could contain the letters EI. So
they took a token-oriented state machine parser and shoved a binary stream
operation in the middle of it that breaks the token processing.

/End Rant

I will be interested to see whether that is what is going on here (the image
stream happens to contain EI).
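
To make the failure mode concrete, here is a minimal Java sketch. It is not
PDFBox code: the class NaiveEiScan, its methods, and the sample bytes are all
made up for illustration. It scans for the first whitespace-delimited "EI"
after the ID operator and stops inside the pixel data, because the four sample
bytes of a 4x1 grayscale image happen to spell " EI\n".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from PDFBox.
class NaiveEiScan {
    // Everything up to and including the single whitespace byte after ID.
    static final String HEADER = "BI /W 4 /H 1 /BPC 8 /CS /G ID ";

    // PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Index of the first "EI" at or after 'from' that has whitespace before it
    // and whitespace (or end of data) after it; -1 if there is none.
    static int findFirstEi(byte[] data, int from) {
        for (int i = Math.max(from, 1); i + 1 < data.length; i++) {
            boolean isEi = data[i] == 'E' && data[i + 1] == 'I';
            boolean wsBefore = isWhitespace(data[i - 1] & 0xFF);
            boolean wsAfter = i + 2 == data.length || isWhitespace(data[i + 2] & 0xFF);
            if (isEi && wsBefore && wsAfter) {
                return i;
            }
        }
        return -1;
    }

    // Four pixel bytes 0x20 0x45 0x49 0x0A, followed by the real EI operator.
    static byte[] sample() {
        String s = HEADER + " EI\n" + "\nEI\nQ";
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = sample();
        int dataStart = HEADER.length();
        System.out.println("pixel data starts at " + dataStart);
        System.out.println("real EI operator at  " + (dataStart + 5));
        System.out.println("naive scan stops at  " + findFirstEi(data, dataStart));
    }
}
```

The naive scan reports a position inside the image data, so everything after
it gets tokenized as operators. That is the situation any "is this EI real?"
heuristic has to guess its way out of.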

Tilman - back when I was writing PDF content stream parsers, actually
doing the decoding was the only way I could find to handle BI/EI reliably.
I'm happy to be a resource if you want someone to bounce ideas around with.
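
As a rough sketch of what "actually doing the decoding" can look like, the
Java below works out where the image data ends from the data itself instead of
searching for EI. It assumes only two simple cases: an unfiltered image, whose
length follows from the dictionary values, and a single Flate-encoded stream,
where java.util.zip.Inflater reports how many input bytes it consumed. The
class and method names are hypothetical, this is not how PDFBox does it, and
other filters (DCT, CCITT, LZW, filter chains) would each need their own
decoder.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; not taken from PDFBox.
class InlineImageLength {
    // Unfiltered image: every row is padded to a whole number of bytes.
    static int rawLength(int width, int height, int bitsPerComponent, int components) {
        int bytesPerRow = (width * components * bitsPerComponent + 7) / 8;
        return bytesPerRow * height;
    }

    // Flate image: inflate one complete zlib stream starting at 'offset' and
    // return the number of input bytes it consumed, or -1 if the data is
    // truncated or is not a valid zlib stream.
    static int flateLength(byte[] buf, int offset) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(buf, offset, buf.length - offset);
            byte[] sink = new byte[4096]; // decoded bytes are discarded
            while (!inflater.finished()) {
                int n = inflater.inflate(sink);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return -1;
                }
            }
            return (int) inflater.getBytesRead();
        } catch (DataFormatException e) {
            return -1;
        } finally {
            inflater.end();
        }
    }

    // Helper used only to build test input for flateLength.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

With the length in hand, the parser can skip straight past the image bytes and
only then expect whitespace followed by EI, so an "EI" inside the data never
reaches the tokenizer.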




On Fri, Feb 14, 2025, 9:53 AM Petersam, John Contractor
<john.peter...@ssa.gov.invalid> wrote:

> Hi Tilman,
>
> First, thanks for the quick response.  I follow several Apache group email
> threads and you are by far the fastest responder in any of them.  Actually,
> that alone doesn’t do you justice.  You’re a total ace and I (along with
> everyone) really appreciate you and what you do for this community.
>
>
>
> I followed your instructions and here is the EI and what immediately
> follows it:
>
>
>
> EI
>
> 0.00374532 0 0 0.0243902 0 -79.4878 cm
>
> 0 3124 25 45  re
>
> f
>
> 0 0 0 rg
>
> 0.5 3150.5 2482 63  re
>
> f*
>
> 2483 0 0 64 0 3150 cm
>
> 1 1 1 rg
>
> /Im0 Do
>
> 0.000402739 0 0 0.015625 0 -49.2188 cm
>
> 0 3062 25 45  re
>
> f
>
> 0 0 0 rg
>
> 0.5 3088.5 2482 63  re
>
> f*
>
> 2483 0 0 64 0 3088 cm
>
> 1 1 1 rg
>
> /Im1 Do
>
>
>
> Please let me know if there is anything else I can do to help.
>
>
>
> Thanks again,
>
> John Petersam
>
>
>
> *From:* Tilman Hausherr <thaush...@t-online.de>
> *Sent:* Friday, February 14, 2025 11:34 AM
> *To:* users@pdfbox.apache.org
> *Subject:* [EXTERNAL] Re: ignoring 'EI' warning
>
>
>
> Hi,
>
> Maybe these files will work, but others won't. I just tested it, the file
> from
>
> https://issues.apache.org/jira/browse/PDFBOX-2163 fails.
>
>
>
> hasNoFollowingBinData() tries to find out whether "EI" is followed by
> typical PDF operators (then it means the "EI" is "real"), or by binary data
> (then it means that this "EI" is itself binary data). You can't share the
> documents, but you could try to extract the content stream with PDFDebugger,
> then go to the offset mentioned (on Windows use Notepad++) and copy a
> few bytes and post them here (make sure not to miss "invisible" characters,
> look at the hex codes). Because you said it is better without the 3 lines,
> this means that noBinData is true, thus the "EI" was really the end. It
> would be interesting to see the next 10 bytes after EI. If you can't share
> these bytes, then consider suggesting a code change.
>
> Tilman
>
>
>
>
>
> On 14.02.2025 17:00, Petersam, John Contractor wrote:
>
> Hello,
>
> My system, which is running PDFBox 3.0.3, processes about 325k PDF documents 
> every day, so we see a lot of odd things.  Of those, about 600 will have 
> "ignoring 'EI' assumed to be in the middle of inline image at stream offset" 
> in their log files.  Most of the time, this warning simply results in a 
> random line being displayed and does not affect our user experience.  
> However, sometimes it results in the entirety of our text not being displayed.
>
>
>
> The documents do display properly in Acrobat, but have the same issue with 
> pdf.js as well as the Python libraries I have tried.  In an effort to learn 
> more about the issue, I downloaded the source code and determined that the 
> PDFs are failing in the hasNoFollowingBinData function in PDFStreamParser.
>
>
>
>                 if (endOpIdx != -1 && startOpIdx != -1 && endOpIdx - startOpIdx > 3)
>                 {
>                     noBinData = false;
>                 }
>
>
>
> What's interesting is that if I comment those lines out, I am able to render 
> the PDF properly in PDFBox, which in turn allows me to create a document which 
> can then be rendered by the other libraries.
>
>
>
> I would like to know what this code is doing, and what the ramifications are 
> for removing it.  After all, it was written for a reason and the other 
> libraries appear to have some similar check.
>
>
>
> Any help would be greatly appreciated.   Unfortunately, due to PII reasons I 
> cannot share the original documents.
>
>
>
> Thanks,
>
> John Petersam
>
>
>
>
>
