Yeah, trying to scan for the end of the image data with regular state
machine scanning was never reliable enough (I spent a week trying to make
it work).
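
To show the failure mode, here is a minimal Java sketch of a naive scan. The class and method names are made up for illustration and this is not PDFBox code. It assumes the scanner accepts the first "EI" that has PDF whitespace on both sides, which is roughly what a token-oriented scanner ends up doing:

```java
public class NaiveEiScan {
    /** PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * Returns the index of the first "EI" at or after 'from' that has
     * whitespace on both sides, or -1 if there is none. An "EI" at the very
     * end of the data (no trailing byte) is not matched by this sketch.
     */
    static int findEi(byte[] data, int from) {
        for (int i = Math.max(from, 1); i + 2 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I'
                    && isWhitespace(data[i - 1] & 0xFF)
                    && isWhitespace(data[i + 2] & 0xFF)) {
                return i;
            }
        }
        return -1;
    }
}
```

If the raw image bytes happen to contain the sequence 0x20 'E' 'I' 0x0A, findEi() stops there and the rest of the image is then parsed as operators. Nothing in the bytes themselves distinguishes that match from the real EI.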

On my other project, I wound up actually decoding the embedded image bytes
(i.e. doing a proper image decode). So on the state machine, when I hit a
BI, I immediately started an image decode instead of attempting to parse
arguments and operands.

This resulted in the image data having to be decoded twice (once to
determine its length, then again later during rendering). The second decode
could be avoided by adding a synthetic operand value for the parsed Image,
I suppose, but I decided if someone was using BI/EI at all they deserved
poor performance 😁
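
For what it's worth, the "decode to find the length" idea can be sketched like this. It is only an illustration, not the code from my project or from PDFBox. It assumes the inline image uses a single Flate filter and that the whole content stream is already in memory, and the class and method names are hypothetical. It relies on java.util.zip.Inflater reporting how many compressed bytes it consumed once the stream is finished:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class InlineImageEnd {
    /**
     * Returns the number of bytes occupied by the zlib stream that starts at
     * 'start', or -1 if the bytes do not decode as a complete zlib stream.
     * The decoded pixels are discarded; only the consumed length matters here.
     */
    static int flateLength(byte[] content, int start) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(content, start, content.length - start);
            byte[] sink = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(sink);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return -1; // truncated data, or a preset dictionary we do not have
                }
            }
            // Compressed bytes consumed, including the zlib header and checksum.
            return (int) inflater.getBytesRead();
        } catch (DataFormatException e) {
            return -1; // not valid zlib data
        } finally {
            inflater.end();
        }
    }
}
```

The parser would call flateLength() with the offset of the first data byte after ID, skip that many bytes, and then expect whitespace followed by EI. Other filters (DCT, CCITT, LZW, or no filter at all) each need their own way of working out the length, which this sketch does not cover.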

K


On Fri, Feb 14, 2025, 11:32 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> On 14.02.2025 19:06, Kevin Day wrote:
> > I'm sure this isn't helpful, but I can't help myself. Inline image streams
> > are one of the worst aspects of the PDF spec. Just absolutely awful. You
> > basically have to do a full decode on the embedded stream to figure out
> > where it ends, because the stream itself could contain the letters EI. So
> > they took a token-oriented state machine parser and shoved a binary stream
> > operation in the middle of it that breaks the token processing.
> >
> > /End Rant
> >
> > I will be interested in whether that is what is going on here (image stream
> > happens to contain EI).
> >
> > Tilman - back when I was writing PDF content stream parsers, actually
> > doing the decoding was the only way I could find to handle BI/EI reliably.
> > I'm happy to be a resource if you want someone to bounce ideas around with.
>
> Yes it's a terrible mess and the code to survive this isn't nice to look
> at. It's in hasNoFollowingBinData() in PDFStreamParser.
>
> Tilman
>
>
>
> >
> >
> >
> >
> > On Fri, Feb 14, 2025, 9:53 AM Petersam, John Contractor
> > <john.peter...@ssa.gov.invalid> wrote:
> >
> >> Hi Tilman,
> >>
> >> First, thanks for the quick response.  I follow several Apache group email
> >> threads and you are by far the fastest responder in any of them.  Actually,
> >> that alone doesn’t do you justice.  You’re a total ace and I (along with
> >> everyone) really appreciate you and what you do for this community.
> >>
> >>
> >>
> >> I followed your instructions and here is the EI and what immediately
> >> follows it:
> >>
> >>
> >>
> >> EI
> >>
> >> 0.00374532 0 0 0.0243902 0 -79.4878 cm
> >>
> >> 0 3124 25 45  re
> >>
> >> f
> >>
> >> 0 0 0 rg
> >>
> >> 0.5 3150.5 2482 63  re
> >>
> >> f*
> >>
> >> 2483 0 0 64 0 3150 cm
> >>
> >> 1 1 1 rg
> >>
> >> /Im0 Do
> >>
> >> 0.000402739 0 0 0.015625 0 -49.2188 cm
> >>
> >> 0 3062 25 45  re
> >>
> >> f
> >>
> >> 0 0 0 rg
> >>
> >> 0.5 3088.5 2482 63  re
> >>
> >> f*
> >>
> >> 2483 0 0 64 0 3088 cm
> >>
> >> 1 1 1 rg
> >>
> >> /Im1 Do
> >>
> >>
> >>
> >> Please let me know if there is anything else I can do to help.
> >>
> >>
> >>
> >> Thanks again,
> >>
> >> John Petersam
> >>
> >>
> >>
> >> *From:* Tilman Hausherr <thaush...@t-online.de>
> >> *Sent:* Friday, February 14, 2025 11:34 AM
> >> *To:* users@pdfbox.apache.org
> >> *Subject:* [EXTERNAL] Re: ignoring 'EI' warning
> >>
> >>
> >>
> >> Hi,
> >>
> >> Maybe these files will work, but others won't. I just tested it, the file
> >> from https://issues.apache.org/jira/browse/PDFBOX-2163 fails.
> >>
> >>
> >>
> >> hasNoFollowingBinData() tries to find out whether "EI" is followed by
> >> typical PDF operators (then it means the "EI" is "real"), or by binary data
> >> (then it means that this "EI" is itself binary data). You can't share the
> >> documents, but you could try to extract the content stream with PDFDebugger,
> >> then go to the offset mentioned (on Windows use Notepad++) and copy a
> >> few bytes and post them here (make sure not to miss "invisible" characters,
> >> look at the hex codes). Because you said it is better without the 3 lines,
> >> this means that noBinData is true, thus the "EI" was really the end. It
> >> would be interesting to see the next 10 bytes after EI. If you can't share
> >> these bytes, then consider suggesting a code change.
> >>
> >> Tilman
> >>
> >>
> >>
> >>
> >>
> >> On 14.02.2025 17:00, Petersam, John Contractor wrote:
> >>
> >> Hello,
> >>
> >> My system, which is running PDFBox 3.0.3, processes about 325k PDF
> >> documents every day, so we see a lot of odd things.  Of those, about 600
> >> will have "ignoring 'EI' assumed to be in the middle of inline image at
> >> stream offset" in their log files.  Most of the time, this warning simply
> >> results in a random line being displayed and does not affect our user
> >> experience.  However, sometimes it results in the entirety of our text not
> >> being displayed.
> >>
> >>
> >>
> >> The documents do display properly in Acrobat, but have the same issue
> >> with pdf.js as well as the Python libraries I have tried.  In an effort to
> >> learn more about the issue, I downloaded the source code and determined
> >> that the PDFs are failing in the hasNoFollowingBinData function in
> >> PDFStreamParser.
> >>
> >>
> >>
> >>                  if (endOpIdx != -1 && startOpIdx != -1 && endOpIdx - startOpIdx > 3)
> >>                  {
> >>                      noBinData = false;
> >>                  }
> >>
> >>
> >>
> >> What's interesting is that if I comment those lines out, I am able to
> >> render the PDF properly in PDFBox, which in turn allows me to create a
> >> document which can then be rendered by the other libraries.
> >>
> >>
> >>
> >> I would like to know what this code is doing, and what the
> >> ramifications are for removing it.  After all, it was written for a reason
> >> and the other libraries appear to have some similar check.
> >>
> >>
> >>
> >> Any help would be greatly appreciated.   Unfortunately, due to PII
> >> reasons I cannot share the original documents.
> >>
> >>
> >>
> >> Thanks,
> >>
> >> John Petersam
> >>
> >>
> >>
> >>
> >>
