I'm sure this isn't helpful, but I can't help myself. Inline image streams are one of the worst aspects of the PDF spec. Just absolutely awful. You basically have to do a full decode on the embedded stream to figure out where it ends, because the stream itself could contain the letters EI. So they took a token-oriented state machine parser and shoved a binary stream operation into the middle of it, which breaks the token processing.
/End Rant

I will be interested in whether that is what is going on here (the image stream happens to contain EI).

Tilman - back when I was writing PDF content stream parsers, actually doing the decoding was the only way I could find to handle BI/EI reliably. I'm happy to be a resource if you want someone to bounce ideas around with.

On Fri, Feb 14, 2025, 9:53 AM Petersam, John Contractor <john.peter...@ssa.gov.invalid> wrote:

> Hi Tilman,
>
> First, thanks for the quick response. I follow several Apache group email
> threads and you are by far the fastest responder in any of them. Actually,
> that alone doesn't do you justice. You're a total ace and I (along with
> everyone) really appreciate you and what you do for this community.
>
> I followed your instructions and here is the EI and what immediately
> follows it:
>
> EI
> 0.00374532 0 0 0.0243902 0 -79.4878 cm
> 0 3124 25 45 re
> f
> 0 0 0 rg
> 0.5 3150.5 2482 63 re
> f*
> 2483 0 0 64 0 3150 cm
> 1 1 1 rg
> /Im0 Do
> 0.000402739 0 0 0.015625 0 -49.2188 cm
> 0 3062 25 45 re
> f
> 0 0 0 rg
> 0.5 3088.5 2482 63 re
> f*
> 2483 0 0 64 0 3088 cm
> 1 1 1 rg
> /Im1 Do
>
> Please let me know if there is anything else I can do to help.
>
> Thanks again,
> John Petersam
>
> *From:* Tilman Hausherr <thaush...@t-online.de>
> *Sent:* Friday, February 14, 2025 11:34 AM
> *To:* users@pdfbox.apache.org
> *Subject:* [EXTERNAL] Re: ignoring 'EI' warning
>
> Hi,
>
> Maybe these files will work, but others won't. I just tested it, the file
> from https://issues.apache.org/jira/browse/PDFBOX-2163 fails.
>
> hasNoFollowingBinData() tries to find out whether "EI" is followed by
> typical PDF operators (then it means the "EI" is "real"), or by binary data
> (then it means that this "EI" is itself binary data). You can't share the
> documents, but you could try to extract the content stream with PDFDebugger,
> then go to the offset mentioned (on Windows use Notepad++) and copy a
> few bytes and post them here (make sure not to miss "invisible" characters,
> look at the hex codes). Because you said it is better without the 3 lines,
> this means that noBinData is true, thus the "EI" was really the end. It
> would be interesting to see the next 10 bytes after EI. If you can't share
> these bytes, then consider suggesting a code change.
>
> Tilman
>
> On 14.02.2025 17:00, Petersam, John Contractor wrote:
>
> Hello,
>
> My system, which is running PDFBox 3.0.3, processes about 325k PDF documents
> every day, so we see a lot of odd things. Of those, about 600 will have
> "ignoring 'EI' assumed to be in the middle of inline image at stream offset"
> in their log files. Most of the time, this warning simply results in a
> random line being displayed and does not affect our user experience.
> However, sometimes it results in the entirety of our text not being displayed.
>
> The documents do display properly in Acrobat, but have the same issue with
> pdf.js as well as the Python libraries I have tried. In an effort to learn
> more about the issue, I downloaded the source code and determined that the
> PDFs are failing in the hasNoFollowingBinData function in PDFStreamParser.
>
>     if (endOpIdx != -1 && startOpIdx != -1 && endOpIdx - startOpIdx > 3)
>     {
>         noBinData = false;
>     }
>
> What's interesting is that if I comment those lines out, I am able to render
> the PDF properly in PDFBox, which in turn allows me to create a document which
> can then be rendered by the other libraries.
>
> I would like to know what this code is doing, and what the ramifications are
> of removing it. After all, it was written for a reason and the other
> libraries appear to have some similar check.
>
> Any help would be greatly appreciated. Unfortunately, for PII reasons I
> cannot share the original documents.
>
> Thanks,
> John Petersam
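
For anyone following along, here is a minimal sketch of the kind of check described above for hasNoFollowingBinData(): look at a few bytes after a candidate "EI" and decide whether they read like content-stream operators or like binary image data. This is not the PDFBox implementation. The class name, the 32-byte window, and the exact rules (numbers and /names count as operands, an operator is at most 3 characters) are my own assumptions, chosen only to illustrate the idea.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, NOT the PDFBox code: guesses whether a candidate "EI"
// really ends an inline image by inspecting the bytes that follow it.
class InlineImageEndSketch {
    // Assumed look-ahead window; the value is arbitrary for this sketch.
    private static final int WINDOW = 32;

    private static boolean isWs(byte b) {
        return b == 0x20 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D;
    }

    private static boolean isNumber(String t) {
        return t.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** afterEi is the index of the first byte following the "EI" token. */
    public static boolean looksLikeRealEI(byte[] data, int afterEi) {
        int end = Math.min(data.length, afterEi + WINDOW);

        // Step 1: any byte outside printable ASCII / whitespace suggests
        // that we are still inside binary image data.
        for (int i = afterEi; i < end; i++) {
            int b = data[i] & 0xFF;
            boolean printable = b >= 0x20 && b <= 0x7E;
            if (!printable && !isWs(data[i])) {
                return false;
            }
        }

        // Step 2: tokenize the window. Numbers and /names are operands and
        // are skipped; the first other token must be short enough to be an
        // operator (assumed limit: 3 characters, e.g. "cm", "re", "f*").
        String window = new String(data, afterEi, end - afterEi,
                StandardCharsets.US_ASCII);
        String[] tokens = window.trim().split("\\s+");

        // Ignore a last token that was cut off by the window boundary.
        boolean lastCut = end < data.length
                && !isWs(data[end - 1]) && !isWs(data[end]);
        int n = lastCut ? tokens.length - 1 : tokens.length;

        for (int i = 0; i < n; i++) {
            String t = tokens[i];
            if (t.isEmpty() || t.startsWith("/") || isNumber(t)) {
                continue; // operand, keep looking for an operator
            }
            return t.length() <= 3;
        }
        // Only operands seen in the window: treat as text, not binary.
        return true;
    }
}
```

Calling InlineImageEndSketch.looksLikeRealEI(bytes, offsetAfterEI) on the bytes John posted (the "0.00374532 0 0 0.0243902 0 -79.4878 cm" line right after EI) returns true in this sketch, because the leading numbers are skipped as operands instead of being measured as an over-long operator. Whether that matches what the real parser does, or should do, is exactly the open question in this thread.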