Hello,
My system, which is running PDFBox 3.0.3, processes about 325k PDF documents 
every day, so we see a lot of odd things.  Of those, about 600 will have 
"ignoring 'EI' assumed to be in the middle of inline image at stream offset" in 
their log files.  Most of the time, this warning simply results in a stray 
line being displayed and does not affect our user experience.  However, 
sometimes it results in none of the text being displayed.

The documents do display properly in Acrobat, but have the same issue with 
pdf.js as well as the Python libraries I have tried.  In an effort to learn 
more about the issue, I downloaded the source code and determined that the PDFs 
are failing in the hasNoFollowingBinData function in PDFStreamParser.

                if (endOpIdx != -1 && startOpIdx != -1 && endOpIdx - startOpIdx > 3)
                {
                    noBinData = false;
                }

What's interesting is that if I comment those lines out, I am able to render 
the PDF properly in PDFBox, which in turn allows me to create a document that 
can then be rendered by the other libraries.

I would like to know what this code is doing, and what the ramifications are 
for removing it.  After all, it was written for a reason and the other 
libraries appear to have some similar check.
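
To make the question concrete, here is how I currently read that condition, 
written as a standalone sketch rather than the actual PDFBox code.  The class 
and method names are my own, and I am assuming that the indices mark the first 
non-whitespace token in the bytes that follow "EI" and that anything longer 
than 3 bytes is taken not to be a PDF operator.  Please correct me if that 
reading is wrong.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in PDFBox.
public class EiLookaheadSketch {

    // Assumption drawn from the "> 3" in the snippet: operators are at most 3 bytes.
    public static final int MAX_OPERATOR_LENGTH = 3;

    // The six whitespace characters defined by the PDF specification.
    public static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Returns true when the first token after "EI" is short enough to be an operator.
    public static boolean firstTokenFitsOperator(byte[] lookahead) {
        int start = -1;
        int end = -1;
        for (int i = 0; i < lookahead.length; i++) {
            boolean ws = isPdfWhitespace(lookahead[i]);
            if (start == -1 && !ws) {
                start = i;              // first non-whitespace byte
            } else if (start != -1 && ws) {
                end = i;                // first whitespace after the token
                break;
            }
        }
        if (start == -1) {
            return true;                // only whitespace, nothing to judge
        }
        if (end == -1) {
            end = lookahead.length;     // token runs to the end of the window
        }
        return end - start <= MAX_OPERATOR_LENGTH;
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "Q" directly after the image data: accepted as a plausible operator.
        System.out.println(firstTokenFitsOperator(ascii("\nQ\nq 1 0 0")));
        // A five-character number after the image data: rejected by the length test.
        System.out.println(firstTokenFitsOperator(ascii("\n0.125 0 0 ")));
    }
}
```

If that reading is right, an "EI" followed by ordinary content whose first 
token is longer than 3 bytes would be treated as still being inside the image 
data.  I do not know whether that is what is happening in my files.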

Any help would be greatly appreciated.  Unfortunately, I cannot share the 
original documents because they contain PII.

Thanks,
John Petersam
