Re: Inline image bug with multi-byte newline tokens

Tilman Hausherr Mon, 21 Apr 2025 21:32:17 -0700

Hi,

I approved your JIRA registration. Parsing inline images has been a huge pain for years :-( Your link doesn't work; please attach your PDF to the issue when creating it.


Tilman

On 21.04.2025 22:22, Plate, Ben wrote:

Hello,

I'm writing to report a bug with Apache PDFBox's handling of inline images. 
Specifically, it seems that inline image streams which are written with 
multi-byte whitespace tokens are improperly parsed by PDFBox such that 
whitespace characters are prepended to the start of the image stream. I've 
attached a URL to an example PDF at the bottom. Please let me know if you're 
having trouble accessing it.

When converting the PDF to an image, I get the following error:

Exception occurred while converting page at index [0] 
javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected 
SOI: 0xffd8)

The exception comes from Twelvemonkeys, but there is nothing wrong with the 
image itself or with that library. The image is zlib-compressed and inserted into 
the PDF, but if you open the PDF, extract the zlib-compressed object, and look 
at the hex data at the start of the image, you see the following bytes:

4944 0d0a ffd8

This corresponds to the 'ID' token, followed by a newline, followed by the 
start of the image data, which is a JPEG image as indicated by the SOI 
marker 'ffd8'. However, you'll notice that the newline is 
multi-byte: it's a carriage return followed by a line feed character. According 
to the PDF specification ISO 32000-2:

"The PDF character set is divided into three classes referred to as regular, 
delimiter, and white-space characters. This classification enables the grouping of 
characters into tokens...PDF treats any sequence of consecutive whitespace characters, 
not inside of a string or stream, as one character."

Furthermore, regarding the section on inline images:

"The bytes between the ID operator and a whitespace token, but before the EI operator shall be 
treated the same as a stream object’s data (see 7.3.8, "Stream objects"), even though 
they do not follow the standard stream syntax."

This means that an arbitrary number of whitespace bytes can appear between the 
'ID' token and the start of the image stream according to the PDF 
specification. However, in the PDFStreamParser class, we observe the following 
logic for stripping this whitespace:

if( isWhitespace() )
{
     //pull off the whitespace character
     source.read();
}

This assumes a single-byte newline character and does not properly handle 
multi-byte newlines. As such, in the example the 0d byte gets skipped, but 
the 0a byte does not; it ends up in the imageData byte array, and an exception 
is thrown when the data is passed on to Twelvemonkeys.

The solution here is to simply replace the if statement with a while loop.
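
As a rough illustration of that change, here is a standalone sketch. It is not 
the actual PDFStreamParser code: the class name InlineImageWhitespaceSketch and 
the helpers isPdfWhitespace and skipWhitespaceRun are made up for this example, 
and it reads from a plain java.io.PushbackInputStream rather than PDFBox's own 
source abstraction. It also assumes that the image data itself never begins 
with a white-space byte, since the loop consumes every white-space byte it finds.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class InlineImageWhitespaceSketch {

    // The six PDF white-space bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Hypothetical helper: consume the whole run of white-space bytes
    // (the "while" variant) instead of a single byte (the "if" variant).
    static void skipWhitespaceRun(PushbackInputStream in) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (!isPdfWhitespace(b)) {
                in.unread(b); // first non-white-space byte belongs to the image
                return;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The bytes from the report: 'I' 'D' CR LF ff d8
        byte[] data = {0x49, 0x44, 0x0D, 0x0A, (byte) 0xFF, (byte) 0xD8};
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(data));
        in.read(); // 'I'
        in.read(); // 'D'
        skipWhitespaceRun(in);
        // The next two bytes are now the JPEG SOI marker.
        System.out.printf("%02x%02x%n", in.read(), in.read());
    }
}
```

With the single-byte "if" variant, the same input would leave 0a as the first 
byte of the image data, which matches the "starts with: 0x0aff" error above.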

PDF for bug reproduction:
https://apache-pdfbox-whitespace-inline-image-bug.s3.us-west-2.amazonaws.com/out.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAWRP56XU47ECHCOVT%2F20250421%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20250421T195746Z&X-Amz-Expires=604799&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDwaCXVzLXdlc3QtMiJHMEUCIQC2FdeA%2Bqd6sPDLQvb8r%2BZ1SWJoTg2uvecehUgxRYxepwIgZ0igbA1eQNbfFybkcC0%2FQnXXKE32a%2BPUN2%2Bkak2O1OoqzQQIxf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgw0NDk4OTM1NDczMjEiDNsjzlSpMqflD3C7rSqhBFRKQVYIvKoruaeBcL4MMoBpkMbvovxrX6dzoMPNkbHmVyICSD2tC9Ce%2FwhkVbRJiSro2%2BeHXpAXzl3dUDFgnASnO90QexHVtwls%2BcnUajXod0QrMyktydi7IDuJUNsgredt0IOyasQmj%2FpQRmp5p1QV4lQGOixAqIhm5aOfiOrxBGcgfNuTTAOQFBFkCPBEZUoENHnY3ek8vZQfQAqsOQ95HpK45WvpRMjrCDJ%2FF%2FsDY8SLmx8KfAISDffi2W5xTaS7osHSTzJ52V%2Bv%2BpaB8SbKHm1UWIU3MLRcQ0tO0JxDVhOjIhGMwn5KnNl1Ws0DmLYMxoYR9E9cuh1zMiqZSwhNMmgK5YChIhgm9vwXrPLlOkMH6yvX4x5pk5VFPHx0nUpUYRPn5tV2YpU2aCk%2BR9iZz7AtpvOPrNnIBJcxwlyioPgSrhLoKQs4TVAhZlxlAy%2B3xmiN7%2FBv%2FOGN4%2FL4KRaReSXQMIB5IOLQm9AutEq5XlmhKgnWR%2FoYUG0TOOWVViPGh0USvpFAqj%2Bj8TKO95EcsTo3bGCtu3i9j0Inmq7YnK%2BY2Uysrc5MQXHsEnKCKukK2mppfTpZ6pmYMZy6NJ%2BGPc8ojnnKTLTIniHsHQzdeKXlOZL4JxwW5Xw1prrGddmNDHpBGU7d963xjepagfa1uV7yP9IzrcVaMKWOlLmXfe4Ex95w6eu6DS5s5tDJy1W%2FZTlisbocCOWEBnE870MuMLDAmsAGOqEBaJUk3u9GyIAyrACoJXWAkdoF4p35EUlWdT0RQhEfTTGwP%2BHAN3iGPN6ZP7Sg2csmIByKO2cwAStjCUC6dZQS5m5z%2B8LqHaVpc5sgYjNjhbJuxnlgTIO0k0gm9EjeGBPU7yWyg2g3fCSE9dDvRupK0lMLPOXZU9LV3at5km8b9VAqE1NBBQS5kVXY1EmlQ8dnkmJhPEzpd%2FgsQKQFt10leQA%3D&X-Amz-Signature=7b7dbb1ff10136d3c16597e4214a61778ebf82bc96e8a972d581d626d43741f4




