Re: pdfobjectstreamparser fail to parse content

Andreas Lehmkühler Thu, 20 Feb 2025 23:28:29 -0800

Thanks for the sample data. I got your point.

I'm afraid it isn't that easy to fix.

It's easy to split up the stream in pieces based on the start offsets of the trailing data. But the parser doesn't work like that; it simply reads from a certain starting point and stops at the end of an object. Another issue might be that there will be some PDFs with broken offsets, so it might be a bad idea to work with split pieces.
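The concern about broken offsets can be made concrete with a small check. This is only a sketch of one possible policy, not PDFBox code: the class `OffsetCheck` and the method `isUsable` are hypothetical names, and the rule applied (offsets must be strictly increasing and lie inside the data) is an assumption about what a "sane" offset table looks like.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class OffsetCheck {

    // Returns true only if every offset is non-negative, strictly greater than
    // the previous one, and points inside the data region of the given length.
    static boolean isUsable(int[] offsets, int dataLength) {
        int previous = -1;
        for (int offset : offsets) {
            if (offset <= previous || offset >= dataLength) {
                return false;
            }
            previous = offset;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isUsable(new int[] {0, 1}, 2));  // true
        System.out.println(isUsable(new int[] {0, 5}, 2));  // false: 5 is outside the data
        System.out.println(isUsable(new int[] {3, 1}, 10)); // false: not increasing
    }
}
```

A parser could split by offsets only when such a check passes and otherwise keep reading sequentially as it does today.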


However, let me think about it. Maybe I'll come up with an idea.

Andreas

On 20.02.25 at 10:07, mountain the blue wrote:

Hi Andreas,

I forgot to tell ...

looking at the contained 13679_stream.dat, we see the first offsets for the
objects ...

5550 0 5551 5 5552 *7* 5553 *11* 5554 16

the actual data starts at offset 1990 (see the /First 1990 in the
13679_objstm.raw).
looking at the content at offset 1990 in 13679_stream.dat, we have ...

offset - 1990: 0       7         11
data  :        ... 3505 4 248*03*505 ...

as we see, the bold '0' is the end of object 5552 ... while object 5553
starts at the bold '3': there is no separator between the two tokens
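The arithmetic behind these numbers can be sketched as follows. The class `ObjStmOffsets` and its method are hypothetical and only illustrate the calculation; the header pairs and the /First value of 1990 are taken from the sample above. Each absolute start is /First plus the relative offset, and the length of an object is the distance to the next start, which gives 4 bytes for object 5552 and agrees with the four characters "2480" in the excerpt.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class ObjStmOffsets {

    // Turns "objNum offset objNum offset ..." into rows of {objNum, absoluteStart},
    // where absoluteStart = first + offset.
    static long[][] absoluteStarts(String header, long first) {
        String[] tokens = header.trim().split("\\s+");
        long[][] rows = new long[tokens.length / 2][2];
        for (int i = 0; i < rows.length; i++) {
            rows[i][0] = Long.parseLong(tokens[2 * i]);
            rows[i][1] = first + Long.parseLong(tokens[2 * i + 1]);
        }
        return rows;
    }

    public static void main(String[] args) {
        long[][] rows = absoluteStarts("5550 0 5551 5 5552 7 5553 11 5554 16", 1990);
        for (int i = 0; i < rows.length; i++) {
            // The last pair shown has no successor in the excerpt, so its length is unknown here.
            String length = (i + 1 < rows.length)
                    ? String.valueOf(rows[i + 1][1] - rows[i][1])
                    : "unknown (next offset not shown)";
            System.out.println(rows[i][0] + " starts at " + rows[i][1] + ", length " + length);
        }
    }
}
```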


On Wed, Feb 19, 2025 at 9:36 PM mountain the blue <[email protected]>
wrote:

Hi Andreas,

Sorry for this delayed response.

re: "Of course. it is better than nothing"

I have uploaded a zip file carrying:
1- the raw extract of the /ObjStm object and related stream content
2- the decoded content stream

it is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj

re: "Old version of what major version?..."
Yes, this is a 2.x base
... that borrowed some of the 3.x parsing you were working on as it was
solving numerous issues we were encountering then


On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler
<[email protected]> wrote:



On 17.02.25 at 22:16, mountain the blue wrote:

hi Andreas,

re: 'is there any chance ...'
I would have to ask the owner (a company) for authorisation and I doubt
I could have it sent quickly.
I can, though, share the actual /ObjStm content, decompressed; let me know
if this would help you.

Of course. it is better than nothing

re: "which version ..."
I am using an old version ... (that I am patching myself) ...
I can, however, reproduce it with current code on the trunk branch ...
(therefore, the 2 unit tests to exhibit the current behavior)

Old version of what major version? 2.x or 3.x?

On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <[email protected]> wrote:

Hi,

is there any chance to get hold of the PDF in question?

Which version of PDFBox are you using?

Andreas

On 17.02.25 at 17:16, mountain the blue wrote:

hi,

first of all, many thanks to the contributors of the pdfbox project, which
I've been using for a long time for anything relating to pdf in java.

I am using pdfbox to process various pdf files.
lately, I received a file whose parsing failed:
ie:
...
Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)

at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
...

Looking further into this error, the cause lies in the parsing of
/ObjStm, which expects each object serialised in the stream to be
followed by a separator (i.e. white space), while the pdf had some COS
objects serialised without such separation.
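To illustrate the effect outside of PDFBox, here is a minimal, self-contained sketch. The class `GreedyIntReader` is hypothetical and is not the PDFBox parser; it only mimics a reader that starts at a given offset and keeps consuming digits until it meets a non-digit, which is the behaviour that joins two adjacent numbers when no separator is present.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a sequential reader; not the PDFBox implementation.
final class GreedyIntReader {

    // Reads an unsigned integer starting at 'start', consuming digits until a
    // non-digit byte or the end of the data is reached.
    static long readInt(byte[] data, int start) {
        long value = 0;
        int pos = start;
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            value = value * 10 + (data[pos] - '0');
            pos++;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] separated = "1 2".getBytes(StandardCharsets.US_ASCII);
        byte[] joined = "12".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readInt(separated, 0)); // 1: the space ends the first object
        System.out.println(readInt(joined, 0));    // 12: nothing stops the reader after '1'
        System.out.println(readInt(joined, 1));    // 2: the second object at offset 1
    }
}
```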

in the current code base, accessible on GitHub, the following test passes:


@Test
void testParse2NumberObjects() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // header "6 0 4 2 " is 8 bytes (/First 8); data "1 2": object 6 at offset 0, object 4 at offset 2
    outputStream.write("6 0 4 2 1 2".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}


while this one fails:

@Test
void testParse2NumberObjectsNoSpace() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // header "6 0 4 1 " is 8 bytes (/First 8); data "12": object 6 at offset 0, object 4 at offset 1, no separator
    outputStream.write("6 0 4 1 12".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}

with error:
org.opentest4j.AssertionFailedError:
Expected :COSInt{*1*}
Actual   :COSInt{*12*}

at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
...
at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)

...
notes:

a- the second object (number = 4) now indicates 1 as its offset, and both
'1' and '2' are now 'joined'.

b- the file was created in November last year and converted from word to
pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': do expect to
see such (valid) pdf construction more often in the (near) future.


@ (Tilman & Andreas): I was able to get pdfbox working by changing the
PDFObjectStreamParser implementation, rewriting the
privateReadObjectOffsets() method to return an array and using a parser
that does not parse beyond the implicit limit given by the next object's
offset. let me know if you want access to this change.
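The idea of bounding each object by the next object's offset can be sketched as below. This is not the patch mentioned above and not PDFBox code: `BoundedSlices` is a hypothetical class, and it assumes the offsets are already available as an array sorted in increasing order, relative to the start of the object data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class BoundedSlices {

    // Cuts the data into one piece per object: object i spans from offsets[i]
    // up to (but excluding) offsets[i + 1]; the last object runs to the end.
    // Assumes offsets are increasing and within the data.
    static List<String> slice(byte[] data, int[] offsets) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i];
            int end = (i + 1 < offsets.length) ? offsets[i + 1] : data.length;
            parts.add(new String(data, start, end - start, StandardCharsets.US_ASCII).trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        byte[] joined = "12".getBytes(StandardCharsets.US_ASCII);
        System.out.println(slice(joined, new int[] {0, 1})); // [1, 2]
    }
}
```

Each piece could then be handed to a parser on its own, so that reading object 6 can never run into the bytes of object 4.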

thank you,



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


