Re: pdfobjectstreamparser fail to parse content

mountain the blue Fri, 21 Feb 2025 00:51:03 -0800

Hi Andreas,

Currently, I read each object into a byte[] and use those bytes as the
content for a new parser.
For the last referenced object, though, I leave the default stream for reading.
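
A minimal sketch of that slicing idea, assuming the decoded stream is fully in
memory and the header offsets are in ascending order. The class and method
names (ObjStmSlicer, slice) are hypothetical and not part of PDFBox, and for
simplicity the last object is copied as well instead of being left on the
default stream:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: cuts a decoded /ObjStm payload into per-object slices. */
final class ObjStmSlicer
{
    private ObjStmSlicer()
    {
    }

    /**
     * @param decoded the fully decoded object stream (header pairs followed by object data)
     * @param first   the value of /First, i.e. the position where the object data starts
     * @param offsets the offsets from the header, relative to /First, assumed to be ascending
     * @return one byte[] per object, in header order
     */
    static List<byte[]> slice(byte[] decoded, int first, int[] offsets)
    {
        List<byte[]> slices = new ArrayList<>(offsets.length);
        for (int i = 0; i < offsets.length; i++)
        {
            int start = first + offsets[i];
            // an object ends where the next one begins; the last one runs to the end of the stream
            int end = i + 1 < offsets.length ? first + offsets[i + 1] : decoded.length;
            if (start < 0 || end < start || end > decoded.length)
            {
                // broken offsets are rejected here rather than repaired
                throw new IllegalArgumentException("Broken offset for entry " + i);
            }
            slices.add(Arrays.copyOfRange(decoded, start, end));
        }
        return slices;
    }
}
```

With the failing sample "6 0 4 1 12" (/First 8, offsets 0 and 1) this yields
the slices "1" and "2", so a parser fed one slice at a time cannot read "12"
as a single number.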


Let me know if it would help for me to share the update I made: it is currently
on a different branch but, if you prefer, I can apply the changes to the
trunk.
Let me know what would help you most.


On Fri 21 Feb 2025 at 08:28, Andreas Lehmkühler <andr...@lehmi.de.invalid>
wrote:

> Thanks for the sample data. I got your point.
>
> I'm afraid it isn't that easy to fix.
>
> It's easy to split up the stream in pieces based on the start offsets of
> the trailing data. But the parser doesn't work like that, it simply
> reads from a certain starting point and stops at the end of an object.
> And another issue might be that there will be some pdfs with broken
> offsets, so that it might be a bad idea to work with split pieces.
>
> However, let me think about it. Maybe I'll come up with an idea.
>
> Andreas
>
> On 20.02.25 at 10:07, mountain the blue wrote:
> > Hi Andreas,
> >
> > I forgot to tell ...
> >
> > looking at the contained 13679_stream.dat, we see the first offsets for the
> > objects ...
> >
> > 5550 0 5551 5 5552 *7* 5553 *11* 5554 16
> >
> > the actual data starts at offset 1990 (see the /First 1990 in the
> > 13679_objstm.raw).
> > looking at the content at offset 1990 in 13679_stream.dat, we have ...
> >
> > offset - 1990: 0       7         11
> > data  :        ... 3505 4 248*03*505 ...
> >
> > as we see, the bold '0' is the end of object 5552 ... while object 5553
> > starts at bolded '3': no separator between both tokens
> >
> >
> > On Wed, Feb 19, 2025 at 9:36 PM mountain the blue <thebluemount...@gmail.com>
> > wrote:
> >
> >> Hi Andreas,
> >>
> >> Sorry for this delayed response.
> >>
> >> re: "Of course. it is better than nothing"
> >>
> >> I have uploaded a zip file carrying:
> >> 1- the raw extract of the /ObjStm object and related stream content
> >> 2- the decoded content stream
> >>
> >> it is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj
> >>
> >> re: "Old version of what major version?..."
> >> Yes, this is a 2.x base
> >> ... that borrowed some of the 3.x parsing you were working on, as it was
> >> solving numerous issues we were encountering then
> >>
> >>
> >> On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler
> >> <andr...@lehmi.de.invalid> wrote:
> >>
> >>>
> >>>
> >>> On 17.02.25 at 22:16, mountain the blue wrote:
> >>>> hi Andreas,
> >>>>
> >>>> re: 'is there any chance ...'
> >>>> I would have to ask the owner (a company) for authorisation and I doubt
> >>>> I could have it sent quickly.
> >>>> I can, though, share the actual /ObjStm content, decompressed; let me know
> >>>> if this would help you.
> >>> Of course. it is better than nothing
> >>>
> >>>> re: "which version ..."
> >>>> I am using an old version ... (that I am patching myself) ...
> >>>> I can, however, reproduce it with current code on the trunk branch ...
> >>>> (therefore, the 2 unit tests to exhibit the current behavior)
> >>> Old version of what major version? 2.x or 3.x?
> >>>
> >>>
> >>>> On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <andr...@lehmi.de.invalid>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> is there any chance to get hold of the pdf in question?
> >>>>>
> >>>>> Which version of PDFBox are you using?
> >>>>>
> >>>>> Andreas
> >>>>>
> >>>>> On 17.02.25 at 17:16, mountain the blue wrote:
> >>>>>> hi,
> >>>>>>
> >>>>>> first of all, many thanks to the contributors of the pdfbox project, which
> >>>>>> I've been using for a long time for anything relating to pdf in java.
> >>>>>>
> >>>>>> I am using pdfbox to process various pdf files.
> >>>>>> lately, I received a file whose parsing failed:
> >>>>>> ie:
> >>>>>> ...
> >>>>>> Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
> >>>>>> at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
> >>>>>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
> >>>>>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
> >>>>>> ...
> >>>>>>
> >>>>>> Looking further into this error, the cause lies in the parsing of
> >>>>>> /ObjStm, which expects each object serialised in the stream to be followed
> >>>>>> by a separator (i.e. white space), while the pdf has some COS objects
> >>>>>> serialised without any such separation.
> >>>>>>
> >>>>>> in the current code base, accessible on GitHub, the following test passes:
> >>>>>>
> >>>>>> @Test
> >>>>>> void testParse2NumberObjects() throws IOException
> >>>>>> {
> >>>>>>        COSStream stream = new COSStream();
> >>>>>>        stream.setItem(COSName.N, COSInteger.TWO);
> >>>>>>        stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>>>>        OutputStream outputStream = stream.createOutputStream();
> >>>>>>        // header "6 0 4 2 " is 8 bytes (/First): object 6 at offset 0, object 4 at offset 2
> >>>>>>        outputStream.write("6 0 4 2 1 2".getBytes());
> >>>>>>        outputStream.close();
> >>>>>>        PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>>>>        Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>>>>        assertEquals(2, objectNumbers.size());
> >>>>>>        assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>>>>        assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> while this one fails:
> >>>>>>
> >>>>>> @Test
> >>>>>> void testParse2NumberObjectsNoSpace() throws IOException
> >>>>>> {
> >>>>>>        COSStream stream = new COSStream();
> >>>>>>        stream.setItem(COSName.N, COSInteger.TWO);
> >>>>>>        stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>>>>        OutputStream outputStream = stream.createOutputStream();
> >>>>>>        // header "6 0 4 1 ": object 4 now starts at offset 1, directly after the '1' of object 6
> >>>>>>        outputStream.write("6 0 4 1 12".getBytes());
> >>>>>>        outputStream.close();
> >>>>>>        PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>>>>        Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>>>>        assertEquals(2, objectNumbers.size());
> >>>>>>        assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>>>>        assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>>>>> }
> >>>>>>
> >>>>>> with error:
> >>>>>> org.opentest4j.AssertionFailedError:
> >>>>>> Expected :COSInt{1}
> >>>>>> Actual   :COSInt{12}
> >>>>>>
> >>>>>> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> >>>>>> ...
> >>>>>> at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
> >>>>>> ...
> >>>>>> notes:
> >>>>>>
> >>>>>> a- the second object (number = 4) now indicates 1 as its offset, and both
> >>>>>> '1' and '2' are now 'joined'.
> >>>>>>
> >>>>>> b- the file was created in November last year and converted from
> >>>>>> word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I do
> >>>>>> expect to see such (valid) pdf constructions more often in the (near) future.
> >>>>>>
> >>>>>> @ (Tilman & Andreas): I was able to get pdfbox working by changing the
> >>>>>> PDFObjectStreamParser implementation: I rewrote the
> >>>>>> privateReadObjectOffsets() method to return an array and used a parser
> >>>>>> that does not parse beyond the implicit limit given by the next object's
> >>>>>> offset. let me know if you want access to this change.
> >>>>>>
> >>>>>> thank you,
> >>>>>>
> >>>>>
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >
>
>
