Re: pdfobjectstreamparser fail to parse content

mountain the blue Wed, 19 Feb 2025 12:36:24 -0800

Hi Andreas,

Sorry for this delayed response.


re: "Of course. it is better than nothing"

I have uploaded a zip file containing:
1- the raw extract of the /ObjStm object and related stream content
2- the decoded content stream

It is accessible for 6 days at https://filebin.net/8lz0dqbyhmiif1jj

re: "Old version of what major version?..."
Yes, this is a 2.x base
... that borrowed some of the 3.x parsing you were working on, as it
solved numerous issues we were encountering at the time
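
For illustration only, here is a minimal standalone sketch of the
offset-bounded idea described in my first message (quoted below): each
object is read only up to the next object's offset, so two objects
written without a separator are not merged. It does not use any PDFBox
classes, and it is not the actual patch; the names BoundedObjStmSketch
and sliceObjects are made up for this example. It assumes the offsets
in the header are in increasing order.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedObjStmSketch {

    /**
     * Splits a decoded object stream into one raw text slice per object.
     * n is the /N entry (object count); first is the /First entry
     * (byte offset of the first object). Hypothetical helper, not PDFBox API.
     */
    static Map<Integer, String> sliceObjects(byte[] data, int n, int first) {
        // The header holds n pairs of integers: object number, then the
        // object's offset relative to /First.
        String header = new String(data, 0, first, StandardCharsets.ISO_8859_1).trim();
        String[] tokens = header.split("\\s+");
        int[] numbers = new int[n];
        int[] offsets = new int[n];
        for (int i = 0; i < n; i++) {
            numbers[i] = Integer.parseInt(tokens[2 * i]);
            offsets[i] = Integer.parseInt(tokens[2 * i + 1]);
        }

        Map<Integer, String> result = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            int start = first + offsets[i];
            // Upper bound: the next object's offset, or the end of the
            // stream for the last object (assumes increasing offsets).
            int end = (i + 1 < n) ? first + offsets[i + 1] : data.length;
            String raw = new String(data, start, end - start, StandardCharsets.ISO_8859_1);
            result.put(numbers[i], raw.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Same bytes as the failing test: objects "1" and "2" with no separator.
        byte[] data = "6 0 4 1 12".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sliceObjects(data, 2, 8)); // prints {6=1, 4=2}
    }
}
```

Each slice would then be handed to the regular object parser, which
can no longer run past the boundary of the object it is reading.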


On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler <andr...@lehmi.de.invalid>
wrote:

>
>
> On 17.02.25 at 22:16, mountain the blue wrote:
> > hi Andreas,
> >
> > re: 'is there any chance ...'
> > I would have to ask the owner (a company) for authorisation, and I
> > doubt I could have it sent quickly.
> > I can, though, share the actual /ObjStm content, decompressed; let me
> > know if this would help you.
> Of course. it is better than nothing
>
> > re: "which version ..."
> > I am using an old version ... (that I am patching myself) ...
> > I can, however, reproduce it with current code on the trunk branch ...
> > (therefore, the 2 unit tests to exhibit the current behavior)
> Old version of what major version? 2.x or 3.x?
>
>
> > On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler
> <andr...@lehmi.de.invalid>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to get hold of the pdf in question?
> >>
> >> Which version of PDFBox are you using?
> >>
> >> Andreas
> >>
> >> On 17.02.25 at 17:16, mountain the blue wrote:
> >>> hi,
> >>>
> >>> first of all, many thanks to the contributors of the pdfbox project,
> >>> which I have been using for a long time for anything relating to pdf in java.
> >>>
> >>> I am using pdfbox to process various pdf files.
> >>> lately, I received a file whose parsing failed:
> >>> ie:
> >>> ...
> >>> Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
> >>> at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
> >>> ...
> >>>
> >>> Looking further into this error, the cause lies in the parsing of
> >>> /ObjStm, which expects each object serialised in the stream to be
> >>> followed by a separator (i.e. white space), whereas this pdf has
> >>> some COS objects serialised without any such separator.
> >>>
> >>> In the current code base, accessible on GitHub, the following test passes:
> >>>
> >>> @Test
> >>> void testParse2NumberObjects() throws IOException
> >>> {
> >>>       COSStream stream = new COSStream();
> >>>       stream.setItem(COSName.N, COSInteger.TWO);
> >>>       stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>       OutputStream outputStream = stream.createOutputStream();
> >>>       // header "6 0 4 2 " (8 bytes): object 6 at offset 0, object 4 at offset 2
> >>>       outputStream.write("6 0 4 2 1 2".getBytes());
> >>>       outputStream.close();
> >>>       PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>       Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>       assertEquals(2, objectNumbers.size());
> >>>       assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>       assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>> }
> >>>
> >>>
> >>> while this one fails:
> >>>
> >>> @Test
> >>> void testParse2NumberObjectsNoSpace() throws IOException
> >>> {
> >>>       COSStream stream = new COSStream();
> >>>       stream.setItem(COSName.N, COSInteger.TWO);
> >>>       stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>       OutputStream outputStream = stream.createOutputStream();
> >>>       // header "6 0 4 1 " (8 bytes): object 6 at offset 0, object 4 at offset 1
> >>>       outputStream.write("6 0 4 1 12".getBytes());
> >>>       outputStream.close();
> >>>       PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>       Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>       assertEquals(2, objectNumbers.size());
> >>>       assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>       assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>> }
> >>>
> >>> with error:
> >>> org.opentest4j.AssertionFailedError:
> >>> Expected :COSInt{1}
> >>> Actual   :COSInt{12}
> >>>
> >>> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> >>> ...
> >>> at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
> >>> ...
> >>> notes:
> >>>
> >>> a- the second object (number = 4) now has 1 as its offset, so '1' and
> >>> '2' are now 'joined' with no separator between them.
> >>>
> >>> b- the file was created in November last year and converted from
> >>> word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I
> >>> expect to see such (valid) pdf constructs more often in the near future.
> >>>
> >>> @ (Tilman & Andreas): I was able to get pdfbox working by changing the
> >>> PDFObjectStreamParser implementation: I rewrote the
> >>> privateReadObjectOffsets() method to return an array and used a parser
> >>> that does not read beyond the implicit limit given by the next object's
> >>> offset. Let me know if you want access to this change.
> >>>
> >>> thank you,
> >>>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>
> >>
> >
>
>
