Re: pdfobjectstreamparser fail to parse content

mountain the blue Mon, 17 Feb 2025 13:27:08 -0800

hi Andreas,

re: 'is there any chance ...'
I would have to ask the owner (a company) for authorisation, and I doubt
I could get it quickly.
I can, though, share the actual /ObjStm content, decompressed; let me know
if this would help you.


re: "which version ..."
I am using an old version ... (that I am patching myself) ...
I can, however, reproduce it with the current code on the trunk branch ...
(hence the 2 unit tests, which exhibit the current behavior)
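
To illustrate the idea behind the workaround described in my original message
(not reading an object beyond the next object's offset), here is a minimal
sketch in plain Java. It is not the PDFBox implementation and not my actual
patch: the class ObjStmSlicer and its slice() method are hypothetical names,
it works on already decompressed /ObjStm bytes, it returns the raw text of each
object instead of COS objects, and it assumes the offsets in the header are
listed in increasing order.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of PDFBox).
final class ObjStmSlicer
{
    /**
     * Splits decompressed object stream data into one text slice per object.
     *
     * @param data  the decompressed stream content
     * @param n     the value of /N (number of objects)
     * @param first the value of /First (byte offset of the first object)
     * @return object number mapped to the raw text of that object
     */
    static Map<Integer, String> slice(byte[] data, int n, int first)
    {
        // The header holds n pairs "objectNumber offset", separated by white space.
        String header = new String(data, 0, first, StandardCharsets.US_ASCII).trim();
        String[] tokens = header.isEmpty() ? new String[0] : header.split("\\s+");
        if (tokens.length < 2 * n)
        {
            throw new IllegalArgumentException("header too short for " + n + " objects");
        }
        int[] numbers = new int[n];
        int[] offsets = new int[n];
        for (int i = 0; i < n; i++)
        {
            numbers[i] = Integer.parseInt(tokens[2 * i]);
            offsets[i] = Integer.parseInt(tokens[2 * i + 1]);
        }

        Map<Integer, String> result = new LinkedHashMap<>();
        for (int i = 0; i < n; i++)
        {
            // An object ends where the next one starts (assumes increasing offsets);
            // the last object ends at the end of the data.
            int start = first + offsets[i];
            int end = (i + 1 < n) ? first + offsets[i + 1] : data.length;
            String text = new String(data, start, end - start, StandardCharsets.US_ASCII);
            result.put(numbers[i], text.trim());
        }
        return result;
    }
}
```

With the content of the failing test ("6 0 4 1 12", /N 2, /First 8), the slice
for object 6 is "1" and the slice for object 4 is "2", because the first object
is cut off at offset 1 instead of being read up to the next white space.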


On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <andr...@lehmi.de.invalid>
wrote:

> Hi,
>
> is there any chance to get hold of the PDF in question?
>
> Which version of PDFBox are you using?
>
> Andreas
>
> On 17.02.25 at 17:16, mountain the blue wrote:
> > hi,
> >
> > first of all, many thanks to the contributors of the pdfbox project, which
> > I've been using for a long time for anything relating to pdf in java.
> >
> > I am using pdfbox to process various pdf files.
> > lately, I received a file whose parsing failed:
> > ie:
> > ...
> > Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
> > at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
> > at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
> > at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
> > ...
> >
> > Looking further into this error, the cause turned out to be the parsing of
> > /ObjStm ..., which expects each object serialised in the stream to be
> > followed by a separator (i.e. white space), while the pdf had some COS
> > objects serialised without such a separator.
> >
> > in the current code base, accessible on GitHub, the following test passes:
> >
> > @Test
> > void testParse2NumberObjects () throws IOException
> > {
> >      COSStream stream = new COSStream();
> >      stream.setItem(COSName.N, COSInteger.TWO);
> >      stream.setItem(COSName.FIRST, COSInteger.get(8));
> >      OutputStream outputStream = stream.createOutputStream();
> >      outputStream.write("6 0 4 2 1 2".getBytes());
> >      outputStream.close();
> >      PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >      Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >      assertEquals(2, objectNumbers.size());
> >      assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >      assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> > }
> >
> >
> > while this one fails:
> >
> > @Test
> > void testParse2NumberObjectsNoSpace () throws IOException
> > {
> >      COSStream stream = new COSStream();
> >      stream.setItem(COSName.N, COSInteger.TWO);
> >      stream.setItem(COSName.FIRST, COSInteger.get(8));
> >      OutputStream outputStream = stream.createOutputStream();
> >      outputStream.write("6 0 4 1 12".getBytes());
> >      outputStream.close();
> >      PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >      Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >      assertEquals(2, objectNumbers.size());
> >      assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >      assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> > }
> >
> > with error:
> > org.opentest4j.AssertionFailedError:
> > Expected :COSInt{1}
> > Actual   :COSInt{12}
> >
> > at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> > ...
> > at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
> > ...
> > notes:
> >
> > a- the second object (number = 4) now has 1 as its offset, and '1' and '2'
> > are now 'joined', so the first object is read as 12.
> >
> > b- the file was created in November last year and converted from
> > word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I
> > expect to see such (valid) pdf constructs more often in the (near) future.
> >
> > @ (Tilman & Andreas): I was able to get pdfbox working by changing the
> > PDFObjectStreamParser implementation: rewriting the
> > privateReadObjectOffsets() method to return an array, and using a parser
> > that does not read beyond the implicit limit given by the next object's
> > offset. let me know if you would like to see this change.
> >
> > thank you,
> >
>
>
