Hi Andreas,

Sorry for the delayed response.

re: "Of course. it is better than nothing"
I have uploaded a zip file containing:
1- the raw extract of the /ObjStm object and its related stream content
2- the decoded content stream
It is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj

re: "Old version of what major version?..."
Yes, this is a 2.x base ... that borrowed some of the 3.x parsing you were
working on, as it was solving numerous issues we were encountering at the time.

On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler <andr...@lehmi.de.invalid>
wrote:

>
>
> On 17.02.25 at 22:16, mountain the blue wrote:
> > hi Andreas,
> >
> > re: 'is there any chance ...'
> > I would have to ask the owner (a company) for authorisation and I doubt
> > I could have it sent quickly.
> > I can, though, share the actual /ObjStm content, decompressed; let me know
> > if this would help you.
> Of course. it is better than nothing
>
> > re: "which version ..."
> > I am using an old version ... (that I am patching myself) ...
> > I can, however, reproduce it with current code on the trunk branch ...
> > (hence the 2 unit tests to exhibit the current behavior)
> Old version of what major version? 2.x or 3.x?
>
> >
> > On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <andr...@lehmi.de.invalid>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to get hold of the pdf in question?
> >>
> >> Which version of PDFBox are you using?
> >>
> >> Andreas
> >>
> >> On 17.02.25 at 17:16, mountain the blue wrote:
> >>> hi,
> >>>
> >>> first of all, many thanks to the contributors of the pdfbox project, which
> >>> I have been using for a long time for anything relating to pdf in java.
> >>>
> >>> I am using pdfbox to process various pdf files.
> >>> lately, I received a file whose parsing failed:
> >>> ie:
> >>> ...
> >>> Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
> >>> at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
> >>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
> >>> ...
> >>>
> >>> Looking further into this error, the cause is the parsing of
> >>> /ObjStm ..., which expects each object serialised in the stream to be
> >>> followed by a separator (i.e. white space), while this pdf has some
> >>> COS objects serialised without such separation.
> >>>
> >>> In the current code base, accessible on GitHub, the following test passes:
> >>>
> >>> @Test
> >>> void testParse2NumberObjects() throws IOException
> >>> {
> >>>     COSStream stream = new COSStream();
> >>>     stream.setItem(COSName.N, COSInteger.TWO);
> >>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>     OutputStream outputStream = stream.createOutputStream();
> >>>     // Header "6 0 4 2 " is 8 bytes (= /First); objects "1" and "2" follow.
> >>>     outputStream.write("6 0 4 2 1 2".getBytes());
> >>>     outputStream.close();
> >>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>     assertEquals(2, objectNumbers.size());
> >>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>> }
> >>>
> >>>
> >>> while this one fails:
> >>>
> >>> @Test
> >>> void testParse2NumberObjectsNoSpace() throws IOException
> >>> {
> >>>     COSStream stream = new COSStream();
> >>>     stream.setItem(COSName.N, COSInteger.TWO);
> >>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>     OutputStream outputStream = stream.createOutputStream();
> >>>     // Same header length, but object 4 now starts at offset 1: "1" and "2" are adjacent.
> >>>     outputStream.write("6 0 4 1 12".getBytes());
> >>>     outputStream.close();
> >>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>     assertEquals(2, objectNumbers.size());
> >>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>> }
> >>>
> >>> with error:
> >>> org.opentest4j.AssertionFailedError:
> >>> Expected :COSInt{1}
> >>> Actual :COSInt{12}
> >>>
> >>> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> >>> ...
> >>> at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
> >>> ...
> >>> notes:
> >>>
> >>> a- the second object (number = 4) now indicates 1 as its offset, and
> >>> both '1' and '2' are now 'joined'.
> >>>
> >>> b- the file was created in November last year and converted from
> >>> Word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I
> >>> do expect to see such (valid) pdf construction more often in the (near)
> >>> future.
> >>>
> >>> @ (Tilman & Andreas): I was able to get pdfbox working by changing the
> >>> PDFObjectStreamParser implementation, rewriting the
> >>> privateReadObjectOffsets() method to return an array and using a parser
> >>> that does not parse beyond the implicit limit given by the next object's
> >>> offset. Let me know if you want access to this change.
> >>>
> >>> thank you,
> >>>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>
> >>
> >
>
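
To illustrate the offset-bounded parsing idea mentioned in the original message, here is a minimal, self-contained sketch. It is not the actual patch and does not use any PDFBox classes: the class name ObjStmSlicer and the method slice are hypothetical, it works on the already decoded stream bytes, and it assumes the offsets in the header are listed in increasing order. Each object is cut at the next object's offset, so two adjacent numbers are never read as one token.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of PDFBox).
class ObjStmSlicer {

    /**
     * Splits decoded object stream content into one raw slice per object.
     *
     * @param data  decoded content of the object stream
     * @param n     value of /N (number of objects)
     * @param first value of /First (byte position of the first object)
     * @return object number mapped to the raw text of that object
     */
    static Map<Integer, String> slice(byte[] data, int n, int first) {
        // The header holds n pairs "objectNumber offset" before position 'first'.
        String header = new String(data, 0, first, StandardCharsets.ISO_8859_1).trim();
        String[] tokens = header.split("\\s+");
        if (tokens.length < 2 * n) {
            throw new IllegalArgumentException("header has fewer than " + n + " pairs");
        }
        int[] numbers = new int[n];
        int[] offsets = new int[n];
        for (int i = 0; i < n; i++) {
            numbers[i] = Integer.parseInt(tokens[2 * i]);
            offsets[i] = Integer.parseInt(tokens[2 * i + 1]);
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            int start = first + offsets[i];
            // Assumption: offsets increase, so the next offset bounds this object;
            // the last object runs to the end of the data.
            int end = (i + 1 < n) ? first + offsets[i + 1] : data.length;
            String raw = new String(data, start, end - start, StandardCharsets.ISO_8859_1);
            result.put(numbers[i], raw.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "6 0 4 1 12".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(slice(data, 2, 8)); // prints {6=1, 4=2}
    }
}
```

With the content from the failing test ("6 0 4 1 12", /N 2, /First 8), object 6 yields "1" and object 4 yields "2". A real fix would hand each bounded slice to the regular object parser instead of returning strings.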