Hi Andreas,

Currently, for each object, I read its bytes into a byte[] and use those bytes as the content for a new parser. For the last referenced object, though, I leave the default stream for reading.
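
A minimal sketch of that idea, for illustration only: it is not the actual patch and does not use PDFBox's parser classes. The class name ObjStmSlicer and the method slice() are hypothetical, and the sketch assumes the offsets read from the object stream header are ascending and valid.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: cuts the data section of a decoded
// object stream into one byte[] per object, using the offsets from the header.
class ObjStmSlicer {

    // decoded: the decoded /ObjStm content
    // first:   the value of /First (where the object data starts)
    // offsets: per-object offsets relative to /First, assumed ascending
    static byte[][] slice(byte[] decoded, int first, int[] offsets) {
        byte[][] parts = new byte[offsets.length][];
        for (int i = 0; i < offsets.length; i++) {
            int start = first + offsets[i];
            // An object ends where the next one starts; the last one runs
            // to the end of the stream.
            int end = (i + 1 < offsets.length) ? first + offsets[i + 1] : decoded.length;
            parts[i] = Arrays.copyOfRange(decoded, start, end);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Same layout as the failing test: header "6 0 4 1 ", data "12".
        byte[] decoded = "6 0 4 1 12".getBytes(StandardCharsets.ISO_8859_1);
        byte[][] parts = slice(decoded, 8, new int[] {0, 1});
        for (byte[] part : parts) {
            System.out.println(new String(part, StandardCharsets.ISO_8859_1));
        }
    }
}
```

Each slice can then be handed to its own parser, so a token can never run past the start of the next object. With broken offsets (start beyond end, or beyond the stream length) Arrays.copyOfRange throws, so real code would need a fallback for such files.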
Let me know if I can help by sharing the update I made: it is currently on a different branch, but if you prefer, I can apply the changes to the trunk. Let me know what would help you best.

On Fri 21 Feb 2025 at 08:28, Andreas Lehmkühler <andr...@lehmi.de.invalid> wrote:

> Thanks for the sample data. I got your point.
>
> I'm afraid it isn't that easy to fix.
>
> It's easy to split up the stream in pieces based on the start offsets of
> the trailing data. But the parser doesn't work like that, it simply
> reads from a certain starting point and stops at the end of an object.
> And another issue might be that there will be some PDFs with broken
> offsets, so that it might be a bad idea to work with split pieces.
>
> However, let me think about it. Maybe I'll come up with an idea
>
> Andreas
>
> On 20.02.25 at 10:07, mountain the blue wrote:
> > Hi Andreas,
> >
> > I forgot to tell ...
> >
> > looking at the contained 13679_stream.dat, we see the first offsets for the
> > objects ...
> >
> > 5550 0 5551 5 5552 *7* 5553 *11* 5554 16
> >
> > the actual data starts at offset 1990 (see the /First 1990 in the
> > 13679_objstm.raw).
> > looking at the content at offset 1990 in 13679_stream.dat, we have ...
> >
> > offset - 1990:     0      7    11
> > data         : ... 3505 4 248*03*505 ...
> >
> > as we see, the bold '0' is the end of object 5552 ... while object 5553
> > starts at the bold '3': no separator between both tokens
> >
> >
> > On Wed, Feb 19, 2025 at 9:36 PM mountain the blue <thebluemount...@gmail.com>
> > wrote:
> >
> >> Hi Andreas,
> >>
> >> Sorry for this delayed response.
> >>
> >> re: "Of course. it is better than nothing"
> >>
> >> I have uploaded a zip file carrying:
> >> 1- the raw extract of the /ObjStm object and related stream content
> >> 2- the decoded content stream
> >>
> >> it is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj
> >>
> >> re: "Old version of what major version?..."
> >> Yes, this is a 2.x base
> >> ... that borrowed some of the 3.x parsing you were working on, as it was
> >> solving numerous issues we were encountering then
> >>
> >>
> >> On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler
> >> <andr...@lehmi.de.invalid> wrote:
> >>
> >>>
> >>>
> >>> On 17.02.25 at 22:16, mountain the blue wrote:
> >>>> hi Andreas,
> >>>>
> >>>> re: 'is there any chance ...'
> >>>> I would have to ask the owner (a company) for authorisation and I doubt
> >>>> I could have it sent quickly.
> >>>> I can, though, share the actual /ObjStm content, decompressed; let me know
> >>>> if this would help you.
> >>> Of course. it is better than nothing
> >>>
> >>>> re: "which version ..."
> >>>> I am using an old version ... (that I am patching myself) ...
> >>>> I can, however, reproduce it with current code on the trunk branch ...
> >>>> (therefore, the 2 unit tests to exhibit the current behavior)
> >>> Old version of what major version? 2.x or 3.x?
> >>>
> >>>
> >>>> On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <andr...@lehmi.de.invalid>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> is there any chance to get a hand on the pdf in question?
> >>>>>
> >>>>> Which version of PDFBox are you using?
> >>>>>
> >>>>> Andreas
> >>>>>
> >>>>> On 17.02.25 at 17:16, mountain the blue wrote:
> >>>>>> hi,
> >>>>>>
> >>>>>> first of all, many thanks to the contributors of the pdfbox project that
> >>>>>> I've been using for a long time for anything relating to pdf in java.
> >>>>>>
> >>>>>> I am using pdfbox to process various pdf files.
> >>>>>> lately, I received a file whose parsing failed:
> >>>>>> ie:
> >>>>>> ...
> >>>>>> Exception in thread "main" java.io.IOException: Error: Unknown annotation
> >>>>>> type COSInt{49633506}
> >>>>>> at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
> >>>>>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
> >>>>>> at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
> >>>>>> ...
> >>>>>>
> >>>>>> Looking further into this error, the cause was the parsing of
> >>>>>> /ObjStm ... which expects each object serialised in the stream to have
> >>>>>> a separator (i.e. white space), while the
> >>>>>> pdf had some COS objects serialised without any such separation
> >>>>>>
> >>>>>> in the current code base, accessible on GitHub, the following test passes:
> >>>>>>
> >>>>>> @Test
> >>>>>> void testParse2NumberObjects() throws IOException
> >>>>>> {
> >>>>>>     COSStream stream = new COSStream();
> >>>>>>     stream.setItem(COSName.N, COSInteger.TWO);
> >>>>>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>>>>     OutputStream outputStream = stream.createOutputStream();
> >>>>>>     outputStream.write("6 0 4 2 1 2".getBytes());
> >>>>>>     outputStream.close();
> >>>>>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>>>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>>>>     assertEquals(2, objectNumbers.size());
> >>>>>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>>>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> while this one fails:
> >>>>>>
> >>>>>> @Test
> >>>>>> void testParse2NumberObjectsNoSpace() throws IOException
> >>>>>> {
> >>>>>>     COSStream stream = new COSStream();
> >>>>>>     stream.setItem(COSName.N, COSInteger.TWO);
> >>>>>>     stream.setItem(COSName.FIRST, COSInteger.get(8));
> >>>>>>     OutputStream outputStream = stream.createOutputStream();
> >>>>>>     outputStream.write("6 0 4 1 12".getBytes());
> >>>>>>     outputStream.close();
> >>>>>>     PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
> >>>>>>     Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
> >>>>>>     assertEquals(2, objectNumbers.size());
> >>>>>>     assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
> >>>>>>     assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
> >>>>>> }
> >>>>>>
> >>>>>> with error:
> >>>>>> org.opentest4j.AssertionFailedError:
> >>>>>> Expected :COSInt{1}
> >>>>>> Actual   :COSInt{12}
> >>>>>>
> >>>>>> at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> >>>>>> ...
> >>>>>> at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
> >>>>>> ...
> >>>>>> notes:
> >>>>>>
> >>>>>> a- the second object (number = 4) now indicates 1 as its offset, and both
> >>>>>> '1' and '2' are now 'joined'.
> >>>>>>
> >>>>>> b- the file was created in November last year and converted from
> >>>>>> word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I do
> >>>>>> expect to see such (valid) pdf construction more often in the (near) future.
> >>>>>>
> >>>>>> @ (Tilman & Andreas): I was able to get pdfbox working by changing the
> >>>>>> PDFObjectStreamParser implementation, rewriting the
> >>>>>> privateReadObjectOffsets() method to return an array and using a parser
> >>>>>> that does not parse beyond the implicit limit given by the next object's
> >>>>>> offset. let me know if you want to access this change.
> >>>>>>
> >>>>>> thank you,
> >>>>>>
> >>>>>
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org