Thanks for the sample data. I got your point.

I'm afraid it isn't that easy to fix.

It's easy to split the stream into pieces based on the start offsets of the trailing data. But the parser doesn't work like that: it simply reads from a given starting point and stops at the end of an object. Another issue is that there will be some PDFs with broken offsets, so relying on the split pieces might be a bad idea.
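
To make the broken-offsets concern concrete, here is a rough sketch of a sanity check that could run before any offset-based splitting. It is plain Java, not PDFBox code; the class name, the method name and the sample data length are assumptions made up for illustration.

```java
// Hypothetical stand-alone sketch; nothing here is part of PDFBox.
final class OffsetSanityCheck
{
    /**
     * Returns true only if the offsets are non-negative, strictly increasing
     * and all point inside the data that follows the header.
     */
    static boolean offsetsLookUsable(int[] offsets, int dataLength)
    {
        int previous = -1;
        for (int offset : offsets)
        {
            if (offset <= previous || offset >= dataLength)
            {
                return false;
            }
            previous = offset;
        }
        return true;
    }

    public static void main(String[] args)
    {
        // Offsets as listed in the reported object stream; the length 100 is an assumed value.
        System.out.println(offsetsLookUsable(new int[] { 0, 5, 7, 11, 16 }, 100)); // true
        // Out-of-order offsets are rejected.
        System.out.println(offsetsLookUsable(new int[] { 0, 7, 5 }, 100)); // false
    }
}
```

If such a check fails, a parser could fall back to reading each object sequentially as it does today, so files with unreliable offsets would not get worse.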

However, let me think about it. Maybe I'll come up with an idea.

Andreas

On 20.02.25 at 10:07, mountain the blue wrote:
Hi Andreas,

I forgot to tell ...

looking at the contained 13679_stream.dat, we see the first offsets for the
objects ...

5550 0 5551 5 5552 *7* 5553 *11* 5554 16

The actual data starts at offset 1990 (see the /First 1990 in
13679_objstm.raw).
Looking at the content at offset 1990 in 13679_stream.dat, we have ...

offset - 1990: 0       7         11
data  :        ... 3505 4 248*03*505 ...

As we can see, the bold '0' is the end of object 5552, while object 5553
starts at the bold '3': there is no separator between the two tokens.


On Wed, Feb 19, 2025 at 9:36 PM mountain the blue <thebluemount...@gmail.com> wrote:

Hi Andreas,

Sorry for this delayed response.

re: "Of course. it is better than nothing"

I have uploaded a zip file carrying:
1- the raw extract of the /ObjStm object and related stream content
2- the decoded content stream

it is accessible for 6 days @ https://filebin.net/8lz0dqbyhmiif1jj

re: "Old version of what major version?..."
Yes, this is a 2.x base
... that borrowed some of the 3.x parsing you were working on, as it was
solving numerous issues we were encountering at the time


On Tue, Feb 18, 2025 at 7:57 AM Andreas Lehmkühler <andr...@lehmi.de.invalid> wrote:



On 17.02.25 at 22:16, mountain the blue wrote:
hi Andreas,

re: 'is there any chance ...'
I would have to ask the owner (a company) for authorisation, and I doubt I could have it sent quickly.
I can, though, share the actual /ObjStm content, decompressed; let me know if this would help you.

Of course, it is better than nothing.

re: "which version ..."
I am using an old version ... (that I am patching myself) ...
I can, however, reproduce it with current code on the trunk branch ...
(therefore, the 2 unit tests to exhibit the current behavior)

Old version of what major version? 2.x or 3.x?


On Mon, Feb 17, 2025 at 6:02 PM Andreas Lehmkühler <andr...@lehmi.de.invalid> wrote:

Hi,

Is there any chance of getting hold of the PDF in question?

Which version of PDFBox are you using?

Andreas

On 17.02.25 at 17:16, mountain the blue wrote:
hi,

First of all, many thanks to the contributors of the PDFBox project, which I have been using for a long time for anything relating to PDF in Java.

I am using PDFBox to process various PDF files.
Lately, I received a file whose parsing failed, i.e.:
...
Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
...

Looking further into this error, the cause lies in the parsing of /ObjStm, which expects each object serialised in the stream to be followed by a separator (i.e. white space), while the PDF had some COS objects serialised without such separation.

In the current code base, accessible on GitHub, the following test passes:

@Test
void testParse2NumberObjects() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // Header "6 0 4 2 " (8 bytes = /First): object 6 at offset 0, object 4 at offset 2; data "1 2".
    outputStream.write("6 0 4 2 1 2".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}


while this one fails:

@Test
void testParse2NumberObjectsNoSpace() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // Header "6 0 4 1 ": object 4 is now at offset 1; the data "12" has no separator.
    outputStream.write("6 0 4 1 12".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}

with error:
org.opentest4j.AssertionFailedError:
Expected :COSInt{1}
Actual   :COSInt{12}

at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
...
at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
...
notes:

a- the second object (number = 4) now indicates 1 as its offset, and both '1' and '2' are now 'joined'.

b- the file was created in November last year and converted from Word to PDF by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I do expect to see such (valid) PDF constructions more often in the (near) future.

@ (Tilman & Andreas): I was able to get PDFBox working by changing the PDFObjectStreamParser implementation, rewriting the privateReadObjectOffsets() method to return an array and using a parser that does not parse beyond the implicit limit given by the next object's offset. Let me know if you want access to this change.
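
For illustration only, here is a minimal stand-alone sketch of that idea, not the actual patch. It uses plain Java instead of the PDFBox classes; the class name BoundedObjStmSketch and the method splitByOffsets are hypothetical, and it assumes the header lists the offsets in increasing order.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not PDFBox code and not the actual patch.
final class BoundedObjStmSketch
{
    /**
     * Splits a decoded object stream into one raw token per object number,
     * using the next object's offset as the end of the current object.
     * n and first correspond to the /N and /First entries of the stream.
     */
    static Map<Integer, String> splitByOffsets(byte[] data, int n, int first)
    {
        // The header holds n pairs "objectNumber offset", separated by white space.
        String header = new String(data, 0, first, StandardCharsets.ISO_8859_1).trim();
        String[] tokens = header.split("\\s+");
        int[] numbers = new int[n];
        int[] offsets = new int[n];
        for (int i = 0; i < n; i++)
        {
            numbers[i] = Integer.parseInt(tokens[2 * i]);
            offsets[i] = Integer.parseInt(tokens[2 * i + 1]);
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        for (int i = 0; i < n; i++)
        {
            int start = first + offsets[i];
            // The last object runs to the end of the stream.
            int end = (i + 1 < n) ? first + offsets[i + 1] : data.length;
            String raw = new String(data, start, end - start, StandardCharsets.ISO_8859_1);
            result.put(numbers[i], raw.trim());
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Same bytes as in testParse2NumberObjectsNoSpace: header "6 0 4 1 ", data "12".
        byte[] data = "6 0 4 1 12".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(splitByOffsets(data, 2, 8)); // prints {6=1, 4=2}
    }
}
```

Each slice would then be handed to the regular object parser, so that a number can never run into the first token of the following object.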

thank you,



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org




