pdfobjectstreamparser fail to parse content

mountain the blue Mon, 17 Feb 2025 08:16:39 -0800

hi,

first of all, many thanks to the contributors of the pdfbox project, which
I've been using for a long time for anything relating to pdf in java.


I am using pdfbox to process various pdf files.
lately, I received a file that failed to parse, i.e.:
...
Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
...

Looking further into this error, the cause is the parsing of
/ObjStm ... which expects each object serialised in the stream to be
followed by a separator (i.e. white space), whereas this pdf has some
COS objects serialised without any such separation.
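
To make the symptom concrete, here is a minimal sketch of a token reader that consumes digits until it meets a non-digit. GreedyReadDemo and readGreedyInt are hypothetical names used only for illustration; this is a simplified stand-in and not the PDFBox implementation.

```java
import java.nio.charset.StandardCharsets;

/** Simplified stand-in for a token reader (hypothetical, not PDFBox code). */
final class GreedyReadDemo {

    /** Reads an unsigned integer at pos, consuming digits until a non-digit or the end of data. */
    static long readGreedyInt(byte[] data, int pos) {
        long value = 0;
        int i = pos;
        while (i < data.length && data[i] >= '0' && data[i] <= '9') {
            value = value * 10 + (data[i] - '0');
            i++;
        }
        if (i == pos) {
            throw new IllegalArgumentException("no digit at position " + pos);
        }
        return value;
    }

    public static void main(String[] args) {
        // Object data as it appears after the /First bytes of the stream.
        byte[] spaced = "1 2".getBytes(StandardCharsets.US_ASCII); // objects at offsets 0 and 2
        byte[] packed = "12".getBytes(StandardCharsets.US_ASCII);  // objects at offsets 0 and 1

        System.out.println(readGreedyInt(spaced, 0)); // 1
        System.out.println(readGreedyInt(packed, 0)); // 12: the two objects are joined
        System.out.println(readGreedyInt(packed, 1)); // 2
    }
}
```

With a separator, the read stops at the white space. Without one, the read that starts at offset 0 runs into the bytes of the next object.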

in the current code base, accessible on GitHub, the following test passes:

@Test
void testParse2NumberObjects() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // header "6 0 4 2 " is 8 bytes (/First = 8): object 6 at offset 0, object 4 at offset 2
    outputStream.write("6 0 4 2 1 2".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}


while this one fails:

@Test
void testParse2NumberObjectsNoSpace() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // header "6 0 4 1 " is 8 bytes (/First = 8): object 6 at offset 0, object 4 at offset 1,
    // so the object data "12" holds '1' and '2' with no separator between them
    outputStream.write("6 0 4 1 12".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}

with error:
org.opentest4j.AssertionFailedError:
Expected :COSInt{1}
Actual   :COSInt{12}

at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
...
at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
...
notes:

a- the second object (number = 4) now has 1 as its offset, so '1' and '2'
sit next to each other and get 'joined' by the parser.

b- the file was created in November last year and converted from
word to pdf by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in': I do
expect to see such (valid) pdf constructs more often in the (near) future.

@ (Tilman & Andreas): I was able to get pdfbox working by changing the
PDFObjectStreamParser implementation: I rewrote the
privateReadObjectOffsets() method to return an array and used a parser
that does not read beyond the implicit limit given by the next object's
offset. let me know if you would like access to this change.
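
For illustration only, the bounded-parsing idea can be sketched without any pdfbox types. This is not the actual change: BoundedObjStmSketch and sliceObjects are hypothetical names, the objects are returned as raw text rather than COSBase, and the sketch assumes the offsets in the header are listed in increasing order.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: cut each object out of a decoded object stream by its offsets. */
final class BoundedObjStmSketch {

    /**
     * @param decoded decoded bytes of the object stream
     * @param n       value of /N (number of objects)
     * @param first   value of /First (byte position of the first object)
     * @return object number mapped to the raw text of that object
     */
    static Map<Integer, String> sliceObjects(byte[] decoded, int n, int first) {
        // The header holds n pairs of "objectNumber offset".
        String header = new String(decoded, 0, first, StandardCharsets.US_ASCII).trim();
        String[] tokens = header.split("\\s+");
        if (tokens.length < 2 * n) {
            throw new IllegalArgumentException("header has fewer than " + n + " pairs");
        }
        int[] numbers = new int[n];
        int[] offsets = new int[n];
        for (int i = 0; i < n; i++) {
            numbers[i] = Integer.parseInt(tokens[2 * i]);
            offsets[i] = Integer.parseInt(tokens[2 * i + 1]);
        }

        Map<Integer, String> objects = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            int start = first + offsets[i];
            // Assumption: offsets increase, so the next offset bounds the current object.
            int end = (i + 1 < n) ? first + offsets[i + 1] : decoded.length;
            String raw = new String(decoded, start, end - start, StandardCharsets.US_ASCII);
            objects.put(numbers[i], raw.trim());
        }
        return objects;
    }

    public static void main(String[] args) {
        byte[] packed = "6 0 4 1 12".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sliceObjects(packed, 2, 8)); // {6=1, 4=2}
    }
}
```

Each slice could then be handed to a regular object parser, which can no longer run past the end of the object it is meant to read.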

thank you,
