hi,

first of all, many thanks to the contributors of the PDFBox project, which
I have been using for a long time for anything relating to PDF in Java.

I use PDFBox to process various PDF files.
Recently, I received a file that failed to parse:
...
Exception in thread "main" java.io.IOException: Error: Unknown annotation type COSInt{49633506}
at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:198)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:696)
at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:663)
...

Looking further into this error, the cause is in the parsing of /ObjStm
object streams: the parser expects each object serialised in the stream to
be followed by a separator (i.e. white space), while this PDF has some COS
objects serialised without any such separator.
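
To make the layout concrete, here is a minimal sketch (plain Java, not
PDFBox code; the class and method names are hypothetical) of how the N and
First entries plus the header pairs already determine where each object
starts and ends, even without a separator. It assumes the header offsets
are in increasing order and that the stream is already decoded.

```java
// Hypothetical helper, not part of PDFBox: computes the byte range of each
// object in a decoded object stream from the N and First dictionary entries.
final class ObjStmLayout
{
    /** Returns one {objectNumber, start, end} triple per object; end is exclusive. */
    static int[][] objectRanges(byte[] decoded, int n, int first)
    {
        // the header holds n pairs "objectNumber offset", separated by white space
        String header = new String(decoded, 0, first,
                java.nio.charset.StandardCharsets.ISO_8859_1);
        String[] tokens = header.trim().split("\\s+");
        int[][] ranges = new int[n][3];
        for (int i = 0; i < n; i++)
        {
            ranges[i][0] = Integer.parseInt(tokens[2 * i]);             // object number
            ranges[i][1] = first + Integer.parseInt(tokens[2 * i + 1]); // absolute start
        }
        for (int i = 0; i < n; i++)
        {
            // assumption: offsets are increasing, so an object ends where the
            // next one starts; the last object ends with the stream
            ranges[i][2] = i + 1 < n ? ranges[i + 1][1] : decoded.length;
        }
        return ranges;
    }

    public static void main(String[] args)
    {
        byte[] decoded = "6 0 4 1 12".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        for (int[] r : objectRanges(decoded, 2, 8))
        {
            String body = new String(decoded, r[1], r[2] - r[1],
                    java.nio.charset.StandardCharsets.ISO_8859_1);
            System.out.println("object " + r[0] + " -> \"" + body + "\"");
        }
    }
}
```

For the stream "6 0 4 1 12" with N = 2 and First = 8, this yields the range
[8, 9) for object 6 and [9, 10) for object 4, i.e. "1" and "2".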

in the current code base, accessible on GitHub, the following test passes:

@Test
void testParse2NumberObjects() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // header "6 0 4 2 " (8 bytes), then object 6 = "1" at offset 0
    // and object 4 = "2" at offset 2, separated by a space
    outputStream.write("6 0 4 2 1 2".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}


while this one fails:

@Test
void testParse2NumberObjectsNoSpace() throws IOException
{
    COSStream stream = new COSStream();
    stream.setItem(COSName.N, COSInteger.TWO);
    stream.setItem(COSName.FIRST, COSInteger.get(8));
    OutputStream outputStream = stream.createOutputStream();
    // header "6 0 4 1 " (8 bytes), then object 6 = "1" at offset 0
    // and object 4 = "2" at offset 1, with no separator between them
    outputStream.write("6 0 4 1 12".getBytes());
    outputStream.close();
    PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
    Map<COSObjectKey, COSBase> objectNumbers = objectStreamParser.parseAllObjects();
    assertEquals(2, objectNumbers.size());
    assertEquals(COSInteger.get(1), objectNumbers.get(new COSObjectKey(6, 0)));
    assertEquals(COSInteger.get(2), objectNumbers.get(new COSObjectKey(4, 0)));
}

with error:
org.opentest4j.AssertionFailedError:
Expected :COSInt{1}
Actual   :COSInt{12}

at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
...
at org.apache.pdfbox.pdfparser.PDFObjectStreamParserTest.testParse2NumberObjectsNoSpace(PDFObjectStreamParserTest.java:103)
...
notes:

a- in the failing test, the header entry for the second object (number 4)
now gives 1 as its offset, so '1' and '2' are adjacent with no separator
and the parser reads them as the single number 12.

b- the file was created in November last year and converted from Word to
PDF by 'Adobe Acrobat Pro (64-bit) 24 Paper Capture Plug-in', so I expect
to see this (valid) PDF construction more often in the near future.

@ (Tilman & Andreas): I got PDFBox working by changing the
PDFObjectStreamParser implementation: I rewrote the
privateReadObjectOffsets() method to return an array, and used a parser
that does not read beyond the implicit limit given by the next object's
offset. Let me know if you would like to see this change.
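
To illustrate the idea only (this is not my actual patch and not PDFBox
code; the class and method names are hypothetical), a bounded read could be
sketched as below. It assumes the offsets are sorted, are relative to the
start of the data passed in, and that the objects are non-negative integers.

```java
// Hypothetical sketch, not the PDFBox implementation: reads an integer
// without looking past a limit derived from the next object's offset.
final class BoundedIntReader
{
    /** Reads consecutive digits from start, never reading at or past limit. */
    static long readInt(byte[] data, int start, int limit)
    {
        int pos = start;
        long value = 0;
        while (pos < limit && data[pos] >= '0' && data[pos] <= '9')
        {
            value = value * 10 + (data[pos] - '0');
            pos++;
        }
        if (pos == start)
        {
            throw new IllegalArgumentException("no digit at position " + start);
        }
        return value;
    }

    /** offsets[i] is the start of object i in data; the last object ends at data.length. */
    static long[] readAll(byte[] data, int[] offsets)
    {
        long[] values = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++)
        {
            // the next object's offset is the implicit end of the current object
            int limit = i + 1 < offsets.length ? offsets[i + 1] : data.length;
            values[i] = readInt(data, offsets[i], limit);
        }
        return values;
    }

    public static void main(String[] args)
    {
        byte[] data = { '1', '2' };
        int[] offsets = { 0, 1 };
        // limited only by the end of the data: the two digits are joined
        System.out.println(readInt(data, 0, data.length));
        // limited by the next offset: the two objects are read separately
        long[] bounded = readAll(data, offsets);
        System.out.println(bounded[0] + " " + bounded[1]);
    }
}
```

With the object data "12" and offsets 0 and 1, the read limited only by the
end of the data returns 12, while the bounded read returns 1 and 2.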

thank you,
