We use PDFBox to split PDFs into their individual pages for further
processing. On one of our files, we receive the following error
when processing with 3.0.0-RC1:

java.lang.IllegalStateException: Expected 'Page' but found COSName{Annot}: java.lang.RuntimeException
java.lang.RuntimeException: java.lang.IllegalStateException: Expected 'Page' but found COSName{Annot}
        at com.me.handleRequest(RequestHandler.java:59)
Caused by: java.lang.IllegalStateException: Expected 'Page' but found COSName{Annot}
        at org.apache.pdfbox.pdmodel.PDPageTree.sanitizeType(PDPageTree.java:261)
        at org.apache.pdfbox.pdmodel.PDPageTree.access$400(PDPageTree.java:43)
        at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:219)
        at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:167)
        at org.apache.pdfbox.multipdf.Splitter.processPages(Splitter.java:149)
        at org.apache.pdfbox.multipdf.Splitter.split(Splitter.java:89)
        at com.me.PdfSplit.lambda$splitBatch$3(PdfSplit.java:151)

Looking at the latest release notes for 3.0.0-alpha3, we tried upgrading to
see if it would handle this error a little better. When we ran the file
through, it *appeared* to be processed successfully, but it seems to fail
*silently*: when it comes across these pages, it just ignores them instead
of raising an IllegalStateException.

2023-01-11 20:31:01 <d7f4c620-e3ea-4560-9dae-3e2866118809> ERROR org.apache.pdfbox.pdmodel.PDPageTree:208 - Page skipped due to an invalid or missing type COSName{Annot}
2023-01-11 20:31:01 <d7f4c620-e3ea-4560-9dae-3e2866118809> ERROR org.apache.pdfbox.pdmodel.PDPageTree:208 - Page skipped due to an invalid or missing type COSName{XObject}

My question is: is this silent failure intentional? We realize that, given
the wide variety of content we receive, there are going to be "bad" PDFs,
which is fine. Currently, however, we rely on PDFBox to tell us *when* a
file is something we shouldn't continue post-processing. If it fails
silently, nothing blows up, and we assume that we've received all of the
pages. If we were to go to alpha3, this assumption would no longer hold.

Effectively we loop through a PDF to extract pages like so:

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

Splitter splitter = new Splitter();
for (PDDocument page : splitter.split(document)) {
  // save each page for consumption later
}
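
A possible guard against silently skipped pages, as a minimal sketch only:
compare the page count the document declares with the number of documents
the split actually produced, and fail loudly on a mismatch. The names
PageCountCheck and requireAllPages are hypothetical, and the sketch assumes
that the declared count still includes the pages alpha3 skips, which is not
confirmed here.

```java
/** Hypothetical helper: fails loudly when the split yields a different page count than declared. */
final class PageCountCheck {
    private PageCountCheck() {}

    /**
     * @param declaredPages page count the source document reports (assumed to include skipped pages)
     * @param splitPages    number of single-page documents the split actually produced
     * @throws IllegalStateException if the two counts differ
     */
    static void requireAllPages(int declaredPages, int splitPages) {
        if (declaredPages != splitPages) {
            throw new IllegalStateException("Expected " + declaredPages
                    + " pages but the split produced " + splitPages);
        }
    }
}
```

With PDFBox, the two arguments could come from document.getNumberOfPages()
and splitter.split(document).size(), assuming the splitter's default of one
page per output document.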

Thanks in advance for any information that you can provide regarding our
expectations of this behavior.

- Levi
