We use PDFBox to split PDFs into their individual pages for further processing. On one of the files we have, we receive the following error when processing with 3.0.0-RC1:

java.lang.IllegalStateException: Expected 'Page' but found COSName{Annot}: java.lang.RuntimeException
java.lang.RuntimeException: java.lang.IllegalStateException: Expected 'Page' but found COSName{Annot}
    at com.me.handleRequest(RequestHandler.java:59)
Caused by: java.lang.IllegalStateException: Expected 'Page' but found COSName{Annot}
    at org.apache.pdfbox.pdmodel.PDPageTree.sanitizeType(PDPageTree.java:261)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$400(PDPageTree.java:43)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:219)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:167)
    at org.apache.pdfbox.multipdf.Splitter.processPages(Splitter.java:149)
    at org.apache.pdfbox.multipdf.Splitter.split(Splitter.java:89)
    at com.me.PdfSplit.lambda$splitBatch$3(PdfSplit.java:151)

Looking at the latest release notes for 3.0.0-alpha3, we tried upgrading to see whether it would handle this error a little better. When we ran the file through, it *appeared* to go through successfully, but PDFBox actually seems to fail *silently*: when it comes across these pages, it skips them instead of raising an IllegalStateException.

2023-01-11 20:31:01 <d7f4c620-e3ea-4560-9dae-3e2866118809> ERROR org.apache.pdfbox.pdmodel.PDPageTree:208 - Page skipped due to an invalid or missing type COSName{Annot}
2023-01-11 20:31:01 <d7f4c620-e3ea-4560-9dae-3e2866118809> ERROR org.apache.pdfbox.pdmodel.PDPageTree:208 - Page skipped due to an invalid or missing type COSName{XObject}

My question is: is this silent failure intentional? We realize that, given the wide variety of content we receive, there are going to be "bad" PDFs, which is fine. However, we currently rely on PDFBox to tell us *when* a file is something we should not continue post-processing. If it fails silently, we assume that, because nothing blew up, we have received all of the pages.
If we were to go to alpha3, this would no longer be a valid assumption.

Effectively, we loop through a PDF to extract pages like so:

    Splitter splitter = new Splitter();
    for (PDDocument page : splitter.split(document)) {
        // save each page for consumption later
    }

Thanks in advance for any information that you can provide regarding what we should expect from this behavior.

- Levi
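
One possible caller-side stopgap, as a rough sketch rather than a confirmed fix: compare the number of pages actually received against the number the document declares, and fail loudly on a mismatch. `PageCountGuard` and `collectAll` are hypothetical names, not part of PDFBox. The sketch assumes the declared count is available separately from iteration and that skipped pages are simply absent from the split result.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): restores fail-fast behavior on the caller side. */
final class PageCountGuard {
    private PageCountGuard() {}

    /**
     * Collects every element of {@code pages} and throws if the number
     * collected differs from {@code expectedCount}.
     */
    static <T> List<T> collectAll(Iterable<T> pages, int expectedCount) {
        List<T> collected = new ArrayList<>();
        for (T page : pages) {
            collected.add(page);
        }
        if (collected.size() != expectedCount) {
            // Assumption: a shortfall means pages were skipped upstream.
            throw new IllegalStateException(
                "Expected " + expectedCount + " pages but received " + collected.size());
        }
        return collected;
    }
}
```

With the loop above, this might be called as `PageCountGuard.collectAll(splitter.split(document), document.getNumberOfPages())`. That usage assumes `getNumberOfPages()` reports the count declared in the page tree rather than the pages that survive iteration, which we have not verified for alpha3. A file whose declared count is itself wrong would also trip the check.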