Hi, I’m intermittently getting errors while loading PDF documents with the PDDocument.load() method. Unfortunately I can’t reproduce the problem reliably; I just see it happen from time to time. When I retry in the same process (and on the same machine), it usually fails again. When I retry later, it usually just works.
The document files are large, about 2 to 3 GB on average. The (virtual) machine where the process runs can use up to 20 GB of memory. The stack trace and error message are always the same (although the error occurs at different places where PDDocument.load() is called) and look like this:

java.io.IOException: Requested page with index 2 was not written before.
    at org.apache.pdfbox.io.ScratchFile.readPage(ScratchFile.java:324)
    at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:177)
    at org.apache.pdfbox.io.ScratchFileBuffer.read(ScratchFileBuffer.java:426)
    at org.apache.pdfbox.pdfparser.COSParser.isString(COSParser.java:2478)
    at org.apache.pdfbox.pdfparser.COSParser.bfSearchForLastEOFMarker(COSParser.java:1871)
    at org.apache.pdfbox.pdfparser.COSParser.bfSearchForObjects(COSParser.java:1556)
    at org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:2196)
    at org.apache.pdfbox.pdfparser.COSParser.retrieveTrailer(COSParser.java:281)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1222)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1122)

(It’s always “index 2”.)

Can anybody give me a hint as to why this error might occur? Could it be some (hidden) out-of-memory or out-of-disk-space issue? Could it be a PDFBox bug? Could it be some timing / caching / buffering issue? Or something else (what?)?

Thanks for any hint.

Best regards,
Stefan
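
P.S. One way to test the out-of-memory and out-of-disk-space hypotheses would be to log heap and temp-directory headroom right before each load attempt and compare the values for failing and succeeding runs. The following is only a minimal sketch using the standard library. The class and method names (LoadDiagnostics, snapshot) are made up for illustration, and it assumes that any scratch file would be written under java.io.tmpdir, which may not be true depending on the memory settings passed to PDDocument.load().

```java
import java.io.File;

// Hypothetical helper, not part of PDFBox: reports heap and temp-dir headroom.
final class LoadDiagnostics {

    private static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds a one-line snapshot intended to be logged right before a load attempt. */
    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory(); // heap currently in use
        long maxHeap = rt.maxMemory();                      // upper bound the JVM may grow to

        // Assumption: temporary files end up in the JVM default temp directory.
        File tmpDir = new File(System.getProperty("java.io.tmpdir"));
        long usableTmp = tmpDir.getUsableSpace();           // bytes available to this JVM

        return String.format(
                "heapUsedMiB=%d heapMaxMiB=%d tmpDir=%s tmpUsableMiB=%d",
                toMiB(usedHeap), toMiB(maxHeap), tmpDir.getPath(), toMiB(usableTmp));
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

If the values logged before failing loads look no different from those before successful ones, resource exhaustion becomes less likely as the cause, although it would not be ruled out entirely.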