I get a "Java heap space" OutOfMemoryError when I try to load a 25 MB doc using
Apache's PDFBox. Smaller docs load without any problem. I have stripped the code
down to just loading the doc and printing the number of pages. I also tried
increasing the heap size. Here is the code:
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;

public class LoadPDF {
    private static String pdfFilename = "My24MBFile.pdf";
    //private static String pdfFilename = "MyTinyFile.pdf";

    public void runLoadPDF(String inPDF_Filename) {
        PDDocument doc = null;
        try {
            System.out.println("Just BEFORE load Document");
            doc = PDDocument.load(inPDF_Filename);
            System.out.println("Just AFTER load Document");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("number of pages is: " + doc.getNumberOfPages() );
    }

    public static void main(String[] args){
        LoadPDF readPDF = new LoadPDF();
        readPDF.runLoadPDF(pdfFilename);
    }
}
Here is the error from the system console in Eclipse:
Just BEFORE load Document
org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1036)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:961)
at LoadPDF.runLoadPDF(LoadPDF.java:13)
at LoadPDF.main(LoadPDF.java:25)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
at java.io.BufferedOutputStream.flush(Unknown Source)
at java.io.FilterOutputStream.close(Unknown Source)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:448)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
... 5 more
Exception in thread "main" java.lang.NullPointerException
at LoadPDF.runLoadPDF(LoadPDF.java:19)
at LoadPDF.main(LoadPDF.java:25)
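The "Java heap space" line in the trace means the JVM reached its configured maximum heap. As a sanity check on the heap-size change, a small program like the sketch below prints the limit the running JVM actually sees. The class name HeapCheck is made up for illustration. It would need to be launched the same way as LoadPDF, since in Eclipse the -Xmx option has to be given in the run configuration's VM arguments; the value in eclipse.ini only applies to the IDE itself.

```java
class HeapCheck {

    // Maximum heap the JVM will attempt to use, in whole megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

If the printed value does not match the -Xmx setting, the option is probably not reaching the JVM that runs the program.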
It seems to me that the whole doc is being loaded into memory. If this is so and
I can't even load a 25 MB doc, then I am in real trouble, because we have much
bigger docs to load (hundreds of MB). Does anyone know if this is analogous to
parsing XML docs with a DOM parser? If so, is there an equivalent of a SAX
parser in either PDFBox or any other PDF library?
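To make the DOM/SAX comparison concrete, here is a minimal sketch that uses only the XML parsers shipped with the JDK (nothing from PDFBox). The class name DomVsSax and the tiny XML sample are made up for illustration. The DOM version builds the complete tree in memory before anything can be counted, while the SAX version sees one element at a time and keeps no tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSax {

    // DOM: the whole document is parsed into an in-memory tree first.
    static int countWithDom(byte[] xml, String tag) throws Exception {
        Document tree = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return tree.getElementsByTagName(tag).getLength();
    }

    // SAX: elements are reported one at a time through callbacks; no tree is kept.
    static int countWithSax(byte[] xml, final String tag) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        if (qName.equals(tag)) {
                            count[0]++;
                        }
                    }
                });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><page/><page/><page/></doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("DOM count: " + countWithDom(xml, "page"));
        System.out.println("SAX count: " + countWithSax(xml, "page"));
    }
}
```

Both methods give the same count for the sample; the difference is only in how much of the document has to be held in memory at once, which is the property in question for large PDFs.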
Thanks in advance for any help or advice.
- Frank