On Thu, 17 May 2012, Alec Swan wrote:
1. We don't know how to tell whether we have enough heap space to process the file, and how to skip the file in that case. Allowing out of memory errors to take down our process is not acceptable.
In that kind of situation, you should be looking at using something like the ForkParser or the Tika server, so the parsing happens in a separate process and an out of memory error there can't take down your main JVM.
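A minimal sketch of the ForkParser route, assuming a recent Tika on the classpath; the file name here is just a placeholder, not something from this thread:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.fork.ForkParser;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    public class ForkParserExample {
        public static void main(String[] args) throws Exception {
            // Parsing runs in a forked child JVM, so an OutOfMemoryError
            // inside the parser cannot kill this process.
            ForkParser parser = new ForkParser(
                    ForkParserExample.class.getClassLoader(),
                    new AutoDetectParser());
            try {
                ContentHandler handler = new BodyContentHandler(-1); // no write limit
                Metadata metadata = new Metadata();
                // "large.pdf" is a placeholder path for illustration only
                InputStream stream = new FileInputStream("large.pdf");
                try {
                    parser.parse(stream, handler, metadata, new ParseContext());
                    System.out.println(handler.toString());
                } catch (Exception e) {
                    // A crashed or memory-exhausted child JVM surfaces here as an
                    // exception, so you can log it and skip the file.
                    System.err.println("Skipping file: " + e);
                } finally {
                    stream.close();
                }
            } finally {
                parser.close();
            }
        }
    }

The Tika server works along similar lines: you POST the document to a separate server process over HTTP, so a bad file only affects that process.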
2. When we use 1024MB of heap and try to parse a large PDF file, at some point it starts printing the following error non-stop. In fact, I forgot to kill my process and it ran overnight, printing this every second or so:

May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
That looks like a PDFBox bug; you should try reporting it upstream.

Nick
