On Thu, 17 May 2012, Alec Swan wrote:
1. We don't know how to tell when we don't have enough heap space to
process a file so that we can skip it in that case. Allowing out of
memory errors to take down our process is not acceptable.

In that kind of situation, you should be looking at using something like
the fork parser or the Tika server, so the heavy parsing runs outside
your main process.
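As a minimal sketch of the fork parser approach (assuming Tika's
org.apache.tika.fork.ForkParser with an AutoDetectParser delegate; the
unlimited write limit and the file-path handling are just illustrative
choices), parsing happens in a child JVM, so an OutOfMemoryError there
surfaces in your process as an ordinary exception you can catch and skip:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserSketch {
    public static void main(String[] args) throws Exception {
        // The actual parsing runs in a forked JVM, so an OOM in the
        // child cannot take down this process.
        ForkParser parser = new ForkParser(
                ForkParserSketch.class.getClassLoader(),
                new AutoDetectParser());
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            parser.parse(stream, handler, metadata, new ParseContext());
            System.out.println(handler.toString());
        } catch (Exception e) {
            // A crashed or OOM-killed child shows up here as an exception,
            // so the problem file can simply be logged and skipped.
            System.err.println("Skipping unparseable file: " + e);
        } finally {
            parser.close(); // shut down the forked parser pool
        }
    }
}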

2. When we use 1024 MB of heap and try to parse a large PDF file, at
some point it starts printing the following error non-stop. In fact, I
forgot to kill my process and it ran overnight, printing this every
second or so:
May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream

That looks like a PDFBox bug; you should try reporting it upstream.

Nick
