Could you please clarify the "fork parser" and "Tika server" concepts? Do both of them require spawning and managing external processes that perform the actual file parsing?
On Thu, May 17, 2012 at 10:12 AM, Nick Burch <[email protected]> wrote:
> On Thu, 17 May 2012, Alec Swan wrote:
>>
>> 1. We don't know how to tell if we don't have enough heap space to
>> process the file and skip the file in this case. Allowing out of
>> memory errors take down our process is not acceptable.
>
> In that kind of situation, you should be looking at using something like
> the fork parser or the tika server
>
>> 2. When we use 1024MB of heap and try to parse a large PDF file at
>> some point it starts printing the following error non-stop. In fact I
>> forgot to kill my process and it ran over night printing this every
>> second or so:
>> May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
>> SEVERE: Stop reading corrupt stream
>
> That looks like a PDFBox bug, you should try reporting that upstream
>
> Nick
