Nick, you were right. We tracked down the code that was swallowing the exception. After that I gave it 1024 MB of heap space, and it still ran out of memory while parsing a 60 MB DOCX.
Tika's parse() method takes an InputStream as a parameter, so why does it consume so much memory? Can't it stage the file behind the scenes? Does Tika always try to load the entire stream into memory?

On Wed, May 16, 2012 at 4:08 PM, Nick Burch <[email protected]> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Memory consumption stays under 90MB which is less than max heap size
>> (128M). No out-of-memory errors are thrown during test
>
>
> There is absolutely no way that you're going to be able to parse a PDF,
> DOC/DOCX or PPT/PPTX of more than about 20mb in size on a 128mb heap (and
> even that may be pushing it on some of them). Something is blowing up, I'd
> make sure you're not accidently eating the exception
>
> Nick
