Re: Tika fails to extract text from very large files

Alec Swan Wed, 16 May 2012 16:04:16 -0700

Nick, you were right. We tracked down the code that was swallowing the
exception. After that I gave it 1024 MB of heap space and it still ran
out of memory while parsing a 60 MB DOCX file.
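
For anyone else hitting this, here is a rough sketch of the kind of
pattern that can hide the failure. It is not our actual code and it does
not use Tika at all: the class and method names are made up for
illustration, and the parser is just a Callable that simulates running
out of heap. The point is only that OutOfMemoryError is an Error, so a
catch (Throwable) block that returns an empty result makes the run look
like a clean parse of an empty document.

```java
import java.util.concurrent.Callable;

class SwallowedErrorDemo {
    // Anti-pattern: catches everything, including OutOfMemoryError,
    // and reports an empty extraction instead of a failure.
    static String extractQuietly(Callable<String> parser) {
        try {
            return parser.call();
        } catch (Throwable t) {
            return "";
        }
    }

    // Alternative: log the error and rethrow it so the caller sees the failure.
    static String extractLoudly(Callable<String> parser) throws Exception {
        try {
            return parser.call();
        } catch (OutOfMemoryError e) {
            System.err.println("Parser ran out of heap: " + e);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a parser that exhausts the heap.
        Callable<String> failing = () -> { throw new OutOfMemoryError("simulated"); };

        System.out.println("quiet result: '" + extractQuietly(failing) + "'");
        try {
            extractLoudly(failing);
        } catch (OutOfMemoryError e) {
            System.out.println("loud result: " + e.getMessage());
        }
    }
}
```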


Tika's parse() method takes an InputStream as a parameter, so why
does it consume so much memory? Can't it stage the file behind the
scenes? Does Tika always try to load the entire stream into memory?
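
To be clear about what I mean by "staging", here is a minimal sketch
using only java.io and java.nio. It is not a description of what Tika
does internally; the class and method names are hypothetical. It copies
an incoming stream to a temporary file, so the copy step itself does not
need heap proportional to the document size. Whether a parser working
from that file would then use less memory is a separate question that
this sketch does not answer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamStagingDemo {
    // Copies the stream to a temporary file. Files.copy streams the bytes
    // through an internal buffer rather than holding the whole input in memory.
    static Path stageToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("staged-", ".bin");
        try {
            // createTempFile already created the file, so allow overwriting it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[1 << 20]; // 1 MiB stand-in for a large document
        Path staged = stageToTempFile(new ByteArrayInputStream(sample));
        try {
            System.out.println("staged " + Files.size(staged) + " bytes at " + staged);
        } finally {
            Files.delete(staged); // the caller owns the temp file and cleans it up
        }
    }
}
```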

On Wed, May 16, 2012 at 4:08 PM, Nick Burch <[email protected]> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Memory consumption stays under 90 MB, which is less than the max heap
>> size (128 MB). No out-of-memory errors are thrown during the test.
>
>
> There is absolutely no way that you're going to be able to parse a PDF,
> DOC/DOCX or PPT/PPTX of more than about 20 MB in size on a 128 MB heap (and
> even that may be pushing it on some of them). Something is blowing up. I'd
> make sure you're not accidentally eating the exception.
>
> Nick
