Re: Tika fails to extract text from very large files

Nick Burch Wed, 16 May 2012 16:08:13 -0700

On Wed, 16 May 2012, Alec Swan wrote:

> Tika's parse() method is taking an InputStream as a parameter, so why
> does it consume so much memory? Can't it stage the file behind the
> scenes? Does Tika try to load the entire stream in memory all the
> time?

Not all file formats support stream-based parsing; many can only be
sensibly parsed in a DOM-like way. For those, the whole file needs to be
loaded into memory (and processed!) before the parser can work on it.
PDF, DOCX and friends are some of the formats for which this is the case.
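
To make the stream-based versus DOM-like distinction concrete, here is a minimal sketch using only the JDK's built-in XML parsers. It is an analogy, not Tika's own code: the class name StreamVsDomSketch, the helper methods and the <item> sample document are all made up for illustration. The streaming version sees one element at a time and keeps no tree, while the DOM version has to build the whole document in memory before it can answer anything.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; none of these names come from Tika.
public class StreamVsDomSketch {

    // Stream-based: the parser fires a callback per element and builds no tree,
    // so the parser itself only holds a small working buffer at any time.
    static int countItemsStreaming(InputStream in) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("item".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    // DOM-like: the whole document is turned into an in-memory tree first,
    // and only then can it be queried. Memory use grows with document size.
    static int countItemsDom(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        return doc.getElementsByTagName("item").getLength();
    }

    // Builds a small sample document; a real input would come from a file.
    static InputStream sample(int items) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < items; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        sb.append("</items>");
        return new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("streaming: " + countItemsStreaming(sample(1000)));
        System.out.println("dom:       " + countItemsDom(sample(1000)));
    }
}
```

Both methods return the same count; the difference is what has to live in memory while they work. A format that only makes sense once its whole structure is available forces the second shape, whatever kind of InputStream you hand the parser.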

Also, some parsers work better with a File, so if you're low on memory, try
using TikaInputStream.get(File); it may make a small difference.


Nick
