On Wed, 16 May 2012, Alec Swan wrote:
Tika's parse() method is taking an InputStream as a parameter, so why
does it consume so much memory? Can't it stage the file behind the
scenes? Does Tika try to load the entire stream in memory all the
time?

Not all file formats support stream-based parsing; many can only be sensibly parsed in a DOM-like way. For those, the whole file needs to be loaded into memory (and processed!) before the parser can work on it. PDF, DOCX and friends are some of the formats for which this is the case.
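To make the stream-based vs DOM-like distinction concrete, here is a rough sketch using only the JDK's own XML parsers. It is not how Tika's parsers are implemented; the class name ParseStyles and the sample XML are made up for illustration. Both methods give the same answer, but the SAX version handles each element as it arrives, while the DOM version has to build the whole tree in memory first.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of Tika.
class ParseStyles {

    // Stream-based: the SAX parser reports each element as it is read
    // and does not keep a tree of the document around.
    static int countWithSax(InputStream in, String tag) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    // DOM-like: the whole document is parsed into an in-memory tree
    // before any query can be run against it.
    static int countWithDom(InputStream in, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>one</p><p>two</p></doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(countWithSax(new ByteArrayInputStream(xml), "p"));
        System.out.println(countWithDom(new ByteArrayInputStream(xml), "p"));
    }
}
```

With a format that only offers the second style, passing an InputStream does not help: the parser still needs everything in memory before it can start.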

Also, some parsers work better with a File, so if you're low on memory, try using TikaInputStream.get(File). It may make a small difference.
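Something along these lines should do it. This is only a sketch: it assumes the Tika core and parser jars are on the classpath and a Tika version that still has TikaInputStream.get(File). The class name ParseFromFile and taking the path from args[0] are just for the example.

```java
import java.io.File;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; the file path is taken from the command line.
class ParseFromFile {
    public static void main(String[] args) throws Exception {
        File file = new File(args[0]);

        // -1 disables the handler's default limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);

        // Wrapping the File lets parsers that prefer file access use the
        // file on disk rather than working from a plain stream.
        try (InputStream stream = TikaInputStream.get(file)) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }

        System.out.println("Extracted characters: " + handler.toString().length());
    }
}
```

Note that the handler here still collects all the extracted text in memory, so for very large documents you may want a handler that writes the text out as it arrives instead.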

Nick
