On Wed, 16 May 2012, Alec Swan wrote:
Tika's parse() method is taking an InputStream as a parameter, so why does it consume so much memory? Can't it stage the file behind the scenes? Does Tika try to load the entire stream in memory all the time?
Not all file formats support stream-based parsing; many can only be sensibly parsed in a DOM-like way. For those, the whole file needs to be loaded into memory (and processed!) before the parser can work on it. PDF, DOCX and friends are some of the formats for which this is the case.
Also, some parsers work better with a File, so if you're low on memory, try using TikaInputStream.get(File). It may make a small difference.
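If all you have is an InputStream, one way to get a File to hand over is to spool the stream to a temporary file yourself. Below is a rough sketch using only the JDK. The class name StreamSpooler and the method spoolToTempFile are made up for illustration and are not part of Tika. The resulting file could then be passed to TikaInputStream.get(File). This does not change how much memory a DOM-like parser needs for the document itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of Tika.
class StreamSpooler {

    // Copies the whole stream to a temp file and returns its path.
    // Files.copy streams the bytes to disk, so the content is not
    // held in memory all at once by this step.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("tika-spool-", ".bin");
        try {
            // createTempFile already created the file, so allow overwrite.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Don't leave a partial file behind on failure.
            Files.deleteIfExists(tmp);
            throw e;
        }
        // Caller is responsible for deleting the file when done, e.g.
        // after parsing via TikaInputStream.get(tmp.toFile()).
        return tmp;
    }
}
```

Remember to delete the temp file once parsing has finished.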
Nick
