From the Tika documentation (lucene.apache.org/tika/documentation.html), I
read that Tika uses streamed parsing, and that "This allows even huge
documents to be parsed without excessive resource requirements."  But it
seems that my large xls file (240 megs) is being pulled completely into RAM,
which crashes when the heap is full.  The Tika class OfficeParser uses
org.apache.poi.poifs.filesystem.POIFSFileSystem, and in the debugger I see
the following line (source comment included) being executed in that class:  

     // read the rest of the stream into blocks
     data_blocks = new RawDataBlockList(stream, bigBlockSize);

It does indeed seem to be trying to read the entire 240 megs into blocks. 
Am I missing something?  My main motivation for using Tika is that it seemed
to offer a way to process large xls files without pulling them into memory.
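For contrast, the constant-memory behaviour I expected from streamed parsing looks roughly like this sketch (plain java.io, no Tika or POI classes; the class name and 8 KB buffer size are just illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Processes an input stream through a fixed-size buffer, so heap use
// stays constant regardless of how large the document is.
public class StreamCount {

    static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];  // only 8 KB live at any time
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a "large" document with an in-memory stream.
        InputStream in = new ByteArrayInputStream(new byte[1_000_000]);
        System.out.println(countBytes(in));
    }
}
```

The POIFSFileSystem constructor, by contrast, appears to allocate a RawDataBlock for every block in the file up front, which is where the whole 240 megs ends up on the heap.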

Thanks for any insights you can offer.
-- 
View this message in context: 
http://www.nabble.com/Large-xls-files-always-loaded-into-memory--tp23086987p23086987.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.