From the Tika documentation (lucene.apache.org/tika/documentation.html), I read that Tika uses streamed parsing and that "This allows even huge documents to be parsed without excessive resource requirements." But it seems that my large xls file (240 MB) is being pulled completely into RAM, which crashes once the heap is full. The Tika class OfficeParser uses org.apache.poi.poifs.filesystem.POIFSFileSystem, and in the debugger I see the following line (source comment included) being executed in that class:
    // read the rest of the stream into blocks
    data_blocks = new RawDataBlockList(stream, bigBlockSize);

It does indeed seem to be trying to read the entire 240 MB into blocks. Am I missing something? My main motivation for using Tika is that it seemed to offer a way to process large xls files without pulling them into memory.

Thanks for any insights you can offer.

--
View this message in context: http://www.nabble.com/Large-xls-files-always-loaded-into-memory--tp23086987p23086987.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
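P.S. For completeness, here is roughly the call pattern involved. This is only a sketch, assuming the three-argument Parser.parse(...) of the Tika 0.x API; the file name is made up, and it needs the tika and poi jars on the classpath:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.apache.tika.sax.BodyContentHandler;

public class LargeXlsParse {
    public static void main(String[] args) throws Exception {
        InputStream stream = new FileInputStream("large.xls"); // ~240 MB file (name hypothetical)
        try {
            // OfficeParser hands the stream to POIFSFileSystem, which (as
            // observed in the debugger) buffers the whole stream into a
            // RawDataBlockList before any content is emitted -- hence the
            // heap exhaustion despite the streamed-parsing API.
            new OfficeParser().parse(stream, new BodyContentHandler(), new Metadata());
        } finally {
            stream.close();
        }
    }
}
```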