Hi,

Sorry for the late reply.
On Thu, Apr 16, 2009 at 11:38 PM, Mark Barton2 <mark.bar...@redwood.com> wrote:

> From the Tika documentation (lucene.apache.org/tika/documentation.html),
> I read that Tika uses streamed parsing, and that "This allows even huge
> documents to be parsed without excessive resource requirements."

Yes, that's one of the key design criteria for the Tika Parser interface. However, not all of the parser implementations are fully compliant with this design goal yet.

> But it seems that my large xls file (240 megs) is being pulled completely
> into RAM, which crashes when the heap is full. The Tika class OfficeParser
> uses org.apache.poi.poifs.storage.POIFSFileSystem, and in the debugger I
> see the following line (source comment included) being executed in that
> class:
>
>     // read the rest of the stream into blocks
>     data_blocks = new RawDataBlockList(stream, bigBlockSize);
>
> It does indeed seem to be trying to read the entire 240 megs into blocks.

Yeah, that seems unfortunate. I'm not too familiar with the POI internals, but I was always under the impression that it would just keep a list of data block _references_ in memory and load the actual data only when needed. Maybe I'm mistaken.

Anyway, it would be good to contact the POI project for more input on this. We're already using the HSSF event API that is designed for streaming, but perhaps there are some extra options we should be using. Or then we simply need to fix this in POI. The "What's Next?" section in [1] mentions performance ("POI currently uses a lot of memory for large sheets") as an area of future improvement.

[1] http://poi.apache.org/spreadsheet/how-to.html

BR,

Jukka Zitting
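For illustration, here is a minimal sketch (plain JDK only, not the actual Tika or POI API; the class and method names are made up) of the block-at-a-time reading that the streamed-parsing design aims for: the stream is consumed in fixed 512-byte blocks, which is POI's classic big-block size, so memory use stays constant no matter how large the document is, instead of buffering everything up front the way RawDataBlockList appears to.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of bounded-memory block processing.
// Only one 512-byte block is ever held in memory at a time.
public class BlockStreamSketch {

    static final int BLOCK_SIZE = 512; // POI's classic big-block size

    // Reads the stream block by block and returns the number of
    // (possibly partial) blocks seen. A real parser would hand each
    // block to a handler here instead of discarding it.
    static int processInBlocks(InputStream in) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        int blocks = 0;
        while (in.read(block) != -1) {
            // ... dispatch the block to a streaming handler here ...
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a "large" document: 3 full blocks plus a partial one.
        byte[] fakeDocument = new byte[BLOCK_SIZE * 3 + 10];
        int n = processInBlocks(new ByteArrayInputStream(fakeDocument));
        System.out.println(n); // 4
    }
}
```

This is the same shape of processing the HSSF event API encourages: records are pushed to a listener as they are read, rather than the whole workbook being materialized in memory first.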