Hi,

Sorry for the late reply.
On Thu, Apr 16, 2009 at 11:38 PM, Mark Barton2 <mark.bar...@redwood.com> wrote:

> From the Tika documentation (lucene.apache.org/tika/documentation.html),
> I read that Tika uses streamed parsing, and that "This allows even huge
> documents to be parsed without excessive resource requirements."

Yes, that's one of the key design criteria for the Tika Parser interface. However, not all of the parser implementations are fully compliant with this design goal yet.

> But it seems that my large xls file (240 megs) is being pulled completely
> into RAM, which crashes when the heap is full. The Tika class OfficeParser
> uses org.apache.poi.poifs.storage.POIFSFileSystem, and in the debugger I
> see the following line (source comment included) being executed in that
> class:
>
>     // read the rest of the stream into blocks
>     data_blocks = new RawDataBlockList(stream, bigBlockSize);
>
> It does indeed seem to be trying to read the entire 240 megs into blocks.

Yeah, that seems unfortunate. I'm not too familiar with the POI internals, but I was always under the impression that it would just keep a list of data block _references_ in memory and load the actual data only when needed. Maybe I'm mistaken.

Anyway, it would be good to contact the POI project for more input on this. We're already using the HSSF event API that is designed for streaming, but perhaps there are some extra options we should be using. Or then we simply need to fix this in POI. The "What's Next?" section in [1] mentions performance ("POI currently uses a lot of memory for large sheets") as an area of future improvement.

[1] http://poi.apache.org/spreadsheet/how-to.html

BR,

Jukka Zitting
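For illustration, here is a minimal sketch (plain JDK only, not the actual Tika or POI API; the class and method names are made up) of the block-at-a-time reading that the streamed-parsing design aims for: the stream is consumed in fixed 512-byte blocks, which is POI's classic big-block size, so memory use stays constant no matter how large the document is, instead of buffering everything up front the way RawDataBlockList appears to.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of bounded-memory block processing.
// Only one 512-byte block is ever held in memory at a time.
public class BlockStreamSketch {

    static final int BLOCK_SIZE = 512; // POI's classic big-block size

    // Reads the stream block by block and returns the number of
    // (possibly partial) blocks seen. A real parser would hand each
    // block to a handler here instead of discarding it.
    static int processInBlocks(InputStream in) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        int blocks = 0;
        while (in.read(block) != -1) {
            // ... dispatch the block to a streaming handler here ...
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a "large" document: 3 full blocks plus a partial one.
        byte[] fakeDocument = new byte[BLOCK_SIZE * 3 + 10];
        int n = processInBlocks(new ByteArrayInputStream(fakeDocument));
        System.out.println(n); // 4
    }
}
```

This is the same shape of processing the HSSF event API encourages: records are pushed to a listener as they are read, rather than the whole workbook being materialized in memory first.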