Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Daniel Noll Mon, 28 Jan 2008 18:20:49 -0800

On Friday 25 January 2008 02:38:14 Nick Burch wrote:
> I did a bit, the core of which is now in svn as
> MissingRecordAwareHSSFListener


I discovered that, it's a great help for handling the blank cells, line 
endings and so forth while iterating through the cells.

> Is your formula related eventusermodel code in a format suitable for
> contributing back? It'd be handy to be able to put something in svn that
> would make dealing with the formula stuff much simpler. I'd be happy to
> spend a bit of time tidying it up / writing tests for it, if you could
> contribute it?

If I ever figure out how to handle it, I probably would contribute it back 
because it would mean changes to how shared formulas work.  At the moment as 
you say, it does require a Workbook.  At the moment I don't have a Workbook 
to work with.  Maybe I can store off the first however many records and then 
create the Workbook from those -- I haven't tried so I don't know what 
happens if you feed in a list of records without the ones which make up the 
read of the file.

> I think there was some talk a few years back, but nothing really came of
> it. The problem is that it'd take a large amount of programmer time, and
> memory seems to be fairly cheap.

Memory is indeed cheap, but unless you have the luxury of a 64-bit JVM, there 
is an upper limit of somewhere around 1.4GB, sometimes less.  This would 
normally be nearly 2GB but Windows allocates some DLLs in weird positions on 
some systems, and Sun insist on allocating a contiguous block of memory for 
the heap which sometimes causes a huge unusable memory hole above that.

"Normal" spreadsheets, where the number of cells isn't excessive, are not 
really a problem.  The problem is where some spreadsheet does have thousands 
of rows and/or dozens of columns.  Usually these will cause an OOME, but 
allocation which gets close to an OOME without causing one is actually more 
dangerous (some other thread suffers, too bad if it's something really 
important.)

> I'm not sure how that'd work though. If we don't hold the contents of the
> records in memory, then how are we going to be able to do anything with
> them? (Maybe I'm missing something in your suggestion though)

To convert an Excel spreadsheet to text (or another format), all you need to 
do is for each cell, store a text version somewhere (in a StringBuilder, in a 
temp file, etc.)  If you don't need to modify a cell then there is no reason 
to have it in memory.

In actual fact for us, something closer to RecordInputStream would be even 
better, where we can just say nextRecord() and have it return a properly 
constructed Record.  Then we have control over the loop, which is ideal when 
you need to return a Reader.

> My hunch is that we'll have a peak use of somewhere around 3-5 times the
> size of the excel file in memory, except for very small files. There'll be
> one copy of the file in poifs, another in hssf, then each record will take
> a copy as it parses itself.

There was one 40MB file which hit the 1GB memory limit.  It turns out the file 
had a huge number of cells per row, but opening the file showed most of them 
to be empty (they probably had styles or something on them which prompted 
HSSF to store something about it.)

Underlying issue here is that even if a cell doesn't exist, sometimes there is 
still memory allocated for it.  HSSFRow stores the cells in an array which 
means holes in the middle are still allocated a small amount of space.  And 
every HSSFCell holds references to many things.  All these eat up memory when 
you have a spreadsheet with a huge number of cells.

As far as the records keeping a copy, could they not instead keep an offset 
and a reference to the original buffer?  Then if someone calls a setter, it 
would need to create a new buffer, set the offset to 0 and copy the data 
before doing the actual set.

And as far as POIFS keeping a copy, yes... POIFS is full of issues like that.  
For instance, even if all you need to read is the CLSID, you still have to 
read the entire file.  If POIFSFileSystem could construct from a ByteBuffer 
and not take unnecessary copies, it could speed things up dramatically for 
that situation... but ultimately that would need to propagate to the whole 
framework for it to really show benefits.

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Reply via email to