Re: HSSF and XSSF memory usage, some numbers

Alex Geller Wed, 04 May 2011 09:31:56 -0700

Hi,

I would like to give some feedback after implementing a solution based on
the "BigGridDemo".
The good news is that it is fast enough, uses little memory and handles all
the decorational items we were worrying about (CellStyles, MergedRegions and
Images), just fine. A .xlsx file with 300,000 rows and 33 columns is
generated in 50 seconds using the default heap space of 64 MB.

The BigGridDemo should be improved to replace the "sheetData" element with
the real data rather then than replacing the complete file because
MergedRegions, Image references and other information such as margins are
otherwise lost on overwriting the file.

Since we need to continue supporting HSSF as well as XSSF we implemented an
API compatible streaming version of XSSF called SXSSF ("S" for streaming) so
that we wouldn't have to change the code and could have identical code for
both file formats (Apart from the "Workbook" construction of course (change
from "new XSSFWorkbook" to "new SXSSFWorkbook")).

The bad news is, that we worry a little about maintaining the code. There
are a number of reasons for this which I would like to state:

We need to change our code each time a new method is added to the interfaces
"Workbook", "Sheet", "Row" or "Cell" since we implemented those from scratch
(Workbook and Sheet possibly could have been subclassed instead making a
composition).

There is a principle fragility in the solution that is caused by the fact
that the BigGridDemo strategy makes assumptions on inner workings of POI
that could change from one version to another. What if for example the
XSSFCell class gets a new pakage method XSSFCell.setCellSpan(int rowspan,int
colspan) and the public method XSSFSheet.addMergedRegion() is implemented on
top of this? Since the BigGridDemo strategy doesn't create any rows or cells
then the merged regions would stop working after this change.

We have some open points which don't bother us for the moment because we are
using only string, numeric and date cell values. We are not using formulas,
rich text, hyperlinks and comments but we may in the future:

What cell types and values have to be written into the "c" elements of the
"sheetData" structure for the cell types BLANK, FORMULA and ERROR? How is
rich text written?

How are comments and hyperlinks written into the file?

What are the exact semantics of Sheet.shiftRows()? Why is the code in
HSSFSheet complicated? Is it perhaps because of formulas that are cell
relative that need to be recomputed in the new location?

What should the return value for Row.getZeroHeight() be if
Row.setZeroHeight() has not been called previously?

How are cell formulas parsed? How is the type determined? What is the
lifecycle of formulas? What is the precomputed value and when is is
precomputed? What is the difference between "precomputed" and "cached" and
when are values cached?

All in all, we are quite happy now but we would very much prefer if you
agreed to maintain this code instead of us, for the reasons mentioned above.
Maybe there are other users with similar problems that would appreciate a
transparent API compatible version of the BigGridDemo solution too.

Thanks again,

Alex

--
View this message in context:
http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4370377.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: HSSF and XSSF memory usage, some numbers

Reply via email to