On Friday 25 January 2008 02:38:14 Nick Burch wrote: > I did a bit, the core of which is now in svn as > MissingRecordAwareHSSFListener
I discovered that, it's a great help for handling the blank cells, line endings and so forth while iterating through the cells. > Is your formula related eventusermodel code in a format suitable for > contributing back? It'd be handy to be able to put something in svn that > would make dealing with the formula stuff much simpler. I'd be happy to > spend a bit of time tidying it up / writing tests for it, if you could > contribute it? If I ever figure out how to handle it, I probably would contribute it back because it would mean changes to how shared formulas work. At the moment as you say, it does require a Workbook. At the moment I don't have a Workbook to work with. Maybe I can store off the first however many records and then create the Workbook from those -- I haven't tried so I don't know what happens if you feed in a list of records without the ones which make up the read of the file. > I think there was some talk a few years back, but nothing really came of > it. The problem is that it'd take a large amount of programmer time, and > memory seems to be fairly cheap. Memory is indeed cheap, but unless you have the luxury of a 64-bit JVM, there is an upper limit of somewhere around 1.4GB, sometimes less. This would normally be nearly 2GB but Windows allocates some DLLs in weird positions on some systems, and Sun insist on allocating a contiguous block of memory for the heap which sometimes causes a huge unusable memory hole above that. "Normal" spreadsheets, where the number of cells isn't excessive, are not really a problem. The problem is where some spreadsheet does have thousands of rows and/or dozens of columns. Usually these will cause an OOME, but allocation which gets close to an OOME without causing one is actually more dangerous (some other thread suffers, too bad if it's something really important.) > I'm not sure how that'd work though. If we don't hold the contents of the > records in memory, then how are we going to be able to do anything with > them? (Maybe I'm missing something in your suggestion though) To convert an Excel spreadsheet to text (or another format), all you need to do is for each cell, store a text version somewhere (in a StringBuilder, in a temp file, etc.) If you don't need to modify a cell then there is no reason to have it in memory. In actual fact for us, something closer to RecordInputStream would be even better, where we can just say nextRecord() and have it return a properly constructed Record. Then we have control over the loop, which is ideal when you need to return a Reader. > My hunch is that we'll have a peak use of somewhere around 3-5 times the > size of the excel file in memory, except for very small files. There'll be > one copy of the file in poifs, another in hssf, then each record will take > a copy as it parses itself. There was one 40MB file which hit the 1GB memory limit. It turns out the file had a huge number of cells per row, but opening the file showed most of them to be empty (they probably had styles or something on them which prompted HSSF to store something about it.) Underlying issue here is that even if a cell doesn't exist, sometimes there is still memory allocated for it. HSSFRow stores the cells in an array which means holes in the middle are still allocated a small amount of space. And every HSSFCell holds references to many things. All these eat up memory when you have a spreadsheet with a huge number of cells. As far as the records keeping a copy, could they not instead keep an offset and a reference to the original buffer? Then if someone calls a setter, it would need to create a new buffer, set the offset to 0 and copy the data before doing the actual set. And as far as POIFS keeping a copy, yes... POIFS is full of issues like that. For instance, even if all you need to read is the CLSID, you still have to read the entire file. If POIFSFileSystem could construct from a ByteBuffer and not take unnecessary copies, it could speed things up dramatically for that situation... but ultimately that would need to propagate to the whole framework for it to really show benefits. Daniel --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
