On Saturday 09 February 2008 05:37:12 Nick Burch wrote:
> I've been doing some reading up on ByteBuffer, and was wondering:
>
> On Mon, 4 Feb 2008, Daniel Noll wrote:
> >   1. Lower memory usage due to not keeping a byte[] copy of all data at
> > the POIFS level.
>
> How would this work? Surely we'll still need to read all the bytes that
> make up the whole poifs stream, then pass those into our underlying
> ByteBuffer? I couldn't figure out a way to do it without processing the
> whole input stream at least once, since most input streams won't support
> zipping about to different places.
>
> >   2. If you don't ask for a DocumentInputStream for a given Document, the
> >      bytes don't even get read.  If you open a stream for a given
> > Document and only read the first part, the rest of the bytes don't even
> > get read.
>
> Again, not sure about that. I can see how we could possibly use a
> ByteBuffer to ensure we always use the same set of bytes in all the bits
> of poifs (and on up as required), but surely we'll still need to save the
> bytes of each DocumentInputStream, otherwise they'll be gone?

I don't follow.  Here's what I was thinking in more detail:

At the POIFS level:

  - The file loaded from disk is merely one big ByteBuffer. (easy)

  - A block in the file would be a ByteBuffer created as a view over the
    larger file ByteBuffer (easy; Java already allows this)

  - A document would be a composite ByteBuffer over the blocks which make
    it up (slightly less easy; it requires a custom ByteBuffer subclass to
    be written, but such a thing would be a useful utility and probably
    belongs in Commons, if not the JRE itself.)

  - A new kind of DocumentInputStream is created which creates a fresh
    copy of the ByteBuffer state and uses that to implement an
    InputStream. (easy; roughly sketched below)
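
For what it's worth, here's a rough sketch of the slicing and stream parts
(the composite buffer over non-contiguous blocks is the piece that still
needs the custom class).  The class and method names are made up, and
memory-mapping is just one way to get the big file buffer; reading into a
heap buffer would work equally well:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.RandomAccessFile;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;

  public class ByteBufferSketch {

      // Map the whole file as one big read-only ByteBuffer.  Note that
      // map() is limited to Integer.MAX_VALUE bytes -- the 2GB issue
      // discussed below.
      static ByteBuffer loadFile(String path) throws IOException {
          RandomAccessFile raf = new RandomAccessFile(path, "r");
          try {
              FileChannel channel = raf.getChannel();
              long size = channel.size();
              return channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
          } finally {
              raf.close();
          }
      }

      // A block is just a slice of the file buffer: no bytes are copied.
      static ByteBuffer block(ByteBuffer file, int blockIndex, int blockSize) {
          ByteBuffer dup = file.duplicate();   // independent position/limit
          dup.position(blockIndex * blockSize);
          dup.limit(dup.position() + blockSize);
          return dup.slice();                  // a view onto the same bytes
      }

      // An InputStream over a document's ByteBuffer.  duplicate() gives
      // the stream its own position, so reading it never disturbs other
      // readers of the same underlying bytes.
      static InputStream asStream(ByteBuffer document) {
          final ByteBuffer buf = document.duplicate();
          return new InputStream() {
              public int read() {
                  return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
              }
              public int read(byte[] dst, int off, int len) {
                  if (len == 0) return 0;
                  if (!buf.hasRemaining()) return -1;
                  int n = Math.min(len, buf.remaining());
                  buf.get(dst, off, n);
                  return n;
              }
          };
      }
  }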

With this, even if callers read every input stream, POIFS will use only
slightly more memory than whatever the callers store themselves.  The main
memory usage at the POIFS level would be the record of which block offsets
make up which documents, plus the directory tree information.

Of course, if someone writes to a document it's a different story.  You
would need to copy the data into a new ByteBuffer so as not to damage the
original file (unless you design it to write back to the original file --
probably harder.)
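
As an illustration only (names made up), something like this is what I
mean by creating a new ByteBuffer:

  import java.nio.ByteBuffer;

  class CopyOnWriteSketch {
      // Copy a document's bytes into a fresh, writable buffer so that the
      // original (read-only, possibly memory-mapped) file is untouched.
      static ByteBuffer copyForWriting(ByteBuffer original) {
          ByteBuffer copy = ByteBuffer.allocate(original.remaining());
          copy.put(original.duplicate());  // duplicate() leaves the source
                                           // position where it was
          copy.flip();                     // rewind to offset 0 for callers
          return copy;
      }
  }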

> > Of course the main beef I have with ByteBuffer is that it is limited to
> > Integer.MAX_VALUE size, but I guess with OLE2 this isn't, in practice,
> > going to be reached.  I imagine the maximum size for an OLE2 document is
> > somewhat lower, although I don't actually know.
>
> Nor do I, but I have a feeling it could well be 2GB too. Surely we have
> that 2GB limit already though, since we're reading the poifs data into a
> byte array, which has the same restriction?

True enough.

Daniel
