> However, I'm not seeing how it would be necessary on every append, since
> the topology wouldn't be changing during the build of a single chunk
> (correct me if I'm wrong).

A StringArray, for example, stores all its strings in a single data buffer,
one after the other. So after every append, the data buffer can grow by an
arbitrary amount.

If you say you're going to append `len` strings, they could all be empty
(the buffer grows by 0 bytes) or each around 1 MB (the buffer grows by
len * 1 MB). A ListArray has a similar problem: it stores all the elements
of the lists in the same child array. If that child array is a string array,
you're now two levels of uncertainty further from a size estimate.
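
To make that concrete for the flat case, here is a rough sketch (mine, not
anything prescribed by Arrow) that watches StringBuilder::value_data_length()
after each append and cuts a new chunk once a soft limit is crossed; the 4 MB
threshold and the function name are placeholders. Nested builders don't
expose their buffers this directly, which is exactly the problem above.

#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>

// Illustrative only: append strings, cutting a new chunk whenever the
// builder's value data buffer crosses a soft byte limit.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> BuildChunked(
    const std::vector<std::string>& values) {
  constexpr int64_t kSoftLimitBytes = 4 * 1024 * 1024;  // placeholder limit

  arrow::StringBuilder builder;
  arrow::ArrayVector chunks;
  for (const auto& v : values) {
    ARROW_RETURN_NOT_OK(builder.Append(v));
    // value_data_length() is the current byte size of the data buffer;
    // it can jump by an arbitrary amount on every append.
    if (builder.value_data_length() >= kSoftLimitBytes) {
      ARROW_ASSIGN_OR_RAISE(auto chunk, builder.Finish());  // resets builder
      chunks.push_back(std::move(chunk));
    }
  }
  if (builder.length() > 0) {
    ARROW_ASSIGN_OR_RAISE(auto chunk, builder.Finish());
    chunks.push_back(std::move(chunk));
  }
  return std::make_shared<arrow::ChunkedArray>(std::move(chunks),
                                               arrow::utf8());
}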

StringViewArray (a recent addition [1]) allows more flexible chunking of
the data buffers [2].
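
For a rough picture of why (paraphrasing the layout described in [2]; this
is not the actual Arrow C++ definition): every view element is a fixed 16
bytes, and long strings only point into one of possibly many data buffers,
so a builder can cap each data buffer and simply start a new one.

#include <cstdint>

// Sketch of the 16-byte view element from [2] (illustrative only).
union StringView {
  struct {                 // length <= 12: bytes stored inline in the view
    int32_t length;
    char inlined[12];
  } short_string;
  struct {                 // length > 12: bytes stored in a data buffer
    int32_t length;
    char prefix[4];        // first 4 bytes, for fast comparisons
    int32_t buffer_index;  // which variadic data buffer holds the string
    int32_t offset;        // byte offset within that buffer
  } long_string;
};
static_assert(sizeof(StringView) == 16, "views are always 16 bytes");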

--
Felipe

[1]
https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
[2]
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout



On Fri, Jul 5, 2024 at 1:35 PM Eric Jacobs <eric.jac...@collava.com> wrote:

> Felipe Oliveira Carvalho wrote:
> > Hi,
> > The builders can't really know the size of the buffers when nested
> > types are involved. The general solution would be an expensive
> > traversal of the entire tree of builders (e.g. struct builder of
> > nested column types like strings) on every append.
>
> I understand that the number and structure of the buffers used will be
> different depending on the datatype of the arrays, and I'm okay with
> doing a traversal of the builder tree to identify all of the buffers in
> use. However, I'm not seeing how it would be necessary on every append,
> since the topology wouldn't be changing during the build of a single chunk
> (correct me if I'm wrong). A re-traversal of the builder tree at a coarser
> granularity (e.g. between chunks) would be acceptable.
>
> > :
> > Also make sure you allow buffer sizes to exceed X MB, because if a single
> > string is bigger than X MB, you will *have to* violate this max buffer
> > constraint. It can only be a soft constraint in a robust solution.
> >
>
> If there's no way the constraint can be maintained under the Arrow
> in-memory format, my MemoryPool will raise an error, and in that case it
> just won't be supported here.
>
> Thanks,
> -Eric
>
> > --
> > Felipe
> >
> > On Thu, Jul 4, 2024 at 3:12 PM Eric Jacobs <eric.jac...@collava.com> wrote:
> >
> >     Hi,
> >     I would like to build a ChunkedArray but I need to limit the maximum
> >     size of each buffer (somewhere in the low MBs). Ending the current
> >     chunk and starting a new one is straightforward, but I'm having some
> >     difficulty detecting when the current buffer(s) are close to getting
> >     full. If I had the Builders I could check their length() as they go,
> >     but I'm not sure how I can get access to those while the ChunkedArray
> >     is being built via the API.
> >
> >     The size control doesn't have to be precise in my case; it just needs
> >     to be conservative as a limit (i.e. the builder cannot go over X MB).
> >
> >     Any advice would be appreciated.
> >     Thanks,
> >     -Eric
> >
> >
>
>
