> However, I'm not seeing how it would be necessary on every append since the topology wouldn't be changing during the build of a single chunk (correct me if I'm wrong).
A StringArray, for example, stores all the strings in a single buffer, one
after the other. So after every append, the size of the data buffer can go
anywhere: if you say you're going to append `len` strings, they could all be
empty (the buffer grows by 0 bytes) or each could be around 1 MB (the buffer
grows by len * 1 MB). There are similar problems with ListArray, which stores
all the elements of the lists in the same child array. If that child array is
a string array, you're now two orders of uncertainty further from a size
estimate. StringViewArray (a recent addition [1]) allows a more flexible
chunking of the data buffers [2].

--
Felipe

[1] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
[2] https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

On Fri, Jul 5, 2024 at 1:35 PM Eric Jacobs <eric.jac...@collava.com> wrote:
> Felipe Oliveira Carvalho wrote:
> > Hi,
> > The builders can't really know the size of the buffers when nested
> > types are involved. The general solution would be an expensive
> > traversal of the entire tree of builders (e.g. struct builder of
> > nested column types like strings) on every append.
>
> I understand that the number and structure of the buffers used will be
> different depending on the datatype of the arrays, and I'm okay with
> doing a traversal of the builder tree to identify all of the buffers in
> use. However, I'm not seeing how it would be necessary on every append,
> since the topology wouldn't be changing during the build of a single
> chunk (correct me if I'm wrong). A re-traversal of the builder tree at a
> wider granularity (e.g. in between chunks) would be acceptable.
>
> > :
> > Also make sure you allow length to be > 0 because if a single string
> > is bigger than X MB, you will *have to* violate this max buffer
> > constraint. It can only be a soft constraint in a robust solution.
>
> If there's no way that the constraint can be maintained as per the Arrow
> in-memory format, it will throw an error out from my MemoryPool, and in
> that case it just won't be supported here.
>
> Thanks,
> -Eric
>
> > --
> > Felipe
> >
> > On Thu, Jul 4, 2024 at 3:12 PM Eric Jacobs <eric.jac...@collava.com
> > <mailto:eric.jac...@collava.com>> wrote:
> >
> > Hi,
> > I would like to build a ChunkedArray but I need to limit the maximum
> > size of each buffer (somewhere in the low MBs). Ending the current
> > chunk and starting a new one is straightforward, but I'm having some
> > difficulty detecting when the current buffer(s) are close to getting
> > full. If I had the Builders I could check the length() as they are
> > going along, but I'm not sure how I can get access to those as
> > ChunkedArray is being built via the API.
> >
> > The size control doesn't have to be precise in my case; it just
> > needs to be conservative as a limit (i.e. the builder cannot go
> > over X MB).
> >
> > Any advice would be appreciated.
> > Thanks,
> > -Eric
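[Editor's note] Felipe's point about unpredictable buffer growth can be made concrete with a toy model of Arrow's variable-size binary layout (a 32-bit offsets buffer plus one shared data buffer). This is a simplified Python sketch; `ToyStringBuilder` is an illustrative stand-in, not the actual Arrow builder API:

```python
# Toy model of the variable-size binary layout: an offsets buffer
# (4 bytes per entry for the 32-bit variant) plus one shared data
# buffer holding all string bytes back to back.
class ToyStringBuilder:
    def __init__(self):
        self.offsets = [0]          # int32 offsets; one leading zero
        self.data = bytearray()     # all string bytes, concatenated

    def append(self, s: str) -> None:
        self.data += s.encode("utf-8")
        self.offsets.append(len(self.data))

    def buffer_bytes(self) -> int:
        # 4 bytes per offset entry + the raw string bytes
        return 4 * len(self.offsets) + len(self.data)

# Appending the same *number* of elements can grow the buffers by
# wildly different amounts, so element count alone predicts nothing.
small, big = ToyStringBuilder(), ToyStringBuilder()
for _ in range(1000):
    small.append("")            # data buffer grows by 0 bytes
    big.append("x" * 10_000)    # data buffer grows by 10 KB

print(small.buffer_bytes())     # 4004: offsets only
print(big.buffer_bytes())       # 10004004: dominated by string bytes
```

This is why a builder can only report its current buffer sizes after the fact, not predict them from a planned element count.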
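[Editor's note] Felipe's "soft constraint" caveat can also be sketched: a chunker that closes the current chunk before an append would push the accumulated data size past the limit, while still accepting a single oversize value as its own over-limit chunk. The helper name `chunk_strings` is hypothetical, not an Arrow API:

```python
def chunk_strings(strings, max_data_bytes):
    """Group strings into chunks whose total UTF-8 size stays under
    max_data_bytes -- except that a string larger than the limit by
    itself must still become its own oversize chunk, which is why the
    limit can only be a soft constraint."""
    chunks, current, current_bytes = [], [], 0
    for s in strings:
        n = len(s.encode("utf-8"))
        # Close the current chunk if this append would exceed the cap.
        if current and current_bytes + n > max_data_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(s)
        current_bytes += n
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_strings(["aa", "bb", "cc", "x" * 10, "dd"], max_data_bytes=5)
print(chunks)  # [['aa', 'bb'], ['cc'], ['xxxxxxxxxx'], ['dd']]
```

A real implementation would track the builders' accumulated buffer sizes (e.g. via a traversal of the builder tree between appends or between chunks, as discussed above) rather than re-encoding each value to measure it.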