Hey Jacob! Thanks so much for your response and explanation.
On Mon, Oct 12, 2020 at 9:13 PM Jacob Quinn <[email protected]> wrote:

> I'm not familiar with the internals of the pyarrow implementation, but as
> the primary author of the Arrow.jl Julia implementation, I think I can
> provide a little insight that's probably applicable.

Awesome! One of these days, I hope to get around to learning Julia :)

> The conceptual problem here is that the arrow format is immutable; arrow
> data is laid out in a fixed memory pattern. For the simplest column types
> (integer, float, etc.), this isn't inherently a problem because it's
> straightforward to allocate the fixed amount of memory, and then you could
> set the memory slots for array elements. For non-"primitive" types,
> however, it quickly becomes nontrivial. String columns, for example, are an
> array-of-array memory layout, where all the string bytes are laid out end
> to end, then individual column elements contain offsets into that fixed
> memory blob. That can be pretty tricky to 1) pre-allocate and 2) allow
> "setting" values afterward. You would need a way to allocate the exact # of
> bytes from all the strings, then chase the right indirections when setting
> values and generating the correct offsets.

This makes sense to me, but it wasn't clear to me from reading the PyArrow documentation, so I'll try to contribute something when I have some free time.

> So with all that said, my guess is that creating the NumPy arrays with your
> data and getting the data set first, then converting to the arrow format is
> indeed an acceptable workflow. Hopefully that helps.

This also makes sense to me given the currently available APIs.
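To make the layout Jacob describes concrete, here's a rough sketch in plain NumPy (this is an illustration of the idea, not pyarrow's actual internals): all string bytes live in one contiguous buffer, and a separate offsets array marks where each element begins and ends. Changing one string in place would shift every byte after it and invalidate all subsequent offsets, which is why "setting" values after the fact is tricky.

```python
import numpy as np

# The strings we want to store in an Arrow-style string column.
strings = ["foo", "barbaz", "", "qux"]

# One contiguous buffer holding all the string bytes laid out end to end...
data = np.frombuffer("".join(strings).encode("utf-8"), dtype=np.uint8)

# ...plus an offsets array: element i spans data[offsets[i]:offsets[i+1]].
offsets = np.zeros(len(strings) + 1, dtype=np.int32)
offsets[1:] = np.cumsum([len(s.encode("utf-8")) for s in strings])

def get(i):
    """Read element i back out of the immutable buffers."""
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
```

Here `offsets` ends up as [0, 3, 9, 9, 12], and the empty string at index 2 is simply a zero-length span.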
However, some of the documentation and blog posts I've read about Arrow mention the inefficiencies of converting to and from the pandas or NumPy formats - for instance, some representations, like the bitmask for null values, differ between the libraries. Given that, I wonder whether it might make sense to provide "builder" classes that can be used to construct arrays value by value, with a "freeze" method that returns the immutable array. The contract would be that users must not modify the data in the builder after invoking that build/freeze method. This is a common pattern in Java for constructing immutable data structures. I suppose that may be a question for the dev@ list, based on the guidance in the "Contributing to Arrow" documentation.

> -Jacob
>
> On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <[email protected]>
> wrote:
>
>> Hello there,
>>
>> I'm recording an a-priori known number of entries per column, and I want
>> to create a Table using these entries. I'm currently using numpy.empty to
>> pre-allocate empty arrays, then creating a Table from that via the
>> pyarrow.table(data={}) constructor.
>>
>> It seems a bit silly to create a bunch of NumPy arrays, only to convert
>> them to Arrow arrays to serialize. Is there any benefit to
>> creating/populating pyarrow.array() objects directly, and if so, how do I
>> do that? Otherwise, is the recommendation to first create a DataFrame in
>> pandas (or a number of NumPy arrays as I'm doing currently), then convert
>> to a Table?
>>
>> I think I want to have a way to create a fixed-size Table consisting of a
>> number of columns, then set the values for each column one by one (similar
>> to iloc/iat in pandas). Is this a sensible thing to try to do?
>>
>> Best,
>>
>> Jonathan
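For what it's worth, here's a minimal sketch of the builder/freeze contract I have in mind. To be clear, this is purely hypothetical Python - the class name and methods are mine, not an existing pyarrow API - and it freezes into the (data bytes, offsets) pair rather than a real Arrow array:

```python
class StringArrayBuilder:
    """Hypothetical builder: append values one by one, then freeze.

    After freeze() the builder rejects further appends, mirroring the
    immutability contract of the frozen result.
    """

    def __init__(self):
        self._values = []
        self._frozen = False

    def append(self, value: str):
        if self._frozen:
            raise RuntimeError("builder has been frozen; no further appends")
        self._values.append(value)

    def freeze(self):
        # Materialize the Arrow-style layout: one contiguous byte buffer
        # plus an offsets list, computed only once all values are known.
        self._frozen = True
        data = "".join(self._values).encode("utf-8")
        offsets = [0]
        for v in self._values:
            offsets.append(offsets[-1] + len(v.encode("utf-8")))
        return data, offsets


builder = StringArrayBuilder()
builder.append("hi")
builder.append("world")
data, offsets = builder.freeze()
```

As I understand it, the Arrow C++ core does have builder classes along these lines (e.g. arrow::StringBuilder), though as far as I can tell they aren't exposed through the Python bindings.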
