Hello,

You can indeed use NumPy arrays to construct an Arrow array without the need to copy any data. The important aspect here is to treat these NumPy arrays simply as plain memory allocations. You use them to construct the separate memory buffers (i.e. the validity bitmap and data buffers) and then pass those to pyarrow.Array.from_buffers. One example of this can be seen here: https://github.com/xhochy/fletcher/blob/467054f0954c5c5796c6a5ced10891ce0e02e064/fletcher/algorithms/bool.py#L397-L407 (the context is a library where I construct a lot of pyarrow.Array instances from Python using Numba). Your underlying memory can be mutable as long as you keep it in the form of the Arrow memory format and treat it as immutable once you have passed your data structure to a third-party library/function.
Best,
Uwe

> On 13.10.2020 at 22:00, Jonathan Yu <[email protected]> wrote:
>
> Hey Jacob!
>
> Thanks so much for your response and explanation.
>
> On Mon, Oct 12, 2020 at 9:13 PM Jacob Quinn <[email protected]> wrote:
> I'm not familiar with the internals of the pyarrow implementation, but as
> the primary author of the Arrow.jl Julia implementation, I think I can
> provide a little insight that's probably applicable.
>
> Awesome! One of these days, I hope to get around to learning Julia :)
>
> The conceptual problem here is that the Arrow format is immutable; Arrow
> data is laid out in a fixed memory pattern. For the simplest column types
> (integer, float, etc.), this isn't inherently a problem, because it's
> straightforward to allocate the fixed amount of memory and then set the
> memory slots for array elements. For non-"primitive" types, however, it
> quickly becomes nontrivial. String columns, for example, use an
> array-of-arrays memory layout, where all the string bytes are laid out end
> to end, and individual column elements contain offsets into that fixed
> memory blob. That can be pretty tricky to 1) pre-allocate and 2) allow
> "setting" values afterward. You would need a way to allocate the exact
> number of bytes for all the strings, then chase the right indirections
> when setting values and generating the correct offsets.
>
> This makes sense to me, but wasn't clear to me from reading the
> documentation for PyArrow, so I'll try to contribute something when I have
> some free time.
>
> So with all that said, my guess is that creating the NumPy arrays with
> your data and getting the data set first, then converting to the Arrow
> format, is indeed an acceptable workflow. Hopefully that helps.
>
> This also makes sense to me given the currently available APIs.
> However, since some of the documentation/blogs I've read about Arrow make
> reference to some of the inefficiencies of converting to and from pandas
> or NumPy formats - namely that some of the representations, like the
> bitmask for null values, are different - I wonder whether it might make
> sense to provide "builder" classes that can be used temporarily to
> construct arrays value by value, with a "freeze" method that would then
> return the immutable arrays? The contract would be that users would not be
> allowed to modify the data in the builder class after invoking that
> build/freeze function. This is at least a common pattern in Java to make
> it easier to build immutable data structures.
>
> I suppose that may be a question for the dev@ list, based on the guidance
> in the "Contributing to Arrow" documentation.
>
> -Jacob
>
> On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <[email protected]> wrote:
> Hello there,
>
> I'm recording an a-priori known number of entries per column, and I want
> to create a Table using these entries. I'm currently using numpy.empty to
> pre-allocate empty arrays, then creating a Table from that via the
> pyarrow.table(data={}) constructor.
>
> It seems a bit silly to create a bunch of NumPy arrays, only to convert
> them to Arrow arrays to serialize. Is there any benefit to
> creating/populating pyarrow.array() objects directly, and if so, how do I
> do that? Otherwise, is the recommendation to first create a DataFrame in
> pandas (or a number of NumPy arrays as I'm doing currently), then convert
> to a Table?
>
> I think I want to have a way to create a fixed-size Table consisting of a
> number of columns, then set the values for each column one by one (similar
> to iloc/iat in pandas). Is this a sensible thing to try to do?
>
> Best,
>
> Jonathan
