Re: Pandas Block Manager

2021-02-01 Thread Joris Van den Bossche
I am also planning to actively work on this on the pandas side the coming month. Having early feedback on this work will be valuable. Joris On Thu, 28 Jan 2021 at 18:51, Wes McKinney wrote: > My position on this is that we should work with the pandas community > to work toward elimination of th

Re: Pandas Block Manager

2021-01-28 Thread Wes McKinney
My position on this is that we should work with the pandas community to work toward elimination of the BlockManager data structure as this will solve a multitude of problems and also make things better for Arrow. I am not supportive of the IPC format changes in the PR. On Wed, Jan 27, 2021 at 6:27

Re: Pandas Block Manager

2021-01-27 Thread Nicholas White
Hi all - just pinging this thread given the later discussions on the PR. I am proposing a backwards-compatible (but not forwards-compatible) change to the spec to strike out this line: When serializing Arrow data for interprocess communication, these alignment and pa

Re: Pandas Block Manager

2020-11-13 Thread Joris Van den Bossche
As Micah and Wes pointed out on the PR, the alignment and padding are requirements of the format specification. For reference, see here: https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding That's also the reason that I said earlier in this thread that such zero-copy convers

Re: Pandas Block Manager

2020-11-12 Thread Micah Kornfield
Hi Nicholas, I don't think allowing flexibility for types that are not 8-byte aligned is a good idea. The specification explicitly calls out the alignment requirements, and allowing writers to output different non-aligned values potentially breaks other implementations. I'm not sure of your exact us

Re: Pandas Block Manager

2020-11-12 Thread Nicholas White
OK got everything to work, https://github.com/apache/arrow/pull/8644 (part of ARROW-10573 now) is ready for review. I've updated the test case to show it is possible to zero-copy a pandas DataFrame! The next step is to dig into `arrow_to_pandas.cc` to make it work automagically... On Wed, 11 Nov 2

Re: Pandas Block Manager

2020-11-11 Thread Nicholas White
Thanks all, this has been interesting. I've made a patch that sort-of does what I want[1] - I hope the test case is clear! I made the batch writer use the `alignment` field that was already in the `IpcWriteOptions` to align the buffers, instead of fixing their alignment at 8. Arrow then writes out

Re: Pandas Block Manager

2020-11-11 Thread Joris Van den Bossche
Hi Weston, When starting with a 2D ndarray, the conversion from numpy to pandas DataFrame (`pd.DataFrame(arr)`) is actually zero copy. But, pandas takes a transposed view on the original array (that's the reason the C contiguous flag changes), to ensure the columns are the first dimension of the st
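
The transposed-view behaviour described in this message can be illustrated with NumPy alone. This is a minimal sketch and not pandas internals: the names `arr` and `view` are illustrative, and `arr.T` merely stands in for the transposed view that, per the message, pandas takes on the original array.

```python
import numpy as np

# A 5x5 C-contiguous array, standing in for the 2D ndarray in the thread.
arr = np.arange(25, dtype=np.int64).reshape(5, 5)

# Transposing returns a view on the same memory; no data is copied.
view = arr.T

# True when both arrays are backed by overlapping memory.
shares = np.shares_memory(arr, view)

# A write through the view is visible in the original array.
view[0, 1] = 99
written_through = arr[1, 0] == 99

print(shares, written_through)
print(arr.flags.c_contiguous, view.flags.c_contiguous, view.flags.f_contiguous)
```

The view is no longer C contiguous even though no bytes moved, which is consistent with the flag change discussed in the thread. Whether `pd.DataFrame(arr)` itself avoids a copy may depend on the pandas version and its copy settings, so that step is deliberately left out of the sketch.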

Re: Pandas Block Manager

2020-11-11 Thread Joris Van den Bossche
On Wed, 11 Nov 2020 at 00:52, Micah Kornfield wrote: > > Sorry, I should clarify, I'm not familiar with zero copy from Pandas to > Arrow, so there might be something else going on here. But once an arrow > file is written out, buffers will be padded/aligned to 8 bytes. > > In general, I think rel

Re: Pandas Block Manager

2020-11-11 Thread Weston Pace
Nick, it appears converting the ndarray to a dataframe clears the contiguous flag even though it doesn't actually change the underlying array. At least, this is what I'm seeing with my testing. My guess is this is what is causing arrow to do a copy (arrow is indeed doing a new allocation here, th

Re: Pandas Block Manager

2020-11-10 Thread Micah Kornfield
Sorry, I should clarify, I'm not familiar with zero copy from Pandas to Arrow, so there might be something else going on here. But once an arrow file is written out, buffers will be padded/aligned to 8 bytes. In general, I think relying on exact memory translation from systems that aren't used ar

Re: Pandas Block Manager

2020-11-10 Thread Micah Kornfield
> > My question is: why are these addresses not 40 bytes apart from each other? > What's in the gaps between the buffers? It's not null bitsets - there's > only one buffer for each column. Thanks - All buffers are padded to at least 8 bytes (and per the spec 64 is recommended). On Tue, Nov 10, 2
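
The padding arithmetic behind this answer can be sketched as follows. This only illustrates rounding buffer sizes up to an alignment boundary: `padded_size` and `buffer_offsets` are hypothetical helpers, not Arrow APIs, and the back-to-back layout is an assumption made for illustration rather than a description of what Arrow's allocator actually does.

```python
def padded_size(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (nbytes + alignment - 1) & ~(alignment - 1)


def buffer_offsets(sizes, alignment=8):
    """Start offsets of buffers placed back to back, each padded to alignment."""
    offsets, position = [], 0
    for nbytes in sizes:
        offsets.append(position)
        position += padded_size(nbytes, alignment)
    return offsets


# Five columns of five 8-byte integers each: 40 bytes of data per buffer.
sizes = [40] * 5
offsets_8 = buffer_offsets(sizes, alignment=8)
offsets_64 = buffer_offsets(sizes, alignment=64)
print(offsets_8)
print(offsets_64)
```

Under these assumptions, 40-byte buffers sit 40 bytes apart with 8-byte padding (40 is already a multiple of 8) but 64 bytes apart with the recommended 64-byte padding, which is one way gaps between buffer addresses can arise.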

Re: Pandas Block Manager

2020-11-10 Thread Nicholas White
I've done a bit more digging. This code:

    df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
    table = pa.Table.from_pandas(df)
    mem = []
    for c in table.columns:
        buf = c.chunks[0].buffers()[1]
        mem.append((buf.address, buf.size))
    sorted(mem)

...prints...

    [(140262915478912, 40)

Pandas Block Manager

2020-11-08 Thread Nicholas White
Hi - I've been looking through the Arrow specification format to look for ways to allow zero-copy creation of Pandas DataFrames (beyond `split_blocks`). Am I right in thinking that if you created an Arrow file (let's say of `m` rows and `n` columns of `float64`s for now) as a single RecordBatch the