Hi Niranda,

On Mon, Jan 24, 2022 at 2:41 PM Niranda Perera <[email protected]>
wrote:

> Did you try using `pyarrow.compute` options? Inside that batch iterator
> loop you can call the compute mean function and then call the add_column
> method for record batches.
>

I cannot find a way to pass multiple columns to a pyarrow.compute function
for aggregation. As far as I understand, pyarrow.compute functions only
accept a single 1D pyarrow.Array as input. Maybe you had something else in
mind.

Besides, I don't see any add_column or append_column method on
pyarrow.RecordBatch [1].

[1] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
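The closest workaround I can come up with for the missing method is
rebuilding the batch from its arrays plus the new column, along these lines
(just a sketch; the append_column helper name is mine):

    import pyarrow as pa

    def append_column(batch, name, array):
        # Rebuild the RecordBatch with one extra column, since RecordBatch
        # itself exposes no add_column/append_column.
        return pa.RecordBatch.from_arrays(
            batch.columns + [array],
            names=batch.schema.names + [name],
        )

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0, 2.0, 3.0]), pa.array([4.0, 5.0, 6.0])],
        names=["a", "b"],
    )
    new_batch = append_column(batch, "row_mean", pa.array([2.5, 3.5, 4.5]))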

The only solution I see is calling the compute function for each row of the
RecordBatch (transforming each row into a pyarrow.Array somehow), but that
would be quite inefficient. By contrast, pandas can compute the aggregation
across columns in a vectorized way (at the additional cost of the
pyarrow <-> pandas roundtrip conversion).
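Concretely, the pandas roundtrip I have in mind looks like this (a sketch;
the "row_mean" name and the column selection are only for illustration):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0, 2.0, 3.0]), pa.array([4.0, 5.0, 6.0])],
        names=["a", "b"],
    )

    # pyarrow -> pandas, vectorized row-wise mean, then back to pyarrow.
    df = batch.to_pandas()
    row_mean = pa.array(df[["a", "b"]].mean(axis=1))  # [2.5, 3.5, 4.5]

    new_batch = pa.RecordBatch.from_arrays(
        batch.columns + [row_mean],
        names=batch.schema.names + ["row_mean"],
    )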

> The latest arrow code base might have support for 'projection', which
> could do this without having to iterate through record batches. @Weston
> Pace <[email protected]> WDYT?
>

If this is possible it would be great!

Best,
Antonio
