Re: [Python] add a new column to a table during dataset consolidation

Niranda Perera Mon, 24 Jan 2022 08:19:09 -0800

Hi Antonio,
Sorry, I think I misunderstood your question. You are looking for a row-wise
mean, aren't you? I don't think pyarrow.compute has an API for that.
Sorry, my bad.
You could call `add` for each column and build the mean manually. This would
still be a vectorized, column-wise operation, but AFAIU it would create at
least 2 additional allocations the length of the batch, because Arrow arrays
are immutable and each compute call returns a new array.
I wasn't aware that the pyarrow API didn't have an add_column method for
record batches (sorry again!). It's available in the C++ API. You can work
around that too, though: simply create a list from the existing columns,
append the new one, and build a new record batch from it.
The following would be my suggestion (not tested). But I agree, this is not as
pretty as the pandas solution! :-)
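```
import pyarrow as pa
import pyarrow.compute as pc

def calc_mean(batch, cols):
    # Row-wise mean of the columns named in `cols`.
    # Cast to float64 first so integer columns neither overflow nor
    # get integer division, and the result matches the 'mean' field.
    res = batch[cols[0]].cast(pa.float64())

    # Accumulate the remaining columns; each add returns a new array.
    for c in cols[1:]:
        res = pc.add(res, batch[c])

    return pc.divide(res, float(len(cols)))

...

for batch in scanner.to_batches():
    # batch.columns is a plain list of arrays, so it can be extended
    new_cols = batch.columns
    new_cols.append(calc_mean(batch, cols))

    new_batch = pa.record_batch(data=new_cols,
        schema=batch.schema.append(pa.field('mean', pa.float64())))
    ...
```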



On Mon, Jan 24, 2022 at 9:11 AM Antonino Ingargiola <[email protected]>
wrote:

> Hi Niranda,
>
> On Mon, Jan 24, 2022 at 2:41 PM Niranda Perera <[email protected]>
> wrote:
>
>> Did you try using `pyarrow.compute` options? Inside that batch iterator
>> loop you can call the compute mean function and then call the add_column
>> method for record batches.
>>
>
> I cannot find how to pass multiple columns to be aggregated to
> pyarrow.compute functions. As far as I understand pyarrow.compute functions
> only accept a single 1D pyarrow.array as input. Maybe you had something
> else in mind.
>
> Besides, I don't see any add_column or append_column method for
> pyarrow.RecordBatch[1]
>
> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
>
> The only solution I see is calling the compute function for each row of
> the RecordBatch (transforming each row to a pyarrow.array somehow). But
> this would be quite inefficient. On the contrary, pandas can compute the
> aggregation across columns in a vectorized way (at the additional cost of
> pyarrow <-> pandas roundtrip conversion).
>
>> In the latest arrow code base might have support for 'projection', that
>> could do this without having to iterate through record batches. @Weston
>> Pace <[email protected]> WDYT?
>>
>
> If this is possible it would be great!
>
> Best,
> Antonio
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
