Hi Niranda,

On Mon, Jan 24, 2022 at 2:41 PM Niranda Perera <[email protected]> wrote:
> Did you try using `pyarrow.compute` options? Inside that batch iterator
> loop you can call the compute mean function and then call the add_column
> method for record batches.

I cannot find a way to pass multiple columns to be aggregated to pyarrow.compute functions. As far as I understand, pyarrow.compute functions only accept a single 1D pyarrow.array as input. Maybe you had something else in mind. Besides, I don't see any add_column or append_column method on pyarrow.RecordBatch [1].

[1] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html

The only solution I see is calling the compute function once per row of the RecordBatch (somehow turning each row into a pyarrow.array), but that would be quite inefficient. pandas, on the other hand, can compute the aggregation across columns in a vectorized way (at the additional cost of the pyarrow <-> pandas roundtrip conversion); see the sketch below my signature.

> The latest arrow code base might have support for 'projection', that
> could do this without having to iterate through record batches. @Weston
> Pace <[email protected]> WDYT?

If this is possible, it would be great!

Best,
Antonio
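
P.S. For concreteness, here is roughly what I do today via the pandas roundtrip; the column names ("a", "b", "c", "row_mean") are just made up for the example:

    import pyarrow as pa

    # Hypothetical batch with a few numeric columns.
    batch = pa.RecordBatch.from_pydict({
        "a": [1.0, 2.0, 3.0],
        "b": [4.0, 5.0, 6.0],
        "c": [7.0, 8.0, 9.0],
    })

    # pyarrow -> pandas roundtrip: the row-wise mean across columns is
    # vectorized in pandas, then the result goes back to Arrow.
    df = batch.to_pandas()
    df["row_mean"] = df[["a", "b", "c"]].mean(axis=1)
    batch_with_mean = pa.RecordBatch.from_pandas(df, preserve_index=False)

It works, but the to_pandas/from_pandas conversion is the overhead I would like to avoid.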
