We are working on implementing a streaming aggregation that will be exposed
in Python, but it probably won’t be available until the 5.0 release. I am
not sure this problem can be solved efficiently at 100GB scale with the
tools currently available in pyarrow.

On Wed, Apr 7, 2021 at 12:41 PM Suresh V <[email protected]> wrote:

> Hi .. I am trying to compute aggregates on large datasets (100GB) stored
> in parquet format. My current approach is to use scan/fragment to load
> chunks iteratively into memory, and I would like to run the equivalent of
> the following on each chunk using pc.compute functions
>
> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>
> My understanding is that pc.compute needs to scan the entire array for
> each of the functions. Please let me know if that is not the case and how
> to optimize it.
>
> Thanks
>
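For reference, here is a rough sketch of the chunk-and-merge approach
described above, streaming record batches with the dataset API and doing the
per-chunk group-by in pandas. The path, the single value column "x", and the
column names are placeholders, and it assumes the set of distinct (a, b, c)
groups fits in memory:

import pandas as pd
import pyarrow.dataset as ds

# Placeholder path and column names; "x" stands in for the value column.
dataset = ds.dataset("data/", format="parquet")

partials = []
for batch in dataset.to_batches(columns=["a", "b", "c", "x"]):
    # Aggregate each record batch with pandas, keeping only the partial result.
    chunk = batch.to_pandas()
    partials.append(
        chunk.groupby(["a", "b", "c"])["x"].agg(["sum", "count", "min", "max"])
    )

# Merge the partial aggregates: sums and counts add up across chunks,
# while min/max reduce with min/max.
combined = pd.concat(partials)
result = combined.groupby(level=["a", "b", "c"]).agg(
    sum=("sum", "sum"),
    count=("count", "sum"),
    min=("min", "min"),
    max=("max", "max"),
)
print(result)

This works because sum, count, min, and max are all decomposable, so the
per-chunk partials can be merged exactly; something like mean would instead
need to be derived from the merged sum and count at the end.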
