We are working on implementing streaming aggregation that will be exposed in Python, but it probably won't land until the 5.0 release. I am not sure this problem can be solved efficiently at 100GB scale with the tools currently available in pyarrow.
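In the meantime, one possible workaround is to compute partial aggregates per fragment and then combine them, since sum, count, min, and max all combine associatively across chunks. Below is a minimal sketch of that idea, assuming the per-group partial results fit in memory; the dataset path and the value column "x" are placeholders, not anything from your setup, and the final combine step uses pandas rather than pyarrow.compute.

import pandas as pd
import pyarrow.dataset as ds

# Hypothetical parquet dataset path; columns 'a', 'b', 'c' are the group keys
# and 'x' is an illustrative value column.
dataset = ds.dataset("data/", format="parquet")

partials = []
for fragment in dataset.get_fragments():
    chunk = fragment.to_table(columns=["a", "b", "c", "x"]).to_pandas()
    # Per-fragment partial aggregates.
    partials.append(
        chunk.groupby(["a", "b", "c"])["x"].agg(["sum", "count", "min", "max"])
    )

# Combine the partials: sums and counts add up, mins take the min, maxes the max.
combined = (
    pd.concat(partials)
    .groupby(level=[0, 1, 2])
    .agg({"sum": "sum", "count": "sum", "min": "min", "max": "max"})
)

This avoids materializing the full 100GB at once, but it still pays the cost of converting each chunk to pandas, so it is a stopgap rather than a substitute for the planned streaming aggregation.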
On Wed, Apr 7, 2021 at 12:41 PM Suresh V <[email protected]> wrote:
> Hi .. I am trying to compute aggregates on large datasets (100GB) stored
> in parquet format. The current approach is to use scan/fragment to load
> chunks iteratively into memory, and I would like to run the equivalent of
> the following on each chunk using pc.compute functions:
>
> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>
> My understanding is that pc.compute needs to scan the entire array for
> each of the functions. Please let me know if that is not the case and how
> to optimize it.
>
> Thanks
