> Note that there are a few aggregates that cannot be distributed (e.g. median).

You can distribute/parallelize based on the group-by key, but yes, in general these so-called holistic aggregates (i.e. aggregates that must look at the entire partition) can’t be distributed within a key without a lot of synchronization overhead. As a result, you can’t avoid shuffling in the general case. You can skip shuffling only if you can guarantee that your data is already partitioned by the group-by key.
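To make the distinction concrete, here is a rough sketch in plain Python (not Arrow/Acero code; the data and the helper names `partial_mean` and `merge_mean` are made up for illustration). It assumes a single group-by key whose rows ended up in two partitions. A mean can be merged from small per-partition states, while combining per-partition medians does not, in general, give the true median.

```python
from statistics import median

# Hypothetical data: rows for ONE group-by key, spread over two partitions.
partitions = [[1, 2, 100], [3, 4, 5, 6]]
all_rows = [x for part in partitions for x in part]

def partial_mean(rows):
    """Per-partition intermediate state for mean: (sum, count)."""
    return (sum(rows), len(rows))

def merge_mean(states):
    """Merge the per-partition states and finalize the mean."""
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    return total / count

# Mean: merging the small partial states matches the single-node answer.
distributed_mean = merge_mean([partial_mean(p) for p in partitions])
global_mean = sum(all_rows) / len(all_rows)

# Median: there is no small partial state to merge. Taking the median of
# the per-partition medians is NOT the median of all rows for this key.
median_of_medians = median([median(p) for p in partitions])
global_median = median(all_rows)

print(distributed_mean == global_mean)    # mean merges cleanly
print(median_of_medians, global_median)   # the two medians differ
```

Getting the exact median for this key means bringing all of its rows together, which is the shuffle mentioned above.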
> I'm not sure I understand the value in storing the intermediate result as a
> materialized view instead of just storing the finalized computation.

I think this is a technique called “pre-aggregation”. I guess the value is to be able to insert rows into your partitions independently without having to synchronize on the finalized output.

Sasha

> On Jul 7, 2023, at 10:45 AM, Weston Pace <[email protected]> wrote:
>
> Note that there are a few aggregates that cannot be distributed (e.g. median).
