> Note that there are a few aggregates that cannot be distributed (e.g. median).
You can distribute/parallelize based on the group-by key, but yes in general 
these so-called Holistic Aggregates (i.e. aggregates that must look at the 
entire partition) can’t be distributed within a key without a lot of 
synchronization overhead.  As a result you can’t avoid shuffling in the general 
case. You can skip shuffling if you can guarantee that your data is partitioned 
by the group-by key. 

> I'm not sure I understand the value in storing the intermediate result as a 
> materialized view instead of just storing the finalized computation.

I think this is a technique called “pre-aggregation”. I guess the value is to 
be able to insert rows into your partitions independently without having to 
synchronize on the finalized output. 

Sasha


> On Jul 7, 2023, at 10:45 AM, Weston Pace <[email protected]> wrote:
> 
> Note that there are a few aggregates that cannot be distributed (e.g. median).

Reply via email to