Re: [C++][Acero] can Acero support distributed computation？

Sasha Krassovsky Thu, 06 Jul 2023 20:07:48 -0700

Distributed aggregations have two phases: “compute” (each node does its own 
aggregation) and “merge” (the master merged the partial results). For most 
aggregates (e.g. Sum, Min, Max), merge is just the same as compute, except over 
the partial results. However, for Mean in particular, the compute phase will 
actually be sum and count separately, and the merge phase will be sum of 
partial sums divided by sum of partial counts. Count(distinct) will be merged 
by Sum as well. There may be other special cases I’m not thinking of, that’s 
the main idea.


Sasha 

> 6 июля 2023 г., в 19:29, Jiangtao Peng <[email protected]> написал(а):
> 
> 
> Hi Sasha,
>  
> Thanks for your reply!
>  
> Maybe Shuffle node is enough for data distribution. How about aggregation 
> node? For example, mean kernel with group will maintain sum and count value 
> for each group. Mater node merging mean result needs sum and count value on 
> each partition. But mean kernel seems not support method to export such sum 
> and count value, also, mean kernel doesn't support method to load these sum 
> and count value for additional merge. If it is feasible to provide 
> export/load method on aggregation kernel? Any other tips would be appreciated.
>  
> Thanks,
> Jiangtao
>  
> From: Sasha Krassovsky <[email protected]>
> Date: Friday, July 7, 2023 at 10:12 AM
> To: [email protected] <[email protected]>
> Subject: Re: [C++][Acero] can Acero support distributed computation？
> 
> Hi Jiangtao,
> Acero doesn’t support any distributed computation on its own. However, to get 
> some simple distributed computation going it would be sufficient to add a 
> Shuffle node. For example for Aggregation, the Shuffle would assign a range 
> of hashes to each node, and then each node would hash-partition its batches 
> locally and send each partition to be aggregated on the corresponding nodes. 
> You’d also need a master node to merge the results afterwards. 
>  
> In general the Shuffle-by-hash scheme works well for relational queries where 
> order doesn’t matter, but the time series functionality (i.e. as-of-join) 
> wouldn’t work as well. 
>  
> Hope this helps! 
> Sasha Krassovsky 
> 
> 
> 6 июля 2023 г., в 19:04, Jiangtao Peng <[email protected]> написал(а):
> 
> 
> Hi there,
> 
> I'm learning Acero streaming execution engine recently. And I’m wondering if 
> Acero support distributed computing.
>  
> I have read code about aggregation node and kernel; Aggregation kernel seems 
> to hide the details of aggregation middle state. If use multiple nodes with 
> Acero execution engine, how to split aggregation tasks?
>  
> If current execution engine does not support distributed computing, taking 
> aggregation as an example, how would you plan to transform the aggregate 
> kernel to support distributed computation?
>  
> Any help or tips would be appreciated.
> 
> Thanks,
> Jiangtao

Re: [C++][Acero] can Acero support distributed computation？

Reply via email to