Hi there,

I'm trying to determine optimal resource specifications for a Beam
PTransform which uses CoGroupByKey for large input collections.

I was wondering how exactly large CoGroupByKey jobs are handled in
Dataflow, specifically how the work is scaled and parallelized across
workers.

I read that for GroupByKey, parallelization is limited by the number of
keys (so, ignoring other factors, the maximum number of workers that
could usefully be scaled up is the number of keys).

I know CoGroupByKey is implemented on top of GroupByKey, so I was
wondering whether the parallelization limit is the same, or whether it's
imposed by another step within CoGroupByKey.
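
For context, here's a plain-Python sketch (no Beam dependency, made-up
data) of what I understand CoGroupByKey to do semantically: each input
collection is keyed, and the output has exactly one element per distinct
key across all inputs — which is why I'd expect downstream parallelism to
be bounded by the key count.

```python
from collections import defaultdict

def co_group_by_key(*pcollections):
    # Sketch of CoGroupByKey semantics: for each distinct key, collect
    # one list of values per input collection. The result has one
    # element per key, so at most len(result) bundles can be processed
    # in parallel after the grouping step.
    grouped = defaultdict(lambda: [[] for _ in pcollections])
    for i, pcoll in enumerate(pcollections):
        for key, value in pcoll:
            grouped[key][i].append(value)
    return dict(grouped)

# Hypothetical keyed inputs.
emails = [("alice", "a@example.com"), ("bob", "b@example.com")]
phones = [("alice", "555-0100"), ("carol", "555-0199")]

result = co_group_by_key(emails, phones)
# Three distinct keys -> at most three units of post-grouping parallelism.
print(result)
```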

If anyone has any insight, or could point me towards any docs that touch
on this, it would be much appreciated!

Thanks!
