Hi,

I had a question about how combiners work, particularly on how the combined
PCollection's subsets are initially formed.

I understand that, according to the documentation
<https://beam.apache.org/documentation/programming-guide/#combine>, a
combiner allows parallelizing the computation to multiple workers by
breaking up the PCollection into subsets. I like the database analogy given
in this post
<https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind>,
which says that it is similar to pushing down a predicate.

I also understand that it is possible to use withFanout or withHotKeyFanout
to give the runner an explicit hint on how to distribute the combining work.
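To make the fanout idea concrete, here is a plain-Python toy (no Beam imports; the function and variable names are mine, not Beam's) of how I picture hot-key fanout for a sum: scatter a hot key's values across N synthetic subkeys, combine each subgroup partially, then strip the subkey and combine the partials:

```python
import random
from collections import defaultdict

def fanout_sum(pairs, fanout):
    """Toy model of withHotKeyFanout for a sum combiner.

    Stage 1: scatter each (key, value) to a synthetic (key, i) subkey,
    so a hot key's values are split across `fanout` partial combines
    that could run in parallel.
    """
    partials = defaultdict(int)
    for key, value in pairs:
        partials[(key, random.randrange(fanout))] += value
    # Stage 2: strip the subkey and merge the partial sums per real key.
    totals = defaultdict(int)
    for (key, _subkey), partial in partials.items():
        totals[key] += partial
    return dict(totals)

pairs = [("hot", 1)] * 1000 + [("cold", 2)] * 3
print(fanout_sum(pairs, fanout=8))  # {'hot': 1000, 'cold': 6}
```

At least that is my reading of the "hot key" discussion in the documentation: the extra stage trades a little redundant work for much better parallelism on skewed keys.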

What is unclear to me, however, is whether by default the runner already
plans the distribution of the computation, even when no explicit hints are
provided. I'm guessing perhaps it always breaks up the PCollection into
bundles
<https://beam.apache.org/documentation/runtime/model/#bundling-and-persistence>
(similar to DoFns), runs the combining function on each bundle, stores each
bundle's result in an intermediate accumulator, and then merges those
accumulators recursively into a single result? If that's the case, then I
assume the purpose of withFanout and withHotKeyFanout is to further break up
those initially pre-created bundles into even smaller subsets? Or am I
guessing this wrong? :)
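In code terms, the per-bundle/merge model I'm guessing at would look something like this plain-Python sketch (loosely mirroring Beam's CombineFn lifecycle methods; the combine/bundles helper itself is my invention, not how a runner is actually implemented):

```python
def create_accumulator():
    # Mirrors CombineFn.create_accumulator: one fresh accumulator per bundle.
    return 0

def add_input(acc, value):
    # Mirrors CombineFn.add_input: fold one element into a bundle's accumulator.
    return acc + value

def merge_accumulators(accs):
    # Mirrors CombineFn.merge_accumulators: per-bundle partial results
    # "bubble up" by being merged, possibly over several rounds.
    return sum(accs)

def combine(bundles):
    # Each bundle is combined independently (in parallel on a real runner)...
    partials = []
    for bundle in bundles:
        acc = create_accumulator()
        for value in bundle:
            acc = add_input(acc, value)
        partials.append(acc)
    # ...and only the small per-bundle accumulators are merged at the end.
    return merge_accumulators(partials)

print(combine([[1, 2, 3], [4, 5], [6]]))  # 21
```

Is that roughly the default behavior, with the bundle boundaries chosen by the runner?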

I couldn't find a clear description in the documentation on how the
PCollection subsets are initially formed. Please let me know if you have
some details on that, or if it is already documented somewhere.

Thank you!

Julien
