Different runners decide it differently.

E.g. for the Dataflow runner: in batch mode, bundles are usually quite
large, e.g. several-dozen-MB chunks of files, or fairly large key ranges
of something like Bigtable or GroupByKey output. The bundle sizes are
not known in advance (e.g. when the runner produces a bundle "read key
range [a, b)" it has no way of knowing how many keys there actually are
between a and b), and they even change as the job runs [1]. In streaming
mode, bundles usually close after either a few thousand elements read
from the source or a few seconds, whichever happens first - nothing too
fancy going on.

The Flink runner currently puts each element in its own bundle, which is
quite inefficient and a known performance issue. I don't know about Spark.
The direct runner, I believe, uses a mix of these strategies.

Basically, if you want batching, you have to do it yourself, in a way that
does not violate runner bundle boundaries - i.e., flush any pending batch
in FinishBundle rather than letting it span bundles. In practice this is
trivial to implement and never much of a problem.
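To make the pattern concrete, here is a minimal sketch in plain Python
that mimics the DoFn bundle lifecycle (start_bundle / process /
finish_bundle). The class name and batch size are illustrative, and a
real implementation would subclass apache_beam.DoFn (or, in Java, use the
@StartBundle/@FinishBundle annotations) - this is just the shape of the
technique:

```python
class BatchingDoFn:
    """Batches elements within a bundle, never letting a batch span bundles."""

    def __init__(self, max_batch_size=100):
        self.max_batch_size = max_batch_size
        self.batch = None

    def start_bundle(self):
        # Start each bundle with a fresh, empty batch: batches must not
        # cross bundle boundaries.
        self.batch = []

    def process(self, element):
        # Accumulate elements; emit a full batch as soon as it reaches
        # the size limit.
        self.batch.append(element)
        if len(self.batch) >= self.max_batch_size:
            yield from self._flush()

    def finish_bundle(self):
        # Flush whatever is left, so no elements leak across bundles.
        yield from self._flush()

    def _flush(self):
        if self.batch:
            out, self.batch = self.batch, []
            yield out


# Simulate a runner handing this DoFn one bundle of 7 elements:
fn = BatchingDoFn(max_batch_size=3)
fn.start_bundle()
results = []
for e in range(7):
    results.extend(fn.process(e))
results.extend(fn.finish_bundle())
# results == [[0, 1, 2], [3, 4, 5], [6]]
```

Note that finish_bundle emits the final partial batch ([6] above), which
is exactly the "don't batch across a FinishBundle" rule: whatever the
runner chose as the bundle boundary, the batch is closed there.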

[1]
https://qconlondon.com/system/files/presentation-slides/straggler-free_data_processing_in_cloud_dataflow.pdf

On Tue, May 22, 2018 at 1:12 AM Abdul Qadeer <[email protected]> wrote:

> Hi Eugene!
>
> I had gone through that link before sending an email here. It does a
> decent job explaining when to use which method and what kind of
> optimisations we are looking at, but it didn't really answer the question
> I had, i.e. controlling the granularity of elements of a PCollection in a
> bundle. Kenneth made it clear that it is not under user control, but now I
> am interested to know how the runner decides it.
>
>
> On May 21, 2018, at 7:55 PM, Eugene Kirpichov <[email protected]>
> wrote:
>
> Hi Abdul,
> Please see
> https://stackoverflow.com/questions/45985753/what-is-the-difference-between-dofn-setup-and-dofn-startbundle
>  -
> let me know if it answers your question sufficiently.
>
> On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer <[email protected]>
> wrote:
>
>> Hi!
>>
>> I was trying to understand the behavior of StartBundle and FinishBundle
>> w.r.t. DoFns.
>> I have an unbounded data source and I am trying to leverage bundling to
>> achieve batching.
>> From the docs of ParDo:
>>
>> "when a ParDo transform is executed, the elements of the input
>> PCollection are first divided up into some number of "bundles"
>>
>> I would like to know if bundling is possible for unbounded data in the
>> first place. If it is, then how do I control the bundle size, i.e. the
>> number of elements of a given PCollection in that bundle?
>>
>
>
