Hi Eugene! I had read that link before emailing the list. It does a decent job of explaining when to use which method and what kinds of optimisations are involved, but it didn't really answer my question, i.e. how to control the number of elements of a PCollection that go into a bundle. Kenneth made it clear that this is not under user control, but now I am interested to know how the runner decides it.

> On May 21, 2018, at 7:55 PM, Eugene Kirpichov <[email protected]> wrote:
>
> Hi Abdul,
> Please see
> https://stackoverflow.com/questions/45985753/what-is-the-difference-between-dofn-setup-and-dofn-startbundle
> - let me know if it answers your question sufficiently.
>
> On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer <[email protected]> wrote:
> Hi!
>
> I was trying to understand the behavior of StartBundle and FinishBundle
> w.r.t. DoFns.
> I have an unbounded data source and I am trying to leverage bundling to
> achieve batching.
> From the docs of ParDo:
>
> "when a ParDo transform is executed, the elements of the input PCollection
> are first divided up into some number of "bundles""
>
> I would like to know if bundling is possible for unbounded data in the first
> place. If it is then how do I control the bundle size i.e. number of elements
> of a given PCollection in that bundle?
