This explains it. Thanks Reza!

On Thu, Jan 3, 2019 at 1:19 AM Reza Ardeshir Rokni <[email protected]> wrote:
> Hi Mohamed,
>
> I believe this is related to fusion, which is a feature of some of the
> runners. You will be able to find more information on fusion at:
>
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
>
> Cheers
>
> Reza
>
> On Thu, 3 Jan 2019 at 04:09, Mohamed Haseeb <[email protected]> wrote:
>
>> Hi,
>>
>> As per the Authoring I/O Transforms guide
>> <https://beam.apache.org/documentation/io/authoring-overview/>, the
>> recommended way to implement a Read transform (from a source that can be
>> read in parallel) has these steps:
>> - Splitting the data into parts to be read in parallel (ParDo)
>> - Reading from each of those parts (ParDo)
>> - With a GroupByKey in between the ParDos
>> The stated motivation for the GroupByKey is that "it allows the runner to use
>> different numbers of workers" for the splitting and reading steps. Can
>> someone elaborate (or point to some relevant docs) on how the GroupByKey
>> enables using a different number of workers for the two ParDo steps?
>>
>> Thanks,
>> Mohamed
