Hi,

As per the Authoring I/O Transforms guide
<https://beam.apache.org/documentation/io/authoring-overview/>, the
recommended way to implement a Read transform (from a source that can be
read in parallel) has these steps:
- Splitting the data into parts to be read in parallel (ParDo)
- Reading from each of those parts (ParDo)
- With a GroupByKey in between the two ParDos
The stated motivation for the GroupByKey is that "it allows the runner to use
different numbers of workers" for the splitting and reading parts. Can
someone elaborate (or point to some relevant docs) on how the GroupByKey
enables using different numbers of workers for the two ParDo steps?
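
To make sure I am describing the pattern correctly, here is how I picture the
three steps. This is a plain-Python model, not real Beam code: the function
names, the RECORDS list and the part size are my own hypothetical stand-ins,
and it says nothing about how a runner actually schedules workers.

```python
from collections import defaultdict

# Hypothetical stand-in for an external source: 10 records in one "file".
RECORDS = [f"record-{i}" for i in range(10)]


def split(source_size, part_size):
    """Step 1 (a ParDo in Beam): emit one (key, (start, end)) pair per part."""
    for key, start in enumerate(range(0, source_size, part_size)):
        yield key, (start, min(start + part_size, source_size))


def group_by_key(pairs):
    """Step 2 (GroupByKey in Beam): collect all values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def read_part(part):
    """Step 3 (a ParDo in Beam): read the records belonging to one part."""
    start, end = part
    return RECORDS[start:end]


# Split the source, group the parts by key, then read every part.
parts = list(split(len(RECORDS), part_size=4))
grouped = group_by_key(parts)
records = [
    record
    for ranges in grouped.values()
    for part in ranges
    for record in read_part(part)
]
```

In a real pipeline each of these functions would be a Beam transform. My
question is what the middle step changes from the runner's point of view.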

Thanks,
Mohamed
