The GbkBeforeStatefulParDo is an implementation detail used to send all elements with the same key to the same worker (so that they can share state, which is itself partitioned by worker). This does cause a global barrier in batch pipelines.
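
As a rough illustration of the idea (plain Python, not Beam or Dataflow code; worker_for, shuffle_by_key and stateful_sum are made-up names for this sketch), routing every element for a key to one worker lets the per-key state live in a single place. The grouping step hands nothing downstream until it has read all of its input, and that is the barrier:

```python
import zlib
from collections import defaultdict


def worker_for(key, num_workers):
    # Deterministic hash, so a given key always maps to the same worker.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def shuffle_by_key(elements, num_workers):
    """Route (key, value) pairs to workers and group them per key.

    The result is only returned after the whole input has been consumed,
    which is the barrier behaviour of a group-by-key in a batch job.
    """
    workers = [defaultdict(list) for _ in range(num_workers)]
    for key, value in elements:
        workers[worker_for(key, num_workers)][key].append(value)
    return workers


def stateful_sum(workers):
    # Stand-in for a stateful step: each key's running total is kept
    # on the one worker that owns that key.
    totals = {}
    for per_key in workers:
        for key, values in per_key.items():
            state = 0
            for value in values:
                state += value
            totals[key] = state
    return totals


elements = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
workers = shuffle_by_key(elements, num_workers=3)
totals = stateful_sum(workers)
```

A real runner does the routing and grouping in its shuffle rather than in memory, so treat this only as a picture of why the grouping comes first.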
On Thu, May 25, 2023 at 2:15 PM Evan Galpin <egal...@apache.org> wrote:

> Hi all,
>
> I'm running into a scenario where I feel that Dataflow Overrides
> (specifically BatchStatefulParDoOverrides.GbkBeforeStatefulParDo) are
> unnecessarily causing a batch pipeline to "pause" throughput since a GBK
> needs to have processed all the data in a window before it can output.
>
> Is it strictly required that GbkBeforeStatefulParDo must run before any
> stateful DoFn? If not, what failure modes is GbkBeforeStatefulParDo trying
> to protect against, and how can it be bypassed/disabled while still using
> DataflowRunner?
>
> Thanks,
> Evan