Understood, thanks for the clarification; I'll need to look more in-depth at my pipeline code then. I'm definitely observing that all steps downstream of the Stateful step in my pipeline do not start until the steps upstream of the Stateful step have fully completed. The Stateful step is a RateLimit[1] transform which borrows heavily from GroupIntoBatches.
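For context, the stateful core of that transform looks roughly like the sketch below (my paraphrase, not the gist itself; RateLimitFn, elementsPerInterval, and interval are illustrative names, and coder inference details are glossed over). Because it declares @StateId/@TimerId state and timers, the batch Dataflow runner keys and shuffles its input before running it:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Illustrative stateful DoFn modeled on GroupIntoBatches: buffer per-key
    // elements in state and release at most elementsPerInterval per timer firing.
    class RateLimitFn<T> extends DoFn<KV<Integer, T>, T> {

      @StateId("buffer")
      private final StateSpec<BagState<T>> bufferSpec = StateSpecs.bag();

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      private final int elementsPerInterval;
      private final Duration interval;

      RateLimitFn(int elementsPerInterval, Duration interval) {
        this.elementsPerInterval = elementsPerInterval;
        this.interval = interval;
      }

      @ProcessElement
      public void process(
          @Element KV<Integer, T> element,
          @StateId("buffer") BagState<T> buffer,
          @TimerId("flush") Timer flush) {
        // Buffer the element and (re)arm a processing-time timer to drain it later.
        buffer.add(element.getValue());
        flush.offset(interval).setRelative();
      }

      @OnTimer("flush")
      public void onFlush(
          OnTimerContext context,
          @StateId("buffer") BagState<T> buffer,
          @TimerId("flush") Timer flush) {
        // Emit up to elementsPerInterval buffered elements, re-buffer the rest,
        // and reset the timer if anything is left over.
        List<T> remainder = new ArrayList<>();
        int emitted = 0;
        for (T value : buffer.read()) {
          if (emitted < elementsPerInterval) {
            context.output(value);
            emitted++;
          } else {
            remainder.add(value);
          }
        }
        buffer.clear();
        if (!remainder.isEmpty()) {
          remainder.forEach(buffer::add);
          flush.offset(interval).setRelative();
        }
      }
    }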
[1] https://gist.github.com/egalpin/162a04b896dc7be1d0899acf17e676b3

On Thu, May 25, 2023 at 2:25 PM Robert Bradshaw via user <user@beam.apache.org> wrote:

> The GbkBeforeStatefulParDo is an implementation detail used to send all
> elements with the same key to the same worker (so that they can share
> state, which is itself partitioned by worker). This does cause a global
> barrier in batch pipelines.
>
> On Thu, May 25, 2023 at 2:15 PM Evan Galpin <egal...@apache.org> wrote:
>
>> Hi all,
>>
>> I'm running into a scenario where I feel that Dataflow Overrides
>> (specifically BatchStatefulParDoOverrides.GbkBeforeStatefulParDo) are
>> unnecessarily causing a batch pipeline to "pause" throughput, since a GBK
>> needs to have processed all the data in a window before it can output.
>>
>> Is it strictly required that GbkBeforeStatefulParDo must run before any
>> stateful DoFn? If not, what failure modes is GbkBeforeStatefulParDo trying
>> to protect against, and how can it be bypassed/disabled while still using
>> DataflowRunner?
>>
>> Thanks,
>> Evan
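P.S. To make the shape of the problem concrete, a hypothetical application of such a transform might look like the following (the bucket count, rate, and I/O are made up, and RateLimitFn is the sketch above, not the real gist). On the batch Dataflow runner, the override effectively turns the keying step into a GroupByKey boundary, so the final write cannot start until every input element has been read and shuffled:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    public class RateLimitedPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("gs://my-bucket/input-*.txt"))
            // State is partitioned per key, so elements must be keyed first. On the
            // batch Dataflow runner, BatchStatefulParDoOverrides inserts a GBK here
            // (GbkBeforeStatefulParDo), which is the global barrier discussed above.
            .apply(WithKeys.of(line -> Math.floorMod(line.hashCode(), 10))
                .withKeyType(TypeDescriptors.integers()))
            .apply(ParDo.of(new RateLimitFn<String>(100, Duration.standardSeconds(1))))
            .apply(TextIO.write().to("gs://my-bucket/rate-limited"));

        p.run();
      }
    }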