Stateful processing of session data in order

Fabian Loris Fri, 15 Oct 2021 02:05:50 -0700

Hi folks,

In my scenario I’m aggregating data in session windows and then re-windowing
it into the global window to process each key with a stateful DoFn.
I’m currently doing this in batch, where the order of the data is
important, so the session windows need to be processed in order. I couldn’t
find a standard way of doing this. My tests show that the Dataflow runner
processes the data in order while the DirectRunner does not, so the two
runners seem to behave differently. I know Apache Beam makes no guarantee
on order, but it seems the Dataflow runner does?!


Can someone provide more details here, as I couldn’t find anything? Did I
discover some undocumented behavior of the Dataflow runner? Is there an
alternative to the approach I'm using right now (see below)?

Although Dataflow is behaving as I want it to, I'm still not sure if I
can/should rely on that. Furthermore, I can't test it in an automated way
because the DirectRunner behaves differently.

Below you can find some code snippets. I use Scala with Scio, but I believe
it is still readable:

pipelineInput
  .timestampBy(m => m.timestamp, allowedTimestampSkew = Duration.standardMinutes(1))
  .withSessionWindows(Duration.standardMinutes(10),..)
  .applyKvTransform(WithKeys.of(new SerializableFunction[Model, String] {
    private[pipeline] override def apply(input: Model): String = input.key
  }))
  .applyKvTransform(GroupByKey.create[String, Model]())
  .applyKvTransform(Window.remerge[KV[String, java.lang.Iterable[Model]]]())
  .withGlobalWindow() // back to global window
  .applyTransform(ParDo.of(statefulDoFn)) // <- here the session windows should be processed in order
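
One runner-independent alternative might be to stop relying on arrival
order altogether: collect all session results of a key in one place, sort
them by timestamp, and only then apply the per-key logic. The following is
a minimal sketch of that idea in plain Scala, with no Beam or Scio calls.
The names Session, SessionOrdering and processInOrder are hypothetical, and
the running total only stands in for whatever the stateful DoFn really does.

```scala
// Plain-Scala sketch of "sort per key, then fold"; all names are hypothetical.
final case class Session(key: String, startMillis: Long, values: List[Int])

object SessionOrdering {
  // Groups sessions per key, sorts each group by session start time, and
  // folds over the sorted sessions. The running total plays the role of the
  // state that a stateful DoFn would keep per key.
  def processInOrder(sessions: Seq[Session]): Map[String, List[Int]] =
    sessions
      .groupBy(_.key)
      .map { case (key, group) =>
        val ordered = group.sortBy(_.startMillis)
        val runningTotals = ordered
          .scanLeft(0)((acc, s) => acc + s.values.sum)
          .tail // drop the initial 0 so there is one total per session
          .toList
        key -> runningTotals
      }
}
```

In a pipeline this would presumably correspond to a second GroupByKey in
the global window followed by an in-memory sort inside an ordinary DoFn,
which assumes that all sessions of one key fit into memory on a worker.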

Thanks a lot for your help :)
Fabian
