Hi everyone, I have some questions I want to ask about how windowing,
triggering, and panes work together, and how to ensure correctness
throughout a pipeline.

Let's assume I have a very simple streaming pipeline that looks like:
Source -> CombineByKey (Sum) -> Sink

Given fixed windows of 1 hour, early firings every minute, and accumulating
panes, this is pretty straightforward.  However, it can get more
complicated if we add steps after the CombineByKey, for instance (using the
same windowing strategy):

Say I want to buffer the results of the CombineByKey into batches of N
elements.  I can do this with the built-in GroupIntoBatches [1] transform,
so now my pipeline looks like:

Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink

*This leads to my main question:*
Is ordering preserved somehow here?  I.e., is it possible that the result
from early firing F+1 now arrives BEFORE firing F (because it was
re-ordered inside GroupIntoBatches)?  The sink would then receive F+1
before F, which means my resulting store holds incorrect data (possibly
forever, if F+1 was the final firing).

If ordering is not preserved, it seems as if I'd need to reintroduce my own
ordering after GroupIntoBatches.  GIB is just the example here; I imagine
this could happen with any GBK-type operation.
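To make the hazard concrete, here is a toy simulation (plain Python, not
Beam code; the function names and values are made up for illustration).
With accumulating panes, each early firing for a key carries the full
running sum so far, so a naive last-write-wins sink is only correct if
panes arrive in order:

```python
# Toy simulation of the reordering hazard.  Pane F (index 0) carries
# the partial sum 10; pane F+1 (index 1, the final pane) carries the
# full accumulated sum 25.
firings = [("k", 0, 10), ("k", 1, 25)]

def apply_to_store(store, key, pane_index, value):
    """Naive sink: last write wins, pane order is ignored."""
    store[key] = (pane_index, value)

# If an intermediate GBK-style step reorders the firings, the sink
# sees F+1 first and F last -- the stale sum 10 sticks forever.
store = {}
for key, pane, value in reversed(firings):  # simulate reordering
    apply_to_store(store, key, pane, value)
assert store["k"] == (0, 10)  # stale result persisted

# One possible defense: carry the pane index through to the sink and
# only apply a pane if it is newer than what is already stored.
def apply_if_newer(store, key, pane_index, value):
    if key not in store or pane_index > store[key][0]:
        store[key] = (pane_index, value)

store = {}
for key, pane, value in reversed(firings):
    apply_if_newer(store, key, pane, value)
assert store["k"] == (1, 25)  # final sum wins despite reordering
```

This is just one way to restore per-key ordering at the sink; it assumes
the pane index survives the intermediate transform alongside the value.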

Am I thinking about this the correct way?  Thanks!

[1]
https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html
