This is a conceptual question, and the answer will probably depend on the
specific use case.

But to make the context clear, I'm joining multiple input streams by a
common (entity) id. Those inputs represent partial/disjoint updates for the
same entity. E.g., consider device interfaces, device vms, device metrics,
device vendor information, etc. I've personally worked on multiple such
assemblers and the event time strategy was never super clear. That is, how
to propagate the inputs' event times to the output, assembled entity.
Again, conceptually those services emit entity snapshots at a given time.

The strategies that I've commonly found in practice:

1. Use the max timestamp seen among the inputs (that has some nice
properties like being monotonically increasing but it can potentially hide
lag/delay in one of the inputs)
2. Use the timestamp of the latest processed event (this is how Flink would
propagate the timestamp by default, e.g., when using a CoProcessFunction;
this approach does not look semantically consistent plus the output
timestamp wouldn't be monotonically increasing)
3. Use the watermark in conjunction with event time timers to emit periodic
snapshots for example. This would recover the monotonicity at the expense
of extra latency, exacerbated by the fact that watermarks are not
calculated per-key so to say. On the flip side, it would immediately show
if one source is delayed, which might be desirable—or not!

Taking the chance for asking if there is something on the roadmap for
supporting per-key watermarks along the lines of this:
- https://github.com/TawfikYasser/Keyed-Watermarks-in-Apache-Flink

Thanks!

Salva

Reply via email to