Hi,
I will not go into Flink specifics; the concept of watermarks is more
abstract, so I'll leave implementation details aside.
Speaking generally, yes, there is a set of requirements that must be met
in order to build a system that uses watermarks.
The primary question is: what are watermarks used for? The answer is that
we need watermarks to be able to define a partially stable order of
_events_. An event is an immutable piece of data that can be _observed_
(i.e. processed) with various consumer-dependent delays (two consumers
of the event can see it at different processing times) and that carries
a specific (local) timestamp. Generally, an event tells us that something
happened somewhere at a given local timestamp.
Watermarks create markers in the processing time of each observer, so that
the observer is able to tell whether two events (e.g. "close
time-window T1" and "new data with timestamp T2 arrived") can be ordered
(that is, to tell which one precedes the other - globally!).
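To make the ordering role concrete, here is a minimal Python sketch. The
names Observer, on_watermark and classify are made up for illustration and
are not taken from Flink or any other engine. It assumes integer timestamps
and that a window ending at T1 is closed once the observer's watermark
reaches T1.

```python
from dataclasses import dataclass


@dataclass
class Observer:
    """Tracks the watermark seen so far by one consumer (hypothetical helper)."""
    watermark: int = -1  # assumption: integer timestamps, -1 means "nothing seen yet"

    def on_watermark(self, w: int) -> None:
        # Watermarks only move forward in the observer's processing time.
        self.watermark = max(self.watermark, w)

    def classify(self, window_end: int, event_ts: int) -> str:
        """Order the event "close window at window_end" against a data event."""
        if event_ts >= window_end:
            return "after-close"   # data belongs to a later window
        if self.watermark < window_end:
            return "before-close"  # window still open, data is on time
        return "late"              # window was already closed for this observer


obs = Observer()
obs.on_watermark(5)
print(obs.classify(window_end=10, event_ts=7))   # before-close
obs.on_watermark(10)
print(obs.classify(window_end=10, event_ts=7))   # late
print(obs.classify(window_end=10, event_ts=12))  # after-close
```

The same data event (timestamp 7) is ordered differently depending on where
the observer's watermark stands, which is the point: the order is defined
per observer, relative to the markers it has seen.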
Having said that - there is a general algebra for "timestamps" - and
therefore watermarks. A timestamp can be any object that defines the
following operations:
- a less-than relation <, i.e. t1 < t2: bool; this relation needs to
be antisymmetric, so t1 < t2 implies not t2 < t1
- a function min_timestamp_following(timestamp t1, duration):
timestamp t2, that returns the minimal timestamp t2 for which t1 +
duration < t2 (this function is actually a definition of duration)
These two conditions allow us to construct a working stream processing
system, which means there should be no problem using different
"timestamps", provided we know how to construct the above.
Using "a different number" for timestamps and watermarks seems valid in
this sense, provided you are fine with the implicit definition of
duration, which is currently defined simply as t2 - t1.
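As a rough illustration of the integer case, the sketch below uses plain
Python ints as "timestamps". The function track and the parameter
max_out_of_orderness are hypothetical names, not an existing API; the
sketch assumes the watermark trails the largest version seen by a fixed
bound and that an element is late when it is less than the current
watermark.

```python
def min_timestamp_following(t1: int, duration: int) -> int:
    # Smallest t2 with t1 + duration < t2; for integers that is one step further.
    return t1 + duration + 1


def track(versions, max_out_of_orderness):
    """Return (version, watermark, is_late) for each element, in arrival order."""
    watermark = float("-inf")  # no promise made yet
    rows = []
    for v in versions:
        late = v < watermark  # uses only the less-than relation
        # Bounded out-of-orderness: the watermark trails the largest version seen.
        watermark = max(watermark, v - max_out_of_orderness)
        rows.append((v, watermark, late))
    return rows


print(min_timestamp_following(3, 4))  # 8
for row in track([1, 4, 2, 9, 3, 10], max_out_of_orderness=3):
    print(row)
```

With a bound of 3, version 3 arrives after 9 has pushed the watermark to 6,
so it is flagged as late; nothing in the logic depends on the numbers being
wall-clock time.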
I tried to explain why it is not good to expect that two events can be
globally ordered, and what the actual role of watermarks is in this, in a
Twitter thread [1], if anyone is interested.
Best,
Jan
[1]
https://twitter.com/janl_apache/status/1478757956263071745?s=20&t=cMXfPHS8EjPrbF8jys43BQ
On 2/2/23 00:18, Yaroslav Tkachenko wrote:
Hey everyone,
I'm wondering if anyone has done any experiments trying to use
non-temporal watermarks? For example, a dataset may contain some kind
of virtual timestamp / version field that behaves just like a regular
timestamp (monotonically increasing, etc.), but has a different scale
/ range.
As far as I can see Flink assumes that the values used for event times
and watermark generation are actually timestamps and the Table API
requires you to define watermarks on TIMESTAMP columns.
Practically speaking, a timestamp is just a number, so if I have a
"timeline" that consists of 1000 monotonically increasing integers,
for example, concepts like late-arriving data,
bounded-out-of-orderness, etc. still work.
Thanks for sharing any thoughts you might have on this topic!