Hi,

I will not speak about details related to Flink specifically, the concept of watermarks is more abstract, so I'll leave implementation details aside.

Speaking generally, yes, there is a set of requirements that must be met in order to be able to generate a system that uses watermarks.

The primary question is what are watermarks used for? The answer is - we need watermarks to be able to define a partially stable order of _events_. Event is an immutable piece of data that can be _observed_ (i.e. processed) with various consumer-dependent delays (two consumers of the event can see the event at different processing times), or a specific (local) timestamp. Generally an event tells us that something, somewhere happened at given local timestamp.

Watermarks create markers in processing time of each observer, so that the observer is able to tell if two events (e.g. event "close time-window T1" and "new data with timestamp T2 arrived") can be ordered (that is being able to tell which one is - globally! - preceding the other).

Having said that - there is a general algebra for "timestamps" - and therefore watermarks. A timestamp can be any object that defines the following operations:

 - a less-than relation <, i.e. t1 < t2: bool, this relation needs to be a antisymmetric, so t1 < t2 implies not t2 < t1

 - a function min_timestamp_following(timestamp t1, duration): timestamp t2, that returns the minimal timestamp, for which t1 + duration < t2 (this function is actually a definition of duration)

These two conditions allows to construct a working streaming processing system, which means there should be no problem using different "timestamps", provided we know how to construct the above.

Using "a different number" for timestamps and watermarks seems valid in this sense, provided you are fine with the implicit definition of duration, that is currently defined as simple t2 - t1.

I tried to explain why it is not good to expect that two events can be globally ordered and what is the actual role of watermarks in this in a twitter thread [1], if anyone interested.

Best,

 Jan

[1] https://twitter.com/janl_apache/status/1478757956263071745?s=20&t=cMXfPHS8EjPrbF8jys43BQ

On 2/2/23 00:18, Yaroslav Tkachenko wrote:
Hey everyone,

I'm wondering if anyone has done any experiments trying to use non-temporal watermarks? For example, a dataset may contain some kind of virtual timestamp / version field that behaves just like a regular timestamp (monotonically increasing, etc.), but has a different scale / range.

As far as I can see Flink assumes that the values used for event times and watermark generation are actually timestamps and the Table API requires you to define watermarks on TIMESTAMP columns.

Practically speaking timestamp is just a number, so if I have a "timeline" that consists of 1000 monotonically increasing integers, for example, the concepts like late-arriving data, bounded-out-of-orderness, etc. still work.

Thanks for sharing any thoughts you might have on this topic!

Reply via email to