Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-03-03 Thread David Anderson
When bounded Flink sources reach the end of their input, a special watermark with the value Watermark.MAX_WATERMARK is emitted that will take care of flushing all windows. One approach is to use a DeserializationSchema or KafkaDeserializationSchema with an implementation of isEndOfStream that

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-03-01 Thread Rion Williams
Hey David et al., I had one follow-up question for this as I've been putting together some integration/unit tests to verify that things are working as expected with finite datasets (e.g. a text file with several hundred records that are serialized, injected into Kafka, and processed through the


Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-27 Thread Rion Williams
Thanks David, I figured that the correct approach would obviously be to adopt a keying strategy upstream to ensure the same data that I used as a key downstream fell on the same partition (ensuring the ordering guarantees I’m looking for). I’m guessing implementation-wise, when I would
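The partitioning guarantee discussed here rests on deterministic key hashing: if every record for a tenant carries the same producer key, every record hashes to the same partition. A minimal sketch of that principle, assuming a plain `hashCode`-based hash for simplicity (Kafka's default partitioner actually uses murmur2 over the key bytes, but the determinism argument is identical):

```java
// Sketch: why keying producer records by tenant ID keeps each tenant's
// events in a single partition. Any deterministic hash of the key gives
// the guarantee; hashCode stands in here for Kafka's murmur2.
class TenantPartitioning {
    static int partitionFor(String tenantId, int numPartitions) {
        return Math.floorMod(tenantId.hashCode(), numPartitions);
    }
}
```

With the real producer, this corresponds to setting the tenant ID as the record key, e.g. `new ProducerRecord<>(topic, tenantId, value)`, and letting the default partitioner do the hashing. Note that two different tenants may still share a partition; the guarantee is only that one tenant never spans two.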

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-27 Thread David Anderson
Rion, If you can arrange for each tenant's events to be in only one kafka partition, that should be the best way to simplify the processing you need to do. Otherwise, a simple change that may help would be to increase the bounded delay you use in calculating your own per-tenant watermarks,
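The "bounded delay" in a hand-rolled per-tenant watermark can be sketched as follows: the watermark for a tenant trails the largest event timestamp seen for that tenant by a fixed delay, so a larger delay tolerates more out-of-orderness at the cost of latency. Class and method names are illustrative, not Flink API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-tenant watermarks tracked by hand. Each tenant's
// watermark = (max event timestamp seen for that tenant) - boundedDelay.
// Increasing boundedDelay admits more out-of-order events before a
// window is considered complete.
class PerTenantWatermarks {
    private final long boundedDelayMs;
    private final Map<String, Long> maxTimestamp = new HashMap<>();

    PerTenantWatermarks(long boundedDelayMs) {
        this.boundedDelayMs = boundedDelayMs;
    }

    /** Records an event and returns that tenant's current watermark. */
    long observe(String tenant, long eventTimeMs) {
        long max = Math.max(maxTimestamp.getOrDefault(tenant, Long.MIN_VALUE), eventTimeMs);
        maxTimestamp.put(tenant, max);
        return max - boundedDelayMs;
    }
}
```

Because the map is keyed by tenant, a slow or silent tenant never holds back another tenant's watermark, which is exactly the isolation Flink's global watermarks cannot provide.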

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-26 Thread Rion Williams
David and Timo, Firstly, thank you both so much for your contributions and advice. I believe I’ve implemented things along the lines that you both detailed and things appear to work just as expected (e.g. I can see things arriving, being added to windows, discarding late records, and

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-26 Thread David Anderson
Yes indeed, Timo is correct -- I am proposing that you not use timers at all. Watermarks and event-time timers go hand in hand -- and neither mechanism can satisfy your requirements. You can instead put all of the timing logic in the processElement method -- effectively emulating what you would

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-26 Thread Timo Walther
Hi Rion, I think what David was referring to is that you do the entire time handling yourself in the process function. That means not using the `context.timerService()` or `onTimer()` that Flink provides but calling your own logic based on the timestamps that enter your process function and the
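The pseudo-window approach David and Timo describe can be sketched in plain Java: all timing happens inside `processElement`-style logic, with no Flink timers. Events are bucketed into tumbling windows by their timestamp, a per-key watermark is tracked by hand, and a window is flushed once that watermark passes its end. This is a simplified single-key emulation under assumed names, not the actual `KeyedProcessFunction` code from the thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "pseudo window" idea: tumbling windows maintained by
// hand inside processElement, with a manually tracked watermark.
// In a real Flink job, `windows` and `maxTimestampSeen` would live in
// keyed state (MapState/ValueState) rather than plain fields.
class PseudoWindowSketch {
    private final long windowSizeMs;
    private final long allowedLatenessMs;
    private final Map<Long, List<Long>> windows = new HashMap<>(); // windowStart -> event times
    private long maxTimestampSeen = Long.MIN_VALUE;

    PseudoWindowSketch(long windowSizeMs, long allowedLatenessMs) {
        this.windowSizeMs = windowSizeMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    /** Processes one event; returns the contents of any windows now complete. */
    List<List<Long>> processElement(long eventTimeMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMs);
        long watermark = maxTimestampSeen - allowedLatenessMs;
        long windowStart = eventTimeMs - Math.floorMod(eventTimeMs, windowSizeMs);
        if (windowStart + windowSizeMs > watermark) {
            // Event's window is still open: buffer it.
            windows.computeIfAbsent(windowStart, s -> new ArrayList<>()).add(eventTimeMs);
        } // else: the window has already closed, so the late record is discarded.
        List<List<Long>> fired = new ArrayList<>();
        windows.entrySet().removeIf(e -> {
            if (e.getKey() + windowSizeMs <= watermark) { // window end passed watermark
                fired.add(e.getValue());
                return true;
            }
            return false;
        });
        return fired;
    }
}
```

Since this logic never touches `timerService()`, nothing depends on Flink's global watermarks, which is what makes per-tenant progress possible.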

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-25 Thread Rion Williams
Hi David, Thanks for your prompt reply, it was very helpful and the PseudoWindow example is excellent. I believe it closely aligns with an approach that I was tinkering with but seemed to be missing a few key pieces. In my case, I'm essentially going to want to be aggregating the messages

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-25 Thread David Anderson
Rion, What you want isn't really achievable with the APIs you are using. Without some sort of per-key (per-tenant) watermarking -- which Flink doesn't offer -- the watermarks and windows for one tenant can be held up by the failure of another tenant's events to arrive in a timely manner.

Handling Data Separation / Watermarking from Kafka in Flink

2021-02-25 Thread Rion Williams
Hey folks, I have a somewhat high-level/advice question regarding Flink and if it has the mechanisms in place to accomplish what I’m trying to do. I’ve spent a good bit of time using Apache Beam, but recently pivoted over to native Flink simply because some of the connectors weren’t as mature or