Got the point. If you would like to get "correct" output, you may need to set global watermark as "min", because watermark is not only used for evicting rows in state, but also discarding input rows later than watermark. Here you may want to be aware that there're two stateful operators which will receive inputs from previous stage and discard them via watermark before processing.
Btw, you may also need to consider the difference of the concept of watermark between Spark and others: 1. Spark uses high watermark (picks highest event timestamp of input rows) even for single watermark whereas other frameworks use low watermark (picks lowest event timestamp of input rows). So you may always need to set enough delay on watermark. 2. Spark uses global watermark whereas other frameworks normally use operator-wise watermark. This is limitation of Spark (given outputs of previous stateful operator will become inputs of next stateful operator, they should have different watermark) and one of contributor proposes the approach [1] which would fit for Spark (unfortunately it haven't been reviewed by committers so long). Thanks, Jungtaek Lim (HeartSaVioR) 1. https://github.com/apache/spark/pull/23576 On Tue, Jun 11, 2019 at 7:06 AM Joe Ammann <j...@pyx.ch> wrote: > Hi all > > it took me some time to get the issues extracted into a piece of > standalone code. I created the following gist > > https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 > > I has messages for 4 topics A/B/C/D and a simple Python program which > shows 6 use cases, with my expectations and observations with Spark 2.4.3 > > It would be great if you could have a look and check if I'm doing > something wrong, or this is indeed a limitation of Spark? > > On 6/5/19 5:35 PM, Jungtaek Lim wrote: > > Nice to hear you're investigating the issue deeply. > > > > Btw, if attaching code is not easy, maybe you could share > logical/physical plan on any batch: "detail" in SQL tab would show up the > plan as string. Plans from sequential batches would be much helpful - and > streaming query status in these batch (especially watermark) should be > helpful too. > > > > > -- > CU, Joe > -- Name : Jungtaek Lim Blog : http://medium.com/@heartsavior Twitter : http://twitter.com/heartsavior LinkedIn : http://www.linkedin.com/in/heartsavior