Hi all

I'm currently developing a Spark structured streaming application which 
joins/aggregates messages from ~7 Kafka topics and produces messages onto 
another Kafka topic.

Quite often in my development cycle, I want to "reprocess from scratch": I stop 
the program, delete the target topic and associated checkpoint information, and 
restart the application with the query.

My assumption would be that the newly started query then processes all messages 
already present on the input topics, sets the watermark according to the 
freshest message on those topics, and produces all output messages that have 
moved past the watermark and can thus be safely emitted. As an example: if the 
freshest message on the topic has an event time of "2019-05-20 10:13", I 
restart the query at "2019-05-20 11:30", and I have a watermark delay of 10 
minutes, I would expect the query to end up with an eventTime watermark of 
"2019-05-20 10:03" and all earlier results to be produced.
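To make the expectation concrete, here is the arithmetic I assume applies (a 
minimal sketch using the example times above; nothing Spark-specific, just the 
"max event time minus watermark delay" rule):

```python
from datetime import datetime, timedelta

# Freshest event time observed on the input topics (example above)
max_event_time = datetime(2019, 5, 20, 10, 13)

# Watermark delay configured via withWatermark(..., "10 minutes")
watermark_delay = timedelta(minutes=10)

# Expected watermark after reading the backlog: max event time minus the
# delay, independent of the wall-clock restart time (11:30 in the example)
expected_watermark = max_event_time - watermark_delay
print(expected_watermark)  # 2019-05-20 10:03:00
```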

But my observations indicate that after the initial query startup, even though 
all input topics are read, the watermark stays at the Unix epoch (1970-01-01) 
and no messages are produced. Only once a new message arrives after the query 
has started does the watermark move ahead, and only then are all the pending 
messages produced.

Is this the expected behaviour and my assumption wrong? Or am I doing 
something wrong during query setup?

-- 
CU, Joe
