Hi,

I have a question about how Dataflow drains a job. I have a job that
reads from PubSub and uses sliding windows to compute aggregates for each
window.
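
For concreteness, here is roughly the shape of the pipeline I mean, written
with the Beam Java SDK. The topic name, the window size and period, and the
mean aggregate are only placeholders, not my actual job:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class SlidingAggregates {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
     // Sliding windows of size S sliding every P (10 min / 1 min are placeholder values).
     .apply("SlidingWindow",
            Window.<String>into(SlidingWindows.of(Duration.standardMinutes(10))
                                              .every(Duration.standardMinutes(1))))
     // Placeholder aggregate: treat each message body as a number and average it per window.
     .apply("ParseValue",
            MapElements.into(TypeDescriptors.doubles())
                       .via((String s) -> Double.parseDouble(s)))
     .apply("MeanPerWindow", Mean.<Double>globally().withoutDefaults());

    p.run();
  }
}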

When I update the job and the new job is not compatible, I have the option
to either cancel or drain the existing job.

I want to understand how drain works. When I update the job and it
resumes reading from PubSub, my understanding is that the job will lose all
its state and start afresh.

For example, if the last message read had timestamp T (the messages arrive
mostly in order, since they are spaced far enough apart in seconds), then
when the new job reads a message at timestamp T+1, that message will be
assigned a window [T+1-S, T+1) that contains no other messages, which breaks
the correctness of my aggregate.
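
To make the concern concrete, here is a small self-contained sketch of the
window assignment I have in mind, assuming sliding windows of size S seconds
that slide every P seconds (the numbers are made up). The worry is that every
window containing the first message read after the restart is computed from
that one message alone, because the older messages in the same time ranges
were consumed by the drained job:

import java.util.ArrayList;
import java.util.List;

public class WindowAssignmentSketch {
  /** Returns the start times of all sliding windows [start, start + sizeSec) containing t. */
  static List<Long> windowStarts(long t, long sizeSec, long periodSec) {
    List<Long> starts = new ArrayList<>();
    long lastStart = t - (t % periodSec);           // latest period-aligned start <= t
    for (long start = lastStart; start > t - sizeSec; start -= periodSec) {
      starts.add(start);
    }
    return starts;
  }

  public static void main(String[] args) {
    long size = 600, period = 60;                   // S = 10 min, P = 1 min (assumed)
    long firstMessageAfterRestart = 1_000_000_000L; // stands in for "T+1" in the example
    // Prints the starts of the 10 windows this single message would land in;
    // under my understanding of drain, none of them see the earlier messages.
    System.out.println(windowStarts(firstMessageAfterRestart, size, period));
  }
}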

Is my understanding correct? And is there any way to work around it?

Thanks
Kishore
