Hi Stephan,
Thank you for the detailed explanation. As you said, my opinion is to keep the
task running during jobmanager failover, even if sending the status update
fails.
For the first reason you mentioned, if I understand correctly, the key issue is
that the task status gets out of sync between the taskmanager and the
jobmanager. For example, suppose the task is in CREATED status when the
jobmanager fails over. When the task transitions to RUNNING, the updateStatus
message cannot be delivered because of the failover, so the taskmanager
retries sending it until it succeeds. When the jobmanager recovers, the task
is still CREATED in the jobmanager's view, while it may already have
transitioned to FINISHED in the taskmanager's view. The key problem is that if
the jobmanager receives the FINISHED message before the RUNNING message, it
will reject the FINISHED message. What if the task maintains a queue of
outgoing status messages during jobmanager failover, so that the jobmanager
receives them in order once it recovers, i.e. the RUNNING message always
arrives before the FINISHED message? Would there be any problems with that
approach?
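To make the queue idea concrete, here is a rough sketch of what I have in
mind. All names in it (OrderedStatusSender, State, the transport callback) are
hypothetical and are not existing Flink classes. It only illustrates the
ordering idea, assuming the transport can tell the task whether the jobmanager
acknowledged a message.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/** Hypothetical sketch, not an existing Flink class. */
class OrderedStatusSender {

    /** Task states used in this example only. */
    enum State { CREATED, RUNNING, FINISHED }

    // Status updates not yet acknowledged by the jobmanager, oldest first.
    private final Deque<State> pending = new ArrayDeque<>();

    // Hypothetical transport: returns true only if the jobmanager
    // acknowledged the update, false if it is unreachable.
    private final Predicate<State> transport;

    OrderedStatusSender(Predicate<State> transport) {
        this.transport = transport;
    }

    /** Record a state transition and try to deliver everything buffered. */
    synchronized void update(State newState) {
        pending.addLast(newState);
        flush();
    }

    /** Send buffered updates strictly in FIFO order; stop at the first failure. */
    synchronized void flush() {
        while (!pending.isEmpty()) {
            State head = pending.peekFirst();
            if (!transport.test(head)) {
                // Jobmanager unreachable: keep the head and everything
                // behind it, so a later update can never overtake it.
                return;
            }
            pending.removeFirst();
        }
    }

    /** Number of updates still waiting for an acknowledgement. */
    synchronized int pendingCount() {
        return pending.size();
    }
}
```

The taskmanager would call flush() again on each retry or when it learns about
the new leader, so FINISHED is only sent after RUNNING has been acknowledged.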
For the second reason you mentioned, I am not very clear on the mechanism of
filtering critical messages by leaderSessionID. Could you explain it in more
detail?
I am trying to improve the jobmanager and taskmanager failover process. Thank
you for your help!
Zhijiang Wang
