Hi! That is a super interesting idea. If I understand you correctly, you are suggesting to try and reconcile the TaskManagers and the JobManager before restarting the job. That would mean that in case of a master failure, the jobs may simply continue to run. That would be a nice enhancements, but I think it is slightly more complicated, for two reasons:
1) An assumption that we currently make is that the JobManager view and TaskManager view of the task status get out of sync (a JobManager failure is a reason that can happen), then the task is restarted. That is a quite robust solution, but may lead to restarts in cases that may be recovered otherwise as well. For example only certain state transitions are valid (like RUNNING to FINISHED). If the JobManager gets an update that the state is FINISHED when the JobManager thought it was CANCELED or CREATED, it will reject this under the assumption that something went wrong in the distributed coordination. If we want to keep this safety checks, it would probably need something like the JobManager asking TaskManagers that connect for their current status and resetting the Job status to that. 2) To make sure that JobManagers and TaskManagers do not confuse messages from different sessions (a session being a JobManager having leader role), we filter the critical messages by a "leaderSessionId", which is again very robust. This would also need a careful change. Hope that helps in understanding the current rational. If you want to work on improving this, it would be great. We should probably talk more about the detailed changes needed. Greetings, Stephan On Thu, Jan 14, 2016 at 3:49 AM, wangzhijiang999 <wangzhijiang...@aliyun.com > wrote: > Hi, > > As i know, when TaskManager send UpdateTaskExecutionState to > JobManager, if the JobManager failover and the future response is fail, the > task will be failed. Is it feasible to retry send UpdateTaskExecutionState > again when future response fail until success. In JobManager HA mode, > the UpdateTaskExecutionState should be success when the leader JobManager > active. Or are there any suggestions for sending messages during JobManager > failover instead of fail task. > > Thanks for any help in advance! > > > Zhijiang Wang >