Stephan Ewen created FLINK-4911: ----------------------------------- Summary: Non-disruptive JobManager Failures Key: FLINK-4911 URL: https://issues.apache.org/jira/browse/FLINK-4911 Project: Flink Issue Type: New Feature Components: JobManager Reporter: Stephan Ewen
JobManager failures can be handled in a non-disruptive way. The basic approach is the following: - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect to the JobManager - On connect, the TaskManager tells the JobManager about its currently running tasks - A new JobManager waits for TaskManagers to connect and report a task status. It re-constructs the ExecutionGraph state from these reports - Tasks whose status was not reconstructed in a certain time are assumed failed and trigger regular task recovery. To avoid having to re-implement this for the new JobManager / TaskManager approach in **flip-6**, I suggest to directly implement this into the {{flip-6}} feature branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)