Till Rohrmann created FLINK-2289:
------------------------------------

             Summary: Make JobManager highly available
                 Key: FLINK-2289
                 URL: https://issues.apache.org/jira/browse/FLINK-2289
             Project: Flink
          Issue Type: Improvement
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann


Currently, the {{JobManager}} is the single point of failure in the Flink 
system. If it fails, then your job cannot be recovered and the Flink cluster is 
no longer able to receive new jobs.

Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the 
Flink cluster can recover from failed {{JobManager}}. As a first step towards 
this goal, I propose to make the {{JobManager}} highly available by starting 
multiple instances and using Apache ZooKeeper to elect a leader. The leader is 
responsible for the execution of the Flink job. 

In case that the {{JobManager}} dies, one of the other running {{JobManager}} 
will be elected as the leader and take over the role of the leader. The 
{{Client}} and the {{TaskManager}} will automatically detect the new 
{{JobManager}} by querying the ZooKeeper cluster.

Note that this does not achieve full fault tolerance for the {{JobManager}} but 
it allows the cluster to recover from failed {{JobManager}}. The design of 
high-availability for the {{JobManager}} is tracked in the wiki here [1].

Resources:
[1] 
[https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to