Hi All,
We use ZooKeeper to provide high availability for our jobs. Recently, our Flink 
cluster suffered a failure: the ZooKeeper service went down unexpectedly, and 
every Flink job using that ZooKeeper failed over. So many jobs restarting 
within a short period of time put too much pressure on the cluster, which in 
turn caused the cluster to crash.
Afterwards, I checked which HA functions ZooKeeper provides:
1. Leader election
2. Service discovery
3. State persistence

The unavailability of the ZooKeeper service led to failover of the Flink jobs. 
This seems to be caused by the first point: the JobManager cannot confirm 
whether it is active or standby. The other two points should not affect running 
jobs. However, we do not use a standby JobManager.
So in my opinion, if no standby JobManager is used, the availability of the 
ZooKeeper service should not affect jobs that are running normally (of course, 
it is understandable that a job cannot be recovered correctly if an exception 
occurs). Is there a way to achieve this behavior?
