Hi all,

We use ZooKeeper to provide high availability for our jobs. Recently our Flink cluster suffered a failure: the ZooKeeper service went down unexpectedly, and every Flink job using that ZooKeeper service failed over. So many jobs restarting within a short period put too much pressure on the cluster, which in turn caused the cluster to crash. Afterwards, I looked into what ZooKeeper is used for in Flink HA:

1. Leader election
2. Service discovery
3. State persistence
The unavailability of the ZooKeeper service leads to failover of the Flink jobs. This seems to be caused by the first point, leader election: the JobManager (JM) cannot confirm whether it is active or standby. The other two points should not have this effect. However, we do not use a standby JobManager. So in my opinion, when no standby JobManager is used, the availability of the ZooKeeper service should not affect jobs that are running normally (of course, it is understandable that a task cannot be recovered correctly if an exception occurs while ZooKeeper is unavailable). Is there a way to achieve this behavior?