I don't know how to reproduce it, but I've observed three kinds of termination when connectivity with ZooKeeper is disrupted. I don't think it's an issue with ZooKeeper itself, as the same ensemble has supported a much bigger Kafka cluster for a few years.

1. The first kind is exactly this: https://github.com/apache/flink/pull/11338
   Basically, a temporary loss of connectivity or a rolling upgrade of ZooKeeper causes the job to terminate. It eventually restarts from where it left off.
2. The second kind is when the job terminates and restarts for the same reason but is unable to recover from the checkpoint. I think it's similar to this: https://issues.apache.org/jira/browse/FLINK-19154
   If upgrading to 1.12.0 (from 1.11.2) will fix this second issue, then I'll upgrade.
3. The third kind is where the job restarts repeatedly because it's unable to establish a session with ZooKeeper. I don't know if reducing the session timeout will help here, but in this case I'm forced to disable ZooKeeper HA entirely, as the job cannot even restart.

I could create a JIRA ticket to discuss the ZooKeeper issue itself if you suggest, but the ZooKeeper and savepoint issues are related, as I'm not sure what will happen in each of the cases above.
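
For reference, the session timeout I mention is one of the ZooKeeper client settings in flink-conf.yaml. A minimal sketch of the relevant section, assuming a standard ZooKeeper HA setup: the quorum hosts and storage path are hypothetical placeholders, and the timeout and retry values shown are what I understand to be the defaults, not tuned recommendations.

```yaml
# ZooKeeper HA section of flink-conf.yaml (sketch only)
high-availability: zookeeper
# Hypothetical hosts and path, replace with your own
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: hdfs:///flink/ha/

# Client-side timeouts and retries, in milliseconds (assumed defaults)
high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000
high-availability.zookeeper.client.retry-wait: 5000
high-availability.zookeeper.client.max-retry-attempts: 3
```

Whether changing any of these would help with the third kind of failure is exactly what I'm unsure about.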