Vino,

Thanks for the reply. Looking in ZK I see:

[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]

Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though that job is no longer running (it was canceled while it was in a loop attempting to restart, but failing because of a lack of cluster slots).

Any idea why that may be the case?

On Wed, Aug 1, 2018 at 8:38 AM vino yang <yanghua1...@gmail.com> wrote:

> If a job is explicitly canceled, its jobgraph node on ZK will be deleted.
> However, it is worth noting here that Flink enables a background thread to
> asynchronously delete the jobGraph node,
> so there may be cases where it cannot be deleted.
> On the other hand, the jobgraph node on ZK is the only basis for the JM
> leader to restore the job.
> There may be an unexpected recovery or an old job resurrection.
>
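
For anyone who wants to cross-check this on their own cluster, here is a minimal sketch of the comparison I did by hand. It does not talk to ZooKeeper or Flink; the job IDs are pasted in manually, the helper name stale_jobgraphs is made up for illustration, and the list of running jobs is an assumption (here, that only the first job is still running).

```python
def stale_jobgraphs(zk_children, running_job_ids):
    """Return jobgraph node names that have no matching running job."""
    running = {job_id.strip().lower() for job_id in running_job_ids}
    return sorted(
        child for child in zk_children
        if child.strip().lower() not in running
    )


# Children of /flink/cluster_1/jobgraphs, copied from the zk listing.
zk_children = [
    "d77948df92813a68ea6dfd6783f40e7e",
    "2a4eff355aef849c5ca37dbac04f2ff1",
]

# Assumption: only this job is still running on the cluster.
running_job_ids = ["d77948df92813a68ea6dfd6783f40e7e"]

# Any ID printed here still has HA state in ZK but no running job,
# so a newly elected JM leader could try to recover it.
print(stale_jobgraphs(zk_children, running_job_ids))
```

Anything this prints is a candidate for the "old job resurrection" case described in the quoted reply.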