I can see in the logs that JM 1 (10.210.22.167), the one that became leader after the failover, thinks it deleted job 2a4eff355aef849c5ca37dbac04f2ff1 from ZK when it was canceled:
July 30th 2018, 15:32:27.231 Trying to cancel job with ID 2a4eff355aef849c5ca37dbac04f2ff1.
July 30th 2018, 15:32:27.232 Job Some Job (2a4eff355aef849c5ca37dbac04f2ff1) switched from state RESTARTING to CANCELED.
July 30th 2018, 15:32:27.232 Stopping checkpoint coordinator for job 2a4eff355aef849c5ca37dbac04f2ff1
July 30th 2018, 15:32:27.239 Removed job graph 2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper.
July 30th 2018, 15:32:27.245 Removing /flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper
July 30th 2018, 15:32:27.251 Removing /checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper

Both /flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1 and /flink/cluster_1/checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1 no longer exist, but for some reason the job graph is still there. Looking at the ZK logs, I found the problem:

July 30th 2018, 15:32:27.241 Got user-level KeeperException when processing sessionid:0x2000001d2330001 type:delete cxid:0x434c zxid:0x60009dd94 txntype:-1 reqpath:n/a Error Path:/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1 Error:KeeperErrorCode = Directory not empty for /flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1

Looking in ZK, we see:

[zk: localhost:2181(CONNECTED) 0] ls /flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1
[d833418c-891a-4b5e-b983-080be803275c]

From the comments in ZooKeeperStateHandleStore.java I gather that this child node is used as a deletion lock.
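To make the failure mode concrete, here is a minimal toy model of ZooKeeper's non-recursive delete semantics, in plain Python. This is not the real ZooKeeper client API; the class and error names are made up for illustration. It shows why a leftover ephemeral lock child makes the parent job graph node undeletable, producing the "Directory not empty" error seen in the ZK log above.

```python
# Toy model of ZooKeeper non-recursive delete (NOT the real client API).
# Nodes live in a flat dict: path -> (data, owner_session or None).
# Children are inferred from path prefixes.

class NotEmptyError(Exception):
    """Stand-in for KeeperErrorCode = Directory not empty."""

class ToyZk:
    def __init__(self):
        self.nodes = {}  # path -> (data, ephemeral owner session or None)

    def create(self, path, data, ephemeral_owner=None):
        self.nodes[path] = (data, ephemeral_owner)

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return [p for p in self.nodes if p.startswith(prefix)]

    def delete(self, path):
        # A plain ZooKeeper delete is non-recursive: a node that still
        # has children cannot be removed -- which is what JM 1 hit.
        if self.children(path):
            raise NotEmptyError(path)
        del self.nodes[path]

zk = ToyZk()
graph = "/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1"
zk.create(graph, b"serialized job graph handle")
# Lock child created by the old leader (JM 2) under its own session:
zk.create(graph + "/d833418c-891a-4b5e-b983-080be803275c",
          b"10.210.42.62", ephemeral_owner=0x30000003F4A0003)

try:
    zk.delete(graph)            # what JM 1 attempted on cancel
except NotEmptyError as e:
    print("delete failed:", e)  # stale lock child blocks the delete
```

The job graph node itself is persistent; only the lock children are ephemeral, so the parent survives until every lock child is gone.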
Looking at the contents of this ephemeral lock node:

[zk: localhost:2181(CONNECTED) 16] get /flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1/d833418c-891a-4b5e-b983-080be803275c
*10.210.42.62*
cZxid = 0x60002ffa7
ctime = Tue Jun 12 20:01:26 UTC 2018
mZxid = 0x60002ffa7
mtime = Tue Jun 12 20:01:26 UTC 2018
pZxid = 0x60002ffa7
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x30000003f4a0003
dataLength = 12
numChildren = 0

and comparing it to the ephemeral lock node of the currently running job:

[zk: localhost:2181(CONNECTED) 17] get /flink/cluster_1/jobgraphs/d77948df92813a68ea6dfd6783f40e7e/596a4add-9f5c-4113-99ec-9c942fe91172
*10.210.22.167*
cZxid = 0x60009df4b
ctime = Mon Jul 30 23:01:04 UTC 2018
mZxid = 0x60009df4b
mtime = Mon Jul 30 23:01:04 UTC 2018
pZxid = 0x60009df4b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2000001d2330001
dataLength = 13
numChildren = 0

Assuming the content of the nodes represents the owner, the job graph of the old canceled job, 2a4eff355aef849c5ca37dbac04f2ff1, is locked by the previous JM leader, JM 2 (10.210.42.62), while the running job is locked by the current JM leader, JM 1 (10.210.22.167). Somehow the previous leader, JM 2, did not give up the lock when leadership failed over to JM 1. Shouldn't something call ZooKeeperStateHandleStore.releaseAll during HA failover to release the locks on the graphs?

On Wed, Aug 1, 2018 at 9:49 AM Elias Levy <fearsome.lucid...@gmail.com> wrote:

> Thanks for the reply. Looking in ZK I see:
>
> [zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
> [d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]
>
> Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even
> though that job is no longer running (it was canceled while it was in a
> loop attempting to restart, but failing because of a lack of cluster slots).
>
> Any idea why that may be the case?
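For completeness, here is a second self-contained toy model (again plain Python, not the real ZooKeeper API) of what should happen: ephemeral znodes are removed when their owning session ends, after which the now-childless job graph node can be deleted. The catch in this incident is that JM 2 kept running as a standby, so its session stayed alive and its lock node survived; only an explicit release (e.g. something along the lines of releaseAll on failover) would have had the same unblocking effect as the session expiry modeled here.

```python
# Toy model (NOT the real ZooKeeper API) of why session expiry -- or an
# explicit lock release -- would unblock the delete: ephemeral nodes die
# with their owning session, and a retried delete then succeeds.

class ToyZk:
    def __init__(self):
        self.nodes = {}  # path -> owning session id (None = persistent)

    def create(self, path, owner=None):
        self.nodes[path] = owner

    def expire_session(self, session):
        # ZooKeeper removes every ephemeral node owned by a dead session.
        self.nodes = {p: o for p, o in self.nodes.items() if o != session}

    def delete(self, path):
        prefix = path + "/"
        if any(p.startswith(prefix) for p in self.nodes):
            raise RuntimeError("Directory not empty: " + path)
        del self.nodes[path]

OLD_LEADER = 0x30000003F4A0003   # JM 2's session (the ephemeralOwner above)
graph = "/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1"

zk = ToyZk()
zk.create(graph)                                          # persistent
zk.create(graph + "/d833418c-891a-4b5e-b983-080be803275c",
          owner=OLD_LEADER)                               # ephemeral lock

zk.expire_session(OLD_LEADER)  # lock vanishes with the owning session
zk.delete(graph)               # retry now succeeds: no children left
```

Because JM 2 never lost its session, the real cluster stayed stuck in the "before expiry" state of this model, which matches the stale job graph Elias observed.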