I can see in the logs that JM 1 (10.210.22.167), the one that became
leader after failover, thinks it deleted
the 2a4eff355aef849c5ca37dbac04f2ff1 job from ZK when the job was canceled:

July 30th 2018, 15:32:27.231 Trying to cancel job with ID
2a4eff355aef849c5ca37dbac04f2ff1.
July 30th 2018, 15:32:27.232 Job Some Job
(2a4eff355aef849c5ca37dbac04f2ff1) switched from state RESTARTING to
CANCELED.
July 30th 2018, 15:32:27.232 Stopping checkpoint coordinator for job
2a4eff355aef849c5ca37dbac04f2ff1
July 30th 2018, 15:32:27.239 Removed job graph
2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper.
July 30th 2018, 15:32:27.245 Removing
/flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper
July 30th 2018, 15:32:27.251 Removing
/checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1 from ZooKeeper

Both /flink/cluster_1/checkpoints/2a4eff355aef849c5ca37dbac04f2ff1
and /flink/cluster_1/checkpoint-counter/2a4eff355aef849c5ca37dbac04f2ff1 no
longer exist, but for some reason the job graph is still there.

Looking at the ZK logs, I found the problem:

July 30th 2018, 15:32:27.241 Got user-level KeeperException when processing
sessionid:0x2000001d2330001 type:delete cxid:0x434c zxid:0x60009dd94
txntype:-1 reqpath:n/a Error
Path:/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1
Error:KeeperErrorCode = Directory not empty for
/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1
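
That error is standard ZooKeeper behavior: delete() is not recursive, so
the job graph znode cannot be removed while any child is still under it.
A minimal sketch of that behavior with the plain ZooKeeper Java client
(the connection string and paths here are made up for illustration, and a
real client would wait for the connect event before issuing requests):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DeleteWithChildDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative connection string only.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        // Persistent parent plus an ephemeral child, mirroring the
        // jobgraphs/<job-id>/<lock-uuid> layout.
        zk.create("/demo", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/demo/lock", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        try {
            // delete() is not recursive: this fails while /demo/lock exists.
            zk.delete("/demo", -1);
        } catch (KeeperException.NotEmptyException e) {
            System.out.println("KeeperErrorCode = Directory not empty, as in the JM log");
        }

        zk.close();
    }
}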

Looking in ZK, we see:

[zk: localhost:2181(CONNECTED) 0] ls
/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1
[d833418c-891a-4b5e-b983-080be803275c]

From the comments in ZooKeeperStateHandleStore.java I gather that this
child node is used as a deletion lock.  Looking at the contents of this
ephemeral lock node:

[zk: localhost:2181(CONNECTED) 16] get
/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1/d833418c-891a-4b5e-b983-080be803275c
*10.210.42.62*
cZxid = 0x60002ffa7
ctime = Tue Jun 12 20:01:26 UTC 2018
mZxid = 0x60002ffa7
mtime = Tue Jun 12 20:01:26 UTC 2018
pZxid = 0x60002ffa7
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x30000003f4a0003
dataLength = 12
numChildren = 0

and compare it to the ephemeral lock node of the currently running job:

[zk: localhost:2181(CONNECTED) 17] get
/flink/cluster_1/jobgraphs/d77948df92813a68ea6dfd6783f40e7e/596a4add-9f5c-4113-99ec-9c942fe91172
*10.210.22.167*
cZxid = 0x60009df4b
ctime = Mon Jul 30 23:01:04 UTC 2018
mZxid = 0x60009df4b
mtime = Mon Jul 30 23:01:04 UTC 2018
pZxid = 0x60009df4b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2000001d2330001
dataLength = 13
numChildren = 0

Assuming the content of each node represents the owner, it seems the job
graph for the old canceled job, 2a4eff355aef849c5ca37dbac04f2ff1, is locked
by the previous JM leader, JM 2 (10.210.42.62), while the running job is
locked by the current JM leader, JM 1 (10.210.22.167).
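
That matches the stats above: the lock children are ephemeral, so they
live exactly as long as the session of the JM that created them, and
ephemeralOwner tells you which session that is. A minimal sketch of how
one could check a lock's owner programmatically (the path and connection
string are placeholders, not our actual setup):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LockOwnerCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        // Hypothetical lock path, shaped like the ones above.
        String lockPath = "/flink/cluster_1/jobgraphs/<job-id>/<lock-uuid>";

        Stat stat = zk.exists(lockPath, false);
        if (stat != null) {
            // Non-zero only for ephemeral nodes: the session id of the
            // creator. Compare it against the session ids in the logs
            // (0x30000003f4a0003 vs. 0x2000001d2330001 above).
            System.out.printf("lock owned by session 0x%x%n",
                    stat.getEphemeralOwner());
        }

        zk.close();
    }
}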

Somehow the previous leader, JM 2, did not give up the lock when leadership
failed over to JM 1.

Shouldn't something call ZooKeeperStateHandleStore.releaseAll during HA
failover to release the locks on the graphs?
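
In the meantime I assume the stale entry can be cleared by hand, by
removing the leftover lock child and then the job graph node itself. A
rough sketch with the plain ZooKeeper client (only an illustration;
obviously double-check that the job really is gone first):

import org.apache.zookeeper.ZooKeeper;

public class StaleJobGraphCleanup {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        String jobGraph =
            "/flink/cluster_1/jobgraphs/2a4eff355aef849c5ca37dbac04f2ff1";

        // Remove the stale lock child left behind by the old leader,
        // then the now-empty job graph node. Version -1 means "any".
        for (String child : zk.getChildren(jobGraph, false)) {
            zk.delete(jobGraph + "/" + child, -1);
        }
        zk.delete(jobGraph, -1);

        zk.close();
    }
}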


On Wed, Aug 1, 2018 at 9:49 AM Elias Levy <fearsome.lucid...@gmail.com>
wrote:

> Thanks for the reply.  Looking in ZK I see:
>
> [zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
> [d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]
>
> Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even
> though that job is no longer running (it was canceled while it was in a
> loop attempting to restart, but failing because of a lack of cluster slots).
>
> Any idea why that may be the case?
>