Hi Elias and Vino,
Sorry for the late reply.
I think your analysis is pretty much on point. The current
implementation does not properly handle the case of multiple
standby JobManagers. In the single JobManager case, a loss of leadership
either means that the JobManager has died and,
Till,
Thoughts?
On Wed, Aug 1, 2018 at 7:34 PM vino yang wrote:
Hi Elias,
Your analysis is correct. Yes, in theory the old jobgraph should be
deleted, but Flink currently locks the path and deletes it asynchronously,
so it cannot give you an acknowledgment of the deletion.
This is a risk point.
cc Till, there have been users who have
I can see in the logs that JM 1 (10.210.22.167), the one that became
leader after failover, thinks it deleted
the 2a4eff355aef849c5ca37dbac04f2ff1 job from ZK when it was canceled:
July 30th 2018, 15:32:27.231 Trying to cancel job with ID
2a4eff355aef849c5ca37dbac04f2ff1.
July 30th 2018,
Vino,
Thanks for the reply. Looking in ZK I see:
[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]
Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though
that job is no longer running (it
is the only basis for the JM
leader to restore the job.
There may be an unexpected recovery or an old job resurrection.
Thanks, vino.
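
A minimal sketch of the check discussed in this thread, assuming the job IDs under the `jobgraphs` znode and the IDs of the jobs actually running have already been collected by other means. The function name `find_stale_job_graphs` and both of its inputs are hypothetical and not part of Flink. It flags HA entries that a newly elected leader could recover unexpectedly:

```python
from typing import Iterable, List


def find_stale_job_graphs(
    zk_job_ids: Iterable[str], running_job_ids: Iterable[str]
) -> List[str]:
    """Return job IDs that still have HA state but are not running.

    Hypothetical helper: both inputs are plain lists of job ID strings
    gathered outside this function (for example from a ZK `ls` and from
    the cluster's list of running jobs).
    """
    running = set(running_job_ids)
    seen = set()
    stale = []
    for job_id in zk_job_ids:
        # Keep the listing order and skip repeated IDs.
        if job_id not in running and job_id not in seen:
            seen.add(job_id)
            stale.append(job_id)
    return stale


# IDs as listed under /flink/cluster_1/jobgraphs in this thread.
zk_listing = [
    "d77948df92813a68ea6dfd6783f40e7e",
    "2a4eff355aef849c5ca37dbac04f2ff1",
]
# Assumption for illustration only: the first job is still running.
running = ["d77948df92813a68ea6dfd6783f40e7e"]

stale = find_stale_job_graphs(zk_listing, running)
print(stale)
```

Any ID this reports would still need to be confirmed by hand before removing anything from ZK, since the running-jobs list is itself only a snapshot.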
2018-08-01 23:13 GMT+08:00 Elias Levy:
For the second time in as many months we've had an old job resurrected
during HA failover in a 1.4.2 standalone cluster. Failover was initiated
when the leading JM lost its connection to ZK. I opened FLINK-10011
<https://issues.apache.org/jira/browse/FLINK-10011> with the details.
We are