Re: Old job resurrected during HA failover

2018-08-03 Thread Till Rohrmann
Hi Elias and Vino, sorry for the late reply. I think your analysis is pretty much to the point. The current implementation does not properly respect the situation with multiple standby JobManagers. In the single JobManager case, a loss of leadership either means that the JobManager has died and,

Re: Old job resurrected during HA failover

2018-08-03 Thread Elias Levy
Till, Thoughts? On Wed, Aug 1, 2018 at 7:34 PM vino yang wrote: > Your analysis is correct, yes, in theory the old jobgraph should be > deleted, but Flink currently uses the method of locking and asynchronously > deleting Path, so that it can not give you the acknowledgment of deleting, > so

Re: Old job resurrected during HA failover

2018-08-01 Thread vino yang
Hi Elias, Your analysis is correct. In theory the old jobgraph should be deleted, but Flink currently locks the path and deletes it asynchronously, so it cannot give you an acknowledgment that the deletion happened. This is a risk point. cc Till, there have been users who have

Re: Old job resurrected during HA failover

2018-08-01 Thread Elias Levy
I can see in the logs that JM 1 (10.210.22.167), the one that became leader after failover, thinks it deleted the 2a4eff355aef849c5ca37dbac04f2ff1 job from ZK when it was canceled:
July 30th 2018, 15:32:27.231 Trying to cancel job with ID 2a4eff355aef849c5ca37dbac04f2ff1.
July 30th 2018,

Re: Old job resurrected during HA failover

2018-08-01 Thread Elias Levy
Vino, Thanks for the reply. Looking in ZK I see:
[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]
Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though that job is no longer running (it
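
The check described in this message (compare the children of the jobgraphs znode against the jobs that are actually running) could be sketched roughly as follows. This is only an illustration, not anything Flink provides: the function name `find_stale_job_graphs` is hypothetical, the ZK children are pasted in by hand from the `ls` output above rather than fetched from ZooKeeper, and the list of running jobs assumes that d77948df92813a68ea6dfd6783f40e7e is the one job still running, which the message does not state explicitly.

```python
def find_stale_job_graphs(zk_job_ids, running_job_ids):
    """Return job IDs that still have HA state in ZK but are not running.

    Hypothetical helper for illustration only; it does not talk to
    ZooKeeper or Flink, it just compares two collections of job IDs.
    """
    running = set(running_job_ids)
    return sorted(job_id for job_id in zk_job_ids if job_id not in running)


# Children of /flink/cluster_1/jobgraphs, copied from the zk CLI output.
zk_children = [
    "d77948df92813a68ea6dfd6783f40e7e",
    "2a4eff355aef849c5ca37dbac04f2ff1",
]

# ASSUMPTION: only this job is currently running on the cluster.
running_jobs = ["d77948df92813a68ea6dfd6783f40e7e"]

stale = find_stale_job_graphs(zk_children, running_jobs)
print(stale)
```

Any ID this reports is a candidate for the resurrection described in this thread, since a newly elected JM recovers whatever job graphs it finds in ZK.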

Re: Old job resurrected during HA failover

2018-08-01 Thread vino yang
is the only basis for the JM leader to restore the job. There may be an unexpected recovery or an old job resurrection. Thanks, vino. 2018-08-01 23:13 GMT+08:00 Elias Levy : > For the second time in as many months we've had an old job resurrected > during HA failover in a 1.4.2 standalone c

Old job resurrected during HA failover

2018-08-01 Thread Elias Levy
For the second time in as many months we've had an old job resurrected during HA failover in a 1.4.2 standalone cluster. Failover was initiated when the leading JM lost its connection to ZK. I opened FLINK-10011 <https://issues.apache.org/jira/browse/FLINK-10011> with the details. We are