For the second time in as many months, we've had an old job resurrected
during HA failover in a 1.4.2 standalone cluster.  Failover was initiated
when the leading JM lost its connection to ZK.  I opened FLINK-10011
<https://issues.apache.org/jira/browse/FLINK-10011> with the details.

We are using S3 with the Presto adapter as our distributed store.  We
cleaned up the cluster by shutting down the two jobs that were started
after failover and then starting a new job from the last known good
checkpoint of the single job that had been running before failover.  The
HA recovery directory now looks as follows:

s3cmd ls s3://bucket/flink/cluster_1/recovery/
                       DIR   s3://bucket/flink/cluster_1/recovery/some_job/
2018-07-31 17:33     35553   s3://bucket/flink/cluster_1/recovery/completedCheckpoint12e06bef01c5
2018-07-31 17:34     35553   s3://bucket/flink/cluster_1/recovery/completedCheckpoint187e0d2ae7cb
2018-07-31 17:32     35553   s3://bucket/flink/cluster_1/recovery/completedCheckpoint22fc8ca46f02
2018-06-12 20:01    284626   s3://bucket/flink/cluster_1/recovery/submittedJobGraph7f627a661cec
2018-07-30 23:01    285257   s3://bucket/flink/cluster_1/recovery/submittedJobGraphf3767780c00c

submittedJobGraph7f627a661cec appears to be job
2a4eff355aef849c5ca37dbac04f2ff1, the long-running job that failed during
the ZK failover.

submittedJobGraphf3767780c00c appears to be job
d77948df92813a68ea6dfd6783f40e7e, the job we started by restoring from a
checkpoint after shutting down the duplicate jobs.

Should submittedJobGraph7f627a661cec exist in the recovery directory if
2a4eff355aef849c5ca37dbac04f2ff1 is no longer running?
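
One cross-check we're considering is comparing the S3 listing against the
job graph znodes in ZK, since those znodes are what reference the
submittedJobGraph files.  Sketch only: the znode path assumes the default
high-availability.zookeeper.path.root (/flink) and jobgraphs sub-path plus
a cluster-id of cluster_1, and zk-host:2181 is a placeholder for the
quorum.

# List the job graph znodes ZK still tracks for this cluster; each child
# is a job ID whose payload points at a submittedJobGraph file in the
# recovery directory.
bin/zkCli.sh -server zk-host:2181 ls /flink/cluster_1/jobgraphs

If 2a4eff355aef849c5ca37dbac04f2ff1 is absent there, the
submittedJobGraph7f627a661cec file in S3 would seem to be an orphan that
Flink won't clean up on its own.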
