Hi Elias,

If a job is explicitly canceled, its JobGraph node in ZK is deleted.
However, note that Flink uses a background thread to delete the JobGraph
node asynchronously, so there may be cases where the deletion does not
complete and the node is left behind.
The JobGraph node in ZK is also the only basis the JM leader has for
restoring a job, so a leftover node can lead to an unexpected recovery,
that is, an old job being resurrected.
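
To make the failure mode concrete, here is a toy sketch in plain Python. It
is not Flink or ZooKeeper code: the dict standing in for the ZK nodes and the
names cancel_job and recover_jobs are made up for illustration only.

```python
# Toy model only: plain Python, not Flink or ZooKeeper code.
# The "store" dict stands in for the JobGraph nodes kept in ZK.

def cancel_job(store, job_id, delete_succeeds):
    """Cancel a job; removing its node is a separate step that may not complete."""
    if delete_succeeds:
        store.pop(job_id, None)
    # If the removal did not complete, the node simply stays behind.


def recover_jobs(store):
    """A new leader restores every job that still has a node in the store."""
    return sorted(store)


store = {"old-job": "graph-a", "new-job": "graph-b"}

# The removal is lost, for example because the ZK connection dropped.
cancel_job(store, "old-job", delete_succeeds=False)

# On failover the canceled job is picked up again next to the new one.
print(recover_jobs(store))  # ['new-job', 'old-job']
```

In this model the only way to avoid the resurrection is to make sure the
stale node is really gone before the next leader election.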

Thanks, vino.

2018-08-01 23:13 GMT+08:00 Elias Levy <fearsome.lucid...@gmail.com>:

> For the second time in as many months we've had an old job resurrected
> during HA failover in a 1.4.2 standalone cluster.  Failover was initiated
> when the leading JM lost its connection to ZK.  I opened FLINK-10011
> <https://issues.apache.org/jira/browse/FLINK-10011> with the details.
>
> We are using S3 with the Presto adapter as our distributed store.  After
> we cleaned up the cluster by shutting down the two jobs started after
> failover and starting a new job from the last known good checkpoint from
> the single job running in the cluster before failover, the HA recovery
> directory looks as follows:
>
> s3cmd ls s3://bucket/flink/cluster_1/recovery/
>  DIR s3://bucket/flink/cluster_1/recovery/some_job/
> 2018-07-31 17:33 35553 s3://bucket/flink/cluster_1/recovery/completedCheckpoint12e06bef01c5
> 2018-07-31 17:34 35553 s3://bucket/flink/cluster_1/recovery/completedCheckpoint187e0d2ae7cb
> 2018-07-31 17:32 35553 s3://bucket/flink/cluster_1/recovery/completedCheckpoint22fc8ca46f02
> 2018-06-12 20:01 284626 s3://bucket/flink/cluster_1/recovery/submittedJobGraph7f627a661cec
> 2018-07-30 23:01 285257 s3://bucket/flink/cluster_1/recovery/submittedJobGraphf3767780c00c
>
> submittedJobGraph7f627a661cec appears to be job
> 2a4eff355aef849c5ca37dbac04f2ff1, the long running job that failed during
> the ZK failover.
>
> submittedJobGraphf3767780c00c appears to be job
> d77948df92813a68ea6dfd6783f40e7e, the job we started restoring from a
> checkpoint after shutting down the duplicate jobs.
>
> Should submittedJobGraph7f627a661cec exist in the recovery directory if
> 2a4eff355aef849c5ca37dbac04f2ff1 is no longer running?
>
>
>
