[GitHub] flink pull request #2609: [FLINK-4717] Add CancelJobWithSavepoint
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2609 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2609: [FLINK-4717] Add CancelJobWithSavepoint
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/2609#discussion_r82980824 --- Diff: docs/setup/cli.md --- @@ -113,6 +113,10 @@ The command line can be used to ./bin/flink cancel +- Cancel a job with a savepoint: + +./bin/flink cancel -s [targetDirectory] --- End diff -- We could also change the savepoint trigger command to `savepoint -t [targetDir] `. Or we could allow `cancel [targetDirectory]` in addition to the `-s` option variant, but that might be slightly confusing, too. We need the `-s` option without an argument for cancel with savepoint to the default target. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2609: [FLINK-4717] Add CancelJobWithSavepoint
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/2609#discussion_r82805867 --- Diff: docs/setup/cli.md --- @@ -113,6 +113,10 @@ The command line can be used to ./bin/flink cancel +- Cancel a job with a savepoint: + +./bin/flink cancel -s [targetDirectory] --- End diff -- It is a little bit confusing that the `savepoint` command has a different argument order. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2609: [FLINK-4717] Add CancelJobWithSavepoint
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/2609#discussion_r82814608 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -581,6 +581,62 @@ class JobManager( ) } +case CancelJobWithSavepoint(jobId, savepointDirectory) => + try { +val targetDirectory = if (savepointDirectory != null) { + savepointDirectory +} else { + defaultSavepointDir +} + +log.info(s"Trying to cancel job $jobId with savepoint to $targetDirectory") + +currentJobs.get(jobId) match { + case Some((executionGraph, _)) => +// We don't want any checkpoint between the savepoint and cancellation +val coord = executionGraph.getCheckpointCoordinator +coord.stopCheckpointScheduler() --- End diff -- I think it's not enough to simply call `stopCheckpointScheduler`. If I'm not mistaken, then the following could happen: You call `stopCheckpointScheduler` which will try to `cancel` the last `currentPeriodicTrigger`. Now assume that the last `TimerTask` to trigger the next checkpoint has just been triggered but not executed (just before cancelling it). Now the `stopCheckpointScheduler` finishes without the `TimerTask` having completed. Now the `TimerTask` can still trigger a checkpoint even though we've stopped the checkpoint scheduler. The way to fix this (admittedly academic corner case), is to filter out outdated `TimerTask` calls in the `CheckpointCoordinator` by having a kind of fencing tokens for the trigger checkpoint calls. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2609: [FLINK-4717] Add CancelJobWithSavepoint
GitHub user uce opened a pull request: https://github.com/apache/flink/pull/2609 [FLINK-4717] Add CancelJobWithSavepoint This builds on top of #2608. - Adds CancelJobWithSavepoint message, which triggers a savepoint before cancelling the respective job. The periodic checkpoint scheduler is stopped before doing this so that no checkpoints happen between the savepoint and cancellation. Note that users could manually trigger another savepoint while cancel-with-savepoint happens, but I think this is tolerable. - Adds `-s [targetDirectory]` option to CLI cancel command: * Regular cancelling:`bin/flink cancel ` * Cancel with savepoint to default dir: `bin/flink cancel -s ` * Cancel with savepoint to target dir: `bin/flink cancel -s [targetDir] ` You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 4717-cancel_with_savepoint Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2609.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2609 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---