[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/6279 ---
[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/6279#discussion_r201869163 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java --- @@ -536,7 +540,25 @@ private JobManagerRunner createJobManagerRunner(JobGraph jobGraph) throws Except private void removeJobAndRegisterTerminationFuture(JobID jobId, boolean cleanupHA) { final CompletableFuture cleanupFuture = removeJob(jobId, cleanupHA); - registerOrphanedJobManagerTerminationFuture(cleanupFuture); + registerJobManagerRunnerTerminationFuture(jobId, cleanupFuture); + } + + private void registerJobManagerRunnerTerminationFuture(JobID jobId, CompletableFuture jobManagerRunnerTerminationFuture) { + Preconditions.checkState(!jobManagerTerminationFutures.containsKey(jobId)); + + jobManagerTerminationFutures.put(jobId, jobManagerRunnerTerminationFuture); + + // clean up the pending termination future + jobManagerRunnerTerminationFuture.thenRunAsync( + () -> { + final CompletableFuture terminationFuture = jobManagerTerminationFutures.remove(jobId); + + //noinspection ObjectEquality + if (terminationFuture != null && terminationFuture != jobManagerRunnerTerminationFuture) { + jobManagerTerminationFutures.put(jobId, terminationFuture); --- End diff -- It can happen because we also clear the termination future in the callback of the `Dispatcher#waitForTerminatingJobManager` method. ---
[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...
Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/6279#discussion_r201329956 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java --- @@ -536,7 +540,25 @@ private JobManagerRunner createJobManagerRunner(JobGraph jobGraph) throws Except private void removeJobAndRegisterTerminationFuture(JobID jobId, boolean cleanupHA) { final CompletableFuture cleanupFuture = removeJob(jobId, cleanupHA); - registerOrphanedJobManagerTerminationFuture(cleanupFuture); + registerJobManagerRunnerTerminationFuture(jobId, cleanupFuture); + } + + private void registerJobManagerRunnerTerminationFuture(JobID jobId, CompletableFuture jobManagerRunnerTerminationFuture) { + Preconditions.checkState(!jobManagerTerminationFutures.containsKey(jobId)); + + jobManagerTerminationFutures.put(jobId, jobManagerRunnerTerminationFuture); + + // clean up the pending termination future + jobManagerRunnerTerminationFuture.thenRunAsync( + () -> { + final CompletableFuture terminationFuture = jobManagerTerminationFutures.remove(jobId); + + //noinspection ObjectEquality + if (terminationFuture != null && terminationFuture != jobManagerRunnerTerminationFuture) { + jobManagerTerminationFutures.put(jobId, terminationFuture); --- End diff -- Here you handle a case where a terminationFuture for a job got replaced. Under what circumstances can this happen? Doesn't the `checkState`: `Preconditions.checkState(!jobManagerTerminationFutures.containsKey(jobId));` prevent this? ---
[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...
GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/6279 [FLINK-9706] Properly wait for termination of JobManagerRunner before restarting jobs ## What is the purpose of the change In order to avoid race conditions between resource clean up, we now wait for the proper termination of a previously running JobMaster responsible for the same job (e.g. originating from a job recovery or a re-submission). This PR also fixes [FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439). ## Brief change log - Cache per `JobManagerRunner` the termination future - Before submitting a job wait for the termination of a previously running `JobManagerRunner` responsible for the same `JobID` ## Verifying this change - Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and `DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination` - Before `DispatcherTest#testJobRecovery` and `DispatcherTest#testSubmittedJobGraphListener` failed due to not properly waiting for the termination ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixJobManagerRunnerTermination Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/6279.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6279 commit 0e3a19cfa083030f81458dfd36f9bab32d64577a Author: Till Rohrmann Date: 2018-07-06T10:38:25Z [hotfix] Exclude generated Avro types in flink-confluent-schema-registry from rat check commit a5d9ff2c16b47b87efc469196c320bd7ba492a95 Author: Till Rohrmann Date: 2018-07-07T08:53:38Z [FLINK-9706] Properly wait for termination of JobManagerRunner before restarting jobs In order to avoid race conditions between resource clean up, we now wait for the proper termination of a previously running JobMaster responsible for the same job (e.g. originating from a job recovery or a re-submission). ---