[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...

2018-07-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/6279


---


[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...

2018-07-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6279#discussion_r201869163
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java 
---
@@ -536,7 +540,25 @@ private JobManagerRunner 
createJobManagerRunner(JobGraph jobGraph) throws Except
private void removeJobAndRegisterTerminationFuture(JobID jobId, boolean 
cleanupHA) {
final CompletableFuture cleanupFuture = removeJob(jobId, 
cleanupHA);
 
-   registerOrphanedJobManagerTerminationFuture(cleanupFuture);
+   registerJobManagerRunnerTerminationFuture(jobId, cleanupFuture);
+   }
+
+   private void registerJobManagerRunnerTerminationFuture(JobID jobId, 
CompletableFuture jobManagerRunnerTerminationFuture) {
+   
Preconditions.checkState(!jobManagerTerminationFutures.containsKey(jobId));
+
+   jobManagerTerminationFutures.put(jobId, 
jobManagerRunnerTerminationFuture);
+
+   // clean up the pending termination future
+   jobManagerRunnerTerminationFuture.thenRunAsync(
+   () -> {
+   final CompletableFuture terminationFuture 
= jobManagerTerminationFutures.remove(jobId);
+
+   //noinspection ObjectEquality
+   if (terminationFuture != null && 
terminationFuture != jobManagerRunnerTerminationFuture) {
+   jobManagerTerminationFutures.put(jobId, 
terminationFuture);
--- End diff --

It can happen because we also clear the termination future in the callback 
of the `Dispatcher#waitForTerminatingJobManager` method.


---


[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...

2018-07-10 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/6279#discussion_r201329956
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java 
---
@@ -536,7 +540,25 @@ private JobManagerRunner 
createJobManagerRunner(JobGraph jobGraph) throws Except
private void removeJobAndRegisterTerminationFuture(JobID jobId, boolean 
cleanupHA) {
final CompletableFuture cleanupFuture = removeJob(jobId, 
cleanupHA);
 
-   registerOrphanedJobManagerTerminationFuture(cleanupFuture);
+   registerJobManagerRunnerTerminationFuture(jobId, cleanupFuture);
+   }
+
+   private void registerJobManagerRunnerTerminationFuture(JobID jobId, 
CompletableFuture jobManagerRunnerTerminationFuture) {
+   
Preconditions.checkState(!jobManagerTerminationFutures.containsKey(jobId));
+
+   jobManagerTerminationFutures.put(jobId, 
jobManagerRunnerTerminationFuture);
+
+   // clean up the pending termination future
+   jobManagerRunnerTerminationFuture.thenRunAsync(
+   () -> {
+   final CompletableFuture terminationFuture 
= jobManagerTerminationFutures.remove(jobId);
+
+   //noinspection ObjectEquality
+   if (terminationFuture != null && 
terminationFuture != jobManagerRunnerTerminationFuture) {
+   jobManagerTerminationFutures.put(jobId, 
terminationFuture);
--- End diff --

Here you handle a case where a terminationFuture for a job got replaced. 
Under what circumstances can this happen? Doesn't the `checkState`: 
`Preconditions.checkState(!jobManagerTerminationFutures.containsKey(jobId));` 
prevent this?


---


[GitHub] flink pull request #6279: [FLINK-9706] Properly wait for termination of JobM...

2018-07-07 Thread tillrohrmann
GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/6279

[FLINK-9706] Properly wait for termination of JobManagerRunner before 
restarting jobs

## What is the purpose of the change

In order to avoid race conditions between resource clean up, we now wait 
for the proper
termination of a previously running JobMaster responsible for the same job 
(e.g. originating
from a job recovery or a re-submission).

This PR also fixes 
[FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439).

## Brief change log

- Cache per `JobManagerRunner` the termination future
- Before submitting a job wait for the termination of a previously running 
`JobManagerRunner` responsible for the same `JobID`

## Verifying this change

- Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and 
`DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination`
- Before `DispatcherTest#testJobRecovery` and 
`DispatcherTest#testSubmittedJobGraphListener` failed due to not properly 
waiting for the termination

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink 
fixJobManagerRunnerTermination

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6279


commit 0e3a19cfa083030f81458dfd36f9bab32d64577a
Author: Till Rohrmann 
Date:   2018-07-06T10:38:25Z

[hotfix] Exclude generated Avro types in flink-confluent-schema-registry 
from rat check

commit a5d9ff2c16b47b87efc469196c320bd7ba492a95
Author: Till Rohrmann 
Date:   2018-07-07T08:53:38Z

[FLINK-9706] Properly wait for termination of JobManagerRunner before 
restarting jobs

In order to avoid race conditions between resource clean up, we now wait 
for the proper
termination of a previously running JobMaster responsible for the same job 
(e.g. originating
from a job recovery or a re-submission).




---