[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/4933 ---
[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/4933#discussion_r148498481 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java --- @@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || current == DEPLOYING) { // failing in the meantime may happen and is no problem. // anything else is a serious problem !!! if (current != FAILED) { - String message = String.format("Asynchronous race: Found state %s after successful cancel call.", state); + String message = String.format("Asynchronous race: Found %s in state %s after successful cancel call.", vertex.getTaskNameWithSubtaskIndex(), state); LOG.error(message); --- End diff -- Ok ---
[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/4933#discussion_r148399427 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java --- @@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || current == DEPLOYING) { // failing in the meantime may happen and is no problem. // anything else is a serious problem !!! if (current != FAILED) { - String message = String.format("Asynchronous race: Found state %s after successful cancel call.", state); + String message = String.format("Asynchronous race: Found %s in state %s after successful cancel call.", vertex.getTaskNameWithSubtaskIndex(), state); LOG.error(message); --- End diff -- Not really, because we reuse the message in the line below. Moreover, the logging statement is error and thus, will be evaluated in almost all cases. What one could argue is whether normal string concatenation wouldn't be faster than `String.format`. ---
[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/4933#discussion_r148399116 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/utils/SimpleAckingTaskManagerGateway.java --- @@ -48,6 +48,8 @@ private OptionaloptSubmitCondition; + private Optional optCancelCondition; --- End diff -- Jup it is. Fixed it. ---
[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/4933#discussion_r148306365 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java --- @@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || current == DEPLOYING) { // failing in the meantime may happen and is no problem. // anything else is a serious problem !!! if (current != FAILED) { - String message = String.format("Asynchronous race: Found state %s after successful cancel call.", state); + String message = String.format("Asynchronous race: Found %s in state %s after successful cancel call.", vertex.getTaskNameWithSubtaskIndex(), state); LOG.error(message); --- End diff -- nit: slf4j's `{}` placeholders should be used. ---
[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/4933#discussion_r148305773 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/utils/SimpleAckingTaskManagerGateway.java --- @@ -48,6 +48,8 @@ private OptionaloptSubmitCondition; + private Optional optCancelCondition; --- End diff -- This will be always `null` initialized. Isn't that a problem? ---
[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...
GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/4933 [FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices ## What is the purpose of the change One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the subsequent completeCancellingForAllVertices where not all execution are in state CANCELLING. The other race condition is between completeCancellingForAllVertices and the fixed delay restart without a delay. The problem is that the 10th task could have failed. In order to restart we would have to complete the cancel for the first 9 tasks. This is enough for the restart strategy to restart the job. If this happens before completeCancellingForAllVertices has also cancelled the execution of the 10th task, it could happen that we cancel a fresh execution. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink hardenExecutionGraphRestartTest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/4933.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4933 commit 76659cce74afb3045164c522a91b1b1688e34f38 Author: Till RohrmannDate: 2017-11-01T15:23:51Z [hotfix] Make WaitForTasks using an AtomicInteger commit b1701e31305b05488a6ff6b0c305193a13a68637 Author: Till Rohrmann Date: 2017-11-01T15:53:14Z [FLINK-7352] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the subsequent completeCancellingForAllVertices where not all execution are in state CANCELLING. The other race condition is between completeCancellingForAllVertices and the fixed delay restart without a delay. The problem is that the 10th task could have failed. In order to restart we would have to complete the cancel for the first 9 tasks. This is enough for the restart strategy to restart the job. If this happens before completeCancellingForAllVertices has also cancelled the execution of the 10th task, it could happen that we cancel a fresh execution. ---