[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/4933


---


[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-02 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/4933#discussion_r148498481
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
 ---
@@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || 
current == DEPLOYING) {
// failing in the meantime may happen and is no 
problem.
// anything else is a serious problem !!!
if (current != FAILED) {
-   String message = 
String.format("Asynchronous race: Found state %s after successful cancel 
call.", state);
+   String message = 
String.format("Asynchronous race: Found %s in state %s after successful cancel 
call.", vertex.getTaskNameWithSubtaskIndex(), state);
LOG.error(message);
--- End diff --

Ok


---


[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-01 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/4933#discussion_r148399427
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
 ---
@@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || 
current == DEPLOYING) {
// failing in the meantime may happen and is no 
problem.
// anything else is a serious problem !!!
if (current != FAILED) {
-   String message = 
String.format("Asynchronous race: Found state %s after successful cancel 
call.", state);
+   String message = 
String.format("Asynchronous race: Found %s in state %s after successful cancel 
call.", vertex.getTaskNameWithSubtaskIndex(), state);
LOG.error(message);
--- End diff --

Not really, because we reuse the message in the line below. Moreover, the 
logging statement is error and thus, will be evaluated in almost all cases. 
What one could argue is whether normal string concatenation wouldn't be faster 
than `String.format`.


---


[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-01 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/4933#discussion_r148399116
  
--- Diff: 
flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/utils/SimpleAckingTaskManagerGateway.java
 ---
@@ -48,6 +48,8 @@
 
private Optional optSubmitCondition;
 
+   private Optional optCancelCondition;
--- End diff --

Jup it is. Fixed it.


---


[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-01 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/4933#discussion_r148306365
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
 ---
@@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || 
current == DEPLOYING) {
// failing in the meantime may happen and is no 
problem.
// anything else is a serious problem !!!
if (current != FAILED) {
-   String message = 
String.format("Asynchronous race: Found state %s after successful cancel 
call.", state);
+   String message = 
String.format("Asynchronous race: Found %s in state %s after successful cancel 
call.", vertex.getTaskNameWithSubtaskIndex(), state);
LOG.error(message);
--- End diff --

nit: slf4j's `{}` placeholders should be used.


---


[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-01 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/4933#discussion_r148305773
  
--- Diff: 
flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/utils/SimpleAckingTaskManagerGateway.java
 ---
@@ -48,6 +48,8 @@
 
private Optional optSubmitCondition;
 
+   private Optional optCancelCondition;
--- End diff --

This will be always `null` initialized. Isn't that a problem?


---


[GitHub] flink pull request #4933: [FLINK-7960] [tests] Fix race conditions in Execut...

2017-11-01 Thread tillrohrmann
GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/4933

[FLINK-7960] [tests] Fix race conditions in 
ExecutionGraphRestartTest#completeCancellingForAllVertices

## What is the purpose of the change

One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 
1000) and the
subsequent completeCancellingForAllVertices where not all execution are in 
state
CANCELLING.

The other race condition is between completeCancellingForAllVertices and 
the fixed
delay restart without a delay. The problem is that the 10th task could have 
failed.
In order to restart we would have to complete the cancel for the first 9 
tasks. This
is enough for the restart strategy to restart the job. If this happens 
before
completeCancellingForAllVertices has also cancelled the execution of the 
10th task,
it could happen that we cancel a fresh execution.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink 
hardenExecutionGraphRestartTest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4933


commit 76659cce74afb3045164c522a91b1b1688e34f38
Author: Till Rohrmann 
Date:   2017-11-01T15:23:51Z

[hotfix] Make WaitForTasks using an AtomicInteger

commit b1701e31305b05488a6ff6b0c305193a13a68637
Author: Till Rohrmann 
Date:   2017-11-01T15:53:14Z

[FLINK-7352] [tests] Fix race conditions in 
ExecutionGraphRestartTest#completeCancellingForAllVertices

One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 
1000) and the
subsequent completeCancellingForAllVertices where not all execution are in 
state
CANCELLING.

The other race condition is between completeCancellingForAllVertices and 
the fixed
delay restart without a delay. The problem is that the 10th task could have 
failed.
In order to restart we would have to complete the cancel for the first 9 
tasks. This
is enough for the restart strategy to restart the job. If this happens 
before
completeCancellingForAllVertices has also cancelled the execution of the 
10th task,
it could happen that we cancel a fresh execution.




---