[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121318#comment-16121318
 ] 

ASF GitHub Bot commented on FLINK-7352:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/4501


> ExecutionGraphRestartTest timeouts
> --
>
> Key: FLINK-7352
> URL: https://issues.apache.org/jira/browse/FLINK-7352
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, Tests
>Affects Versions: 1.4.0, 1.3.2
>Reporter: Nico Kruber
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> Recently, I received timeouts from some tests in 
> {{ExecutionGraphRestartTest}} like this
> {code}
> Tests in error: 
>   ExecutionGraphRestartTest.testConcurrentLocalFailAndRestart:638 » Timeout
> {code}
> This particular instance is from 1.3.2 RC2 and stuck in 
> {{ExecutionGraphTestUtils#waitUntilDeployedAndSwitchToRunning()}} but I also 
> had instances stuck in {{ExecutionGraphTestUtils#waitUntilJobStatus}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121296#comment-16121296
 ] 

ASF GitHub Bot commented on FLINK-7352:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/4501
  
Travis passes. Merging this PR.




[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119545#comment-16119545
 ] 

ASF GitHub Bot commented on FLINK-7352:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/4501

[FLINK-7352] [tests] Stabilize ExecutionGraphRestartTest

## What is the purpose of the change

Introduce explicit waiting for the deployment of tasks. This replaces the loose
ordering induced by Thread.sleep and fixes the race conditions caused by it.

## Brief change log

- Introduce `WaitForTasks` consumer which is given to the `SimpleAckingTaskManagerGateway`
- Use a single `SimpleAckingTaskManagerGateway` to receive all task submission calls
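
As a rough illustration of the first item, a consumer that signals once a given number of submissions has arrived could look like the sketch below. `SubmissionCounter` is a hypothetical stand-in written for this note; it is not the `WaitForTasks` class from the PR, and how the real consumer is registered with `SimpleAckingTaskManagerGateway` is not shown here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical stand-in for a "wait for tasks" consumer; not the actual Flink class. */
final class SubmissionCounter<T> implements Consumer<T> {
    private final int expected;
    private final AtomicInteger seen = new AtomicInteger();
    private final CompletableFuture<Void> allSeen = new CompletableFuture<>();

    SubmissionCounter(int expected) {
        if (expected <= 0) {
            throw new IllegalArgumentException("expected must be positive");
        }
        this.expected = expected;
    }

    /** Called once per submission; completes the future when the last one arrives. */
    @Override
    public void accept(T submission) {
        if (seen.incrementAndGet() == expected) {
            allSeen.complete(null);
        }
    }

    /** Blocks until all expected submissions have arrived or the timeout expires. */
    void await(long timeout, TimeUnit unit) throws Exception {
        allSeen.get(timeout, unit);
    }

    /** Non-blocking check, useful in assertions. */
    boolean isDone() {
        return allSeen.isDone();
    }
}
```

A test would hand such a consumer to the gateway, trigger scheduling, and call `await(...)` instead of sleeping for a fixed time.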

## Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink fixExecutionGraphRestartTest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4501


commit 40cd0c860dd600ce2baa69b0f0ba8cf7a787ff63
Author: Till Rohrmann 
Date:   2017-08-09T07:57:56Z

[FLINK-7352] [tests] Stabilize ExecutionGraphRestartTest

Introduce explicit waiting for the deployment of tasks. This replaces the loose
ordering induced by Thread.sleep and fixes the race conditions caused by it.






[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-09 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119492#comment-16119492
 ] 

Till Rohrmann commented on FLINK-7352:
--

I think [~StephanEwen] is right and the problem is 
https://github.com/apache/flink/blob/master/flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphTestUtils.java#L203.
You can simulate it by removing the sleep and introducing a small sleep in 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L401.

I think the solution would be to wait on the {{SimpleAckingTaskManagerGateway}} 
until it has received all task submissions before switching the {{Executions}} 
to running.
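
To make the proposed ordering concrete, here is a minimal toy model, assuming a latch is an acceptable way to wait. None of the names below are Flink classes: `RecordingGateway` only mimics a gateway that records task submissions, and `deployAndWait` mimics a test that must not proceed until every submission has arrived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of the ordering problem; none of these names are Flink classes. */
final class DeployThenRunDemo {

    /** Hypothetical gateway that records submissions and counts down a latch. */
    static final class RecordingGateway {
        private final List<Integer> submitted = new ArrayList<>();
        private final CountDownLatch latch;

        RecordingGateway(int expectedTasks) {
            this.latch = new CountDownLatch(expectedTasks);
        }

        synchronized void submitTask(int taskIndex) {
            submitted.add(taskIndex);
            latch.countDown();
        }

        boolean awaitAll(long timeout, TimeUnit unit) throws InterruptedException {
            return latch.await(timeout, unit);
        }

        synchronized int submittedCount() {
            return submitted.size();
        }
    }

    /** Submits tasks asynchronously, then blocks until every submission has arrived. */
    static int deployAndWait(int parallelism) {
        RecordingGateway gateway = new RecordingGateway(parallelism);
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            for (int i = 0; i < parallelism; i++) {
                final int taskIndex = i;
                executor.execute(() -> gateway.submitTask(taskIndex));
            }
            // Explicit wait instead of Thread.sleep: returns as soon as all tasks arrived.
            if (!gateway.awaitAll(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("not all tasks were submitted in time");
            }
            // Only at this point would a test switch the executions to running.
            return gateway.submittedCount();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The point of the latch is that the wait ends exactly when the last submission is recorded, so the test no longer depends on how long the asynchronous deployment happens to take.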



[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-08 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118372#comment-16118372
 ] 

Till Rohrmann commented on FLINK-7352:
--

Another instance: https://travis-ci.org/apache/flink/jobs/262140336



[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110836#comment-16110836
 ] 

ASF GitHub Bot commented on FLINK-7352:
---

Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/4451
  
Please see https://issues.apache.org/jira/browse/FLINK-7352 and it seems 
you are right - increased timeouts do not solve this issue.




[jira] [Commented] (FLINK-7352) ExecutionGraphRestartTest timeouts

2017-08-02 Thread Nico Kruber (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110830#comment-16110830
 ] 

Nico Kruber commented on FLINK-7352:


Another run with the following snippet, and the failure is reproducible:

{code}
private static final Logger LOG =
    LoggerFactory.getLogger(ExecutionGraphRestartTest.class);

// Repeat the flaky test many times to make the timeout reproducible.
@Test
public void testConcurrentLocalFailAndRestart1000() throws Exception {
    for (int i = 0; i < 1000; ++i) {
        LOG.info("starting test run " + i);
        testConcurrentLocalFailAndRestart();
    }
}
{code}

{code}
14:48:28,242 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job recovers via failover strategy: full graph restart
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraphTestUtils  - Running initialization on master for job test job (102d4baecd0231e60647da78ee3d7bb6).
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraphTestUtils  - Successfully ran initialization on master in 0 ms.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job test job (102d4baecd0231e60647da78ee3d7bb6) switched from state CREATED to RUNNING.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (1/10) (555363d1489a0855bdd515635023df98) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (2/10) (29f7de673c17d484836a64bf2a0f38fb) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (3/10) (1958f95034522b1804ce9941244c4729) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (4/10) (d8c7216766e957b04a445b2f81e5bac2) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (5/10) (956d9cc0b60979387c25b2354f78f392) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (6/10) (d25fbc105698617b4fbb2f643e427c4a) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (7/10) (05fd3e4718a959fb2ecd337cc0ca0d72) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (8/10) (aac6ca828da7e204710cd56db5574b9e) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (9/10) (25e5a4296527b9fe14462d1be737b4df) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (10/10) (2437fd4ac959d003972606625371717c) switched from CREATED to SCHEDULED.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (1/10) (555363d1489a0855bdd515635023df98) switched from SCHEDULED to DEPLOYING.
14:48:28,243 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Deploying vertex (1/10) (attempt #0) to localhost
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (2/10) (29f7de673c17d484836a64bf2a0f38fb) switched from SCHEDULED to DEPLOYING.
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Deploying vertex (2/10) (attempt #0) to localhost
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (3/10) (1958f95034522b1804ce9941244c4729) switched from SCHEDULED to DEPLOYING.
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Deploying vertex (3/10) (attempt #0) to localhost
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (4/10) (d8c7216766e957b04a445b2f81e5bac2) switched from SCHEDULED to DEPLOYING.
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Deploying vertex (4/10) (attempt #0) to localhost
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (5/10) (956d9cc0b60979387c25b2354f78f392) switched from SCHEDULED to DEPLOYING.
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Deploying vertex (5/10) (attempt #0) to localhost
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - vertex (6/10) (d25fbc105698617b4fbb2f643e427c4a) switched from SCHEDULED to DEPLOYING.
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Deploying vertex (6/10) (attempt #0) to localhost
14:48:28,244 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - vertex (7/10)