[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544162#comment-15544162 ]

haosdent commented on MESOS-6180:
---------------------------------

Thanks a lot for [~greggomann]'s help!

> Several tests are flaky, with futures timing out early
> ------------------------------------------------------
>
>                 Key: MESOS-6180
>                 URL: https://issues.apache.org/jira/browse/MESOS-6180
>             Project: Mesos
>          Issue Type: Bug
>          Components: tests
>            Reporter: Greg Mann
>            Assignee: haosdent
>              Labels: mesosphere, tests
>         Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, RoleTest.ImplicitRoleRegister.txt, flaky-containerizer-pid-namespace-backward.txt, flaky-containerizer-pid-namespace-forward.txt
>
> Following the merging of a large patch chain, it was noticed on our internal CI that several tests had become flaky, with a similar pattern in the failures: the tests fail early when a future times out. Often, this occurs when a test cluster is being spun up and one of the offer futures times out.
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple of these.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543283#comment-15543283 ]

Greg Mann commented on MESOS-6180:
----------------------------------

AFAICT, these errors are due to performance issues on the AWS instances we're using for our CI. I'm closing this ticket and the linked tickets for now.
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508943#comment-15508943 ]

haosdent commented on MESOS-6180:
---------------------------------

Awesome! Thanks a lot for your help!
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508838#comment-15508838 ]

Greg Mann commented on MESOS-6180:
----------------------------------

Another common error seen when this issue manifests is:

{code}
Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
{code}

See the file {{RoleTest.ImplicitRoleRegister.txt}} for the full test log.

[~haosd...@gmail.com], there is a review [here|https://reviews.apache.org/r/41665/] proposing the {{in_memory}} registry for tests. I'm currently trying to figure out whether this is a legitimate bug or simply the result of an unreasonable load put on the machine.
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507560#comment-15507560 ]

haosdent commented on MESOS-6180:
---------------------------------

Many thanks to [~greggomann] for helping me reproduce this on my AWS instance! The reason I couldn't reproduce it before is that I ran {{stress}} and {{mesos-tests}} on a separate disk, distinct from the root disk, so {{stress}} didn't affect the root filesystem that Linux was using. If I run {{stress}} on the root disk and {{mesos-tests}} on the separate disk, the failure reproduces within a few test iterations.

A workaround is to set {{flags.registry = "in_memory"}} when running the tests; I have not reproduced the errors since using it. That said, I now think these test failures should be expected, because the root filesystem could not operate normally under that load. Do you think we should use {{flags.registry = "in_memory"}}, or just ignore these failures? cc [~jieyu] [~vinodkone] [~kaysoky]
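The disk observation above (running {{stress}} on a disk other than the root disk never perturbed the root filesystem) can be checked before a run by confirming which mount backs each directory. A minimal sketch; the path {{/tmp}} is only an example and should be replaced with the {{stress}} working directory and the Mesos {{work_dir}}:

```shell
# Print the mount point backing a directory, to verify whether the stress
# workload and the test work_dir would share a disk. `df -P` guarantees a
# single portable output line per filesystem.
df -P /tmp | awk 'NR==2 {print $6}'
```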
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500170#comment-15500170 ]

haosdent commented on MESOS-6180:
---------------------------------

I tried to reproduce this with {{stress}} on an AWS instance (16 CPUs, 32 GB memory, Ubuntu 14.04), but could not reproduce it there either.
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499226#comment-15499226 ]

haosdent commented on MESOS-6180:
---------------------------------

| Wait for the agent to finish registering in the test case. | https://reviews.apache.org/r/51985/ |
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499128#comment-15499128 ]

haosdent commented on MESOS-6180:
---------------------------------

It looks like it is blocked in {{ReplicaProcess::write}}, which needs to read from leveldb. I have not yet reproduced this with {{stress}} after 3000 iterations, on both a physical machine (4 CPUs, 32 GB memory, Ubuntu 14.04) and my local VirtualBox VM (4 CPUs, 8 GB memory).
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497951#comment-15497951 ]

Greg Mann commented on MESOS-6180:
----------------------------------

Thanks for the patch to address the mount leak, [~jieyu]! (https://reviews.apache.org/r/51963/)

I ran {{sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 GTEST_FILTER="*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace*" bin/mesos-tests.sh}} and stressed my machine with {{stress -c N -i N -m N -d 1}}, where {{N}} is the number of cores, and I was able to reproduce a couple of these offer future timeout failures after a few tens of repetitions. I attached the logs above as {{flaky-containerizer-pid-namespace-forward.txt}} and {{flaky-containerizer-pid-namespace-backward.txt}}.

We can see the master beginning agent registration, but we never see the line {{Registered agent ...}} from {{Master::_registerSlave()}}, which would indicate that registration is complete and the registered message has been sent to the agent:

{code}
I0917 01:35:17.184216   480 master.cpp:4886] Registering agent at slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) with id fa7a42d0-5d0c-4799-b19f-2a85b43039f3-S0
I0917 01:35:17.184232   474 process.cpp:2707] Resuming __reaper__(1)@172.31.1.104:57341 at 2016-09-17 01:35:17.184222976+00:00
I0917 01:35:17.184377   474 process.cpp:2707] Resuming registrar(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184371968+00:00
I0917 01:35:17.184554   474 registrar.cpp:464] Applied 1 operations in 79217ns; attempting to update the registry
I0917 01:35:17.184953   474 process.cpp:2697] Spawned process __latch__(141)@172.31.1.104:57341
I0917 01:35:17.184990   485 process.cpp:2707] Resuming log-storage(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184982016+00:00
I0917 01:35:17.185561   485 process.cpp:2707] Resuming log-writer(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185552896+00:00
I0917 01:35:17.185609   485 log.cpp:577] Attempting to append 434 bytes to the log
I0917 01:35:17.185804   485 process.cpp:2707] Resuming log-coordinator(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185797888+00:00
I0917 01:35:17.185863   485 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 3
I0917 01:35:17.185998   485 process.cpp:2697] Spawned process log-write(29)@172.31.1.104:57341
I0917 01:35:17.186030   475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186021888+00:00
I0917 01:35:17.186189   475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186182912+00:00
I0917 01:35:17.186275   475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186267904+00:00
I0917 01:35:17.186424   475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186416896+00:00
I0917 01:35:17.186575   475 process.cpp:2697] Spawned process __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.186724   475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186717952+00:00
I0917 01:35:17.186609   485 process.cpp:2707] Resuming __req_res__(55)@172.31.1.104:57341 at 2016-09-17 01:35:17.186601984+00:00
I0917 01:35:17.186898   485 process.cpp:2707] Resuming log-replica(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186892032+00:00
I0917 01:35:17.186962   485 replica.cpp:537] Replica received write request for position 3 from __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.185014   471 process.cpp:2707] Resuming __gc__@172.31.1.104:57341 at 2016-09-17 01:35:17.185008896+00:00
I0917 01:35:17.185036   480 process.cpp:2707] Resuming __latch__(141)@172.31.1.104:57341 at 2016-09-17 01:35:17.185029120+00:00
I0917 01:35:17.196358   482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.196335104+00:00
I0917 01:35:17.196900   482 slave.cpp:1471] Will retry registration in 25.224033ms if necessary
I0917 01:35:17.197029   482 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.197024000+00:00
I0917 01:35:17.197157   482 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress
I0917 01:35:17.224309   482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.224284928+00:00
I0917 01:35:17.224845   482 slave.cpp:1471] Will retry registration in 63.510932ms if necessary
I0917 01:35:17.224900   475 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.224888064+00:00
I0917 01:35:17.225109   475 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress
{code}
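The reproduction recipe above can be sketched as follows. This is a sketch, not a verified script: it assumes a Linux box with {{stress}} installed and a Mesos build tree, and it only echoes the two commands (with {{N}} sized to the core count) so they can be reviewed before actually running them:

```shell
# Build the stress and test-repetition commands from the core count.
# Echoed rather than executed so the sketch is safe to run anywhere.
N=$(nproc)
echo "stress -c $N -i $N -m $N -d 1 &"
echo "sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1" \
     "GTEST_FILTER='*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace*'" \
     "bin/mesos-tests.sh"
```

{{GTEST_REPEAT=-1}} repeats the filtered tests indefinitely, and {{GTEST_BREAK_ON_FAILURE=1}} stops at the first failure, which is what makes the flake reproducible under load.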
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496966#comment-15496966 ]

Vinod Kone commented on MESOS-6180:
-----------------------------------

Looking at `CGROUPS_ROOT_PidNamespaceForward`, the TASK_LOST is expected because the test doesn't wait for the TASK_RUNNING update before terminating the agent.

{quote}
  Future<Message> registerExecutorMessage =
    FUTURE_MESSAGE(Eq(RegisterExecutorMessage().GetTypeName()), _, _);

  driver.launchTasks(offers1.get()[0].id(), {task1});

  AWAIT_READY(registerExecutorMessage);

  Future<hashset<ContainerID>> containers = containerizer->containers();
  AWAIT_READY(containers);
  EXPECT_EQ(1u, containers.get().size());

  ContainerID containerId = *(containers.get().begin());

  // Stop the slave.
  slave.get()->terminate();
{quote}
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496724#comment-15496724 ]

haosdent commented on MESOS-6180:
---------------------------------

Yep, the order of the log you mentioned is correct as well. Let's split it into stdout and stderr.

{code:title=grep -v 'W:' (stdout)|borderStyle=solid}
[02:57:42] : [Step 10/10] [ RUN ] SlaveRecoveryTest/0.ReconnectHTTPExecutor
[02:57:43] : [Step 10/10] Received SUBSCRIBED event
[02:57:43] : [Step 10/10] Subscribed executor on ip-172-30-2-23.mesosphere.io
[02:57:43] : [Step 10/10] Received LAUNCH event
[02:57:43] : [Step 10/10] Starting task c1ba3f0b-2f6a-46a1-b752-592394c6d726
[02:57:43] : [Step 10/10] /mnt/teamcity/work/4240ba9ddd0997c3/build/src/mesos-containerizer launch --command="{"shell":true,"value":"sleep 1000"}" --help="false" --unshare_namespace_mnt="false"
[02:57:43] : [Step 10/10] Forked command at 4653
[02:57:43] : [Step 10/10] Received ERROR event
[02:57:43] : [Step 10/10] Received ERROR event
[02:57:58] : [Step 10/10] ../../src/tests/slave_recovery_tests.cpp:510: Failure
[02:57:58] : [Step 10/10] Failed to wait 15secs for status
[02:57:58] : [Step 10/10] ../../src/tests/slave_recovery_tests.cpp:491: Failure
[02:57:58] : [Step 10/10] Actual function call count doesn't match EXPECT_CALL(sched, statusUpdate(_, _))...
[02:57:58] : [Step 10/10] Expected: to be called at least once
[02:57:58] : [Step 10/10]   Actual: never called - unsatisfied and active
[02:58:13] : [Step 10/10] ../../src/tests/cluster.cpp:560: Failure
[02:58:13] : [Step 10/10] Failed to wait 15secs for wait
[02:59:18] : [Step 10/10] [ FAILED ] SlaveRecoveryTest/0.ReconnectHTTPExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer (95963 ms)
{code}

{code:title=grep 'W:' (stdout - SlaveRecoveryTest/0.RecoverStatusUpdateManager)|borderStyle=solid}
[02:59:18]W: [Step 10/10] I0915 02:57:42.726838 24222 hierarchical.cpp:1770] No inverse offers to send out!
[02:59:18]W: [Step 10/10] I0915 02:57:42.726851 24222 hierarchical.cpp:1271] Performed allocation for 1 agents in 80513ns
[02:59:18]W: [Step 10/10] I0915 02:57:42.929819 24218 slave.cpp:3521] Cleaning up un-reregistered executors
[02:59:18]W: [Step 10/10] I0915 02:57:42.929872 24218 slave.cpp:5197] Finished recovery
[02:59:18]W: [Step 10/10] I0915 02:57:42.930137 24218 slave.cpp:5369] Querying resource estimator for oversubscribable resources
[02:59:18]W: [Step 10/10] I0915 02:57:42.930229 24220 slave.cpp:5383] Received oversubscribable resources from the resource estimator
[02:59:18]W: [Step 10/10] I0915 02:57:42.930289 24220 slave.cpp:911] New master detected at master@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930301 24220 slave.cpp:970] Authenticating with master master@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930315 24220 slave.cpp:981] Using default CRAM-MD5 authenticatee
[02:59:18]W: [Step 10/10] I0915 02:57:42.930336 24217 status_update_manager.cpp:177] Pausing sending status updates
[02:59:18]W: [Step 10/10] I0915 02:57:42.930364 24220 slave.cpp:943] Detecting new master
[02:59:18]W: [Step 10/10] I0915 02:57:42.930382 24217 authenticatee.cpp:121] Creating new client SASL connection
[02:59:18]W: [Step 10/10] I0915 02:57:42.930631 24216 master.cpp:6234] Authenticating slave(353)@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930697 24219 authenticator.cpp:414] Starting authentication session for crammd5-authenticatee(755)@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930804 24218 authenticator.cpp:98] Creating new server SASL connection
[02:59:18]W: [Step 10/10] I0915 02:57:42.930964 24218 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
[02:59:18]W: [Step 10/10] I0915 02:57:42.930977 24218 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
[02:59:18]W: [Step 10/10] I0915 02:57:42.931010 24218 authenticator.cpp:204] Received SASL authentication start
[02:59:18]W: [Step 10/10] I0915 02:57:42.931037 24218 authenticator.cpp:326] Authentication requires more steps
[02:59:18]W: [Step 10/10] I0915 02:57:42.931064 24218 authenticatee.cpp:259] Received SASL authentication step
[02:59:18]W: [Step 10/10] I0915 02:57:42.931098 24218 authenticator.cpp:232] Received SASL authentication step
[02:59:18]W: [Step 10/10] I0915 02:57:42.931109 24218 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: 'ip-172-30-2-23.mesosphere.io' server FQDN: 'ip-172-30-2-23.mesosphere.io' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false
[02:59:18]W: [Step 10/10] I0915 02:57:42.931114 24218 auxprop.cpp:181] Looking up auxiliary property '*userPassword'
{code}
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496662#comment-15496662 ]

Greg Mann commented on MESOS-6180:
----------------------------------

Thanks for the patches, [~haosd...@gmail.com]!! I'll review and do some testing this morning.

Regarding the interleaving: for example, in the log posted in MESOS-6164 we find the line:

{code}
Checkpointing framework pid 'scheduler-26d5bb2d-7233-4725-9755-169f84aee769@172.30.2.23:32968' to '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_RecoverStatusUpdateManager_w0ToCt/meta/slaves/d22b6309-24c3-422f-a501-a672e7c3e046-S0/frameworks/d22b6309-24c3-422f-a501-a672e7c3e046-/framework.pid'
{code}

which indicates that this output can be attributed to {{SlaveRecoveryTest.RecoverStatusUpdateManager}}. I think {{SlaveRecoveryTest.ReconnectHTTPExecutor}} begins much later, with the line: {{I0915 02:57:42.981866 24202 cluster.cpp:157] Creating default 'local' authorizer}}.
[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495932#comment-15495932 ]

haosdent commented on MESOS-6180:
---------------------------------

[~greggomann] I used {{grep 'W:'}} and {{grep -v 'W:'}} to split the stdout/stderr of the logs in MESOS-6164, MESOS-6165, and MESOS-6166. It looks like their logs are not overlapping. Do you have some examples of overlap that don't fit this?
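The stream split described above can be demonstrated on a small fabricated sample (the lines below are made up for illustration, and {{/tmp/mesos-ci-sample.log}} is an arbitrary scratch path): the CI prefixes the glog/stderr lines with {{W:}}, so {{grep}} can separate the two streams of a single run.

```shell
# Fabricated CI log: stderr lines carry a "W:" after the timestamp.
cat > /tmp/mesos-ci-sample.log <<'EOF'
[02:57:42] : [Step 10/10] [ RUN ] SlaveRecoveryTest/0.ReconnectHTTPExecutor
[02:59:18]W: [Step 10/10] I0915 02:57:42.726838 24222 hierarchical.cpp:1770] No inverse offers to send out!
[02:57:58] : [Step 10/10] Failed to wait 15secs for status
EOF

grep -v 'W:' /tmp/mesos-ci-sample.log   # the test binary's stdout
grep 'W:' /tmp/mesos-ci-sample.log      # glog output (stderr)
```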
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495249#comment-15495249 ] haosdent commented on MESOS-6180: - It looks like this is related to the {{namespaces/pid}} isolator leaking files; I could reproduce some of the failures after running the tests repeatedly. Let me try to fix this.
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495213#comment-15495213 ] Jie Yu commented on MESOS-6180: --- This test looks suspicious to me; the log interleaving starts from here, and the TASK_LOST is not expected.
{noformat}
[23:32:52] : [Step 10/10] [ RUN ] MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
[23:32:52]W: [Step 10/10] I0915 23:32:52.347380 29518 cluster.cpp:157] Creating default 'local' authorizer
[23:32:52]W: [Step 10/10] I0915 23:32:52.350111 29518 leveldb.cpp:174] Opened db in 2.618094ms
[23:32:52]W: [Step 10/10] I0915 23:32:52.350518 29518 leveldb.cpp:181] Compacted db in 390273ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350536 29518 leveldb.cpp:196] Created db iterator in 3479ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350543 29518 leveldb.cpp:202] Seeked to beginning of db in 464ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350548 29518 leveldb.cpp:271] Iterated through 0 keys in the db in 364ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350558 29518 replica.cpp:776] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
[23:32:52]W: [Step 10/10] I0915 23:32:52.350740 29532 recover.cpp:451] Starting replica recovery
[23:32:52]W: [Step 10/10] I0915 23:32:52.350931 29533 recover.cpp:477] Replica is in EMPTY status
[23:32:52]W: [Step 10/10] I0915 23:32:52.351176 29536 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from __req_res__(4863)@172.30.2.144:39560
[23:32:52]W: [Step 10/10] I0915 23:32:52.351282 29534 recover.cpp:197] Received a recover response from a replica in EMPTY status
[23:32:52]W: [Step 10/10] I0915 23:32:52.351387 29537 recover.cpp:568] Updating replica status to STARTING
[23:32:52]W: [Step 10/10] I0915 23:32:52.351835 29535 master.cpp:380] Master b8554850-0e42-40dd--58d6c6f19074 (ip-172-30-2-144.ec2.internal.mesosphere.io) started on 172.30.2.144:39560
[23:32:52]W: [Step 10/10] I0915 23:32:52.351847 29535 master.cpp:382] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/8wMNif/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/8wMNif/master" --zk_session_timeout="10secs"
[23:32:52]W: [Step 10/10] I0915 23:32:52.351948 29535 master.cpp:432] Master only allowing authenticated frameworks to register
[23:32:52]W: [Step 10/10] I0915 23:32:52.351954 29535 master.cpp:446] Master only allowing authenticated agents to register
[23:32:52]W: [Step 10/10] I0915 23:32:52.351958 29535 master.cpp:459] Master only allowing authenticated HTTP frameworks to register
[23:32:52]W: [Step 10/10] I0915 23:32:52.351963 29535 credentials.hpp:37] Loading credentials for authentication from '/tmp/8wMNif/credentials'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352077 29535 master.cpp:504] Using default 'crammd5' authenticator
[23:32:52]W: [Step 10/10] I0915 23:32:52.352133 29535 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352217 29535 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352254 29535 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352289 29535 master.cpp:584] Authorization enabled
[23:32:52]W: [Step 10/10] I0915 23:32:52.352322 29537 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 841411ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.352339 29537 replica.cpp:320] Persisted replica status to STARTING
[23:32:52]W: [Step 10/10] I0915 23:32:52.352345 29533 whitelist_watcher.cpp:77] No whitelist given
[23:32:52]W: [Step 10/10] I0915 23:32:52.352377 29539 hierarchical.cpp:149] Initialized hierarchical allocator process
[23:32:52]W: [Step 10/10] I0915
{noformat}
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495187#comment-15495187 ] Greg Mann commented on MESOS-6180: -- [~haosd...@gmail.com]: unfortunately I think that there is some interleaving going on in these logs, sorry :( I need to sort through the log output and make sure I've matched up the logs with the correct test cases. It's possible that I've attributed the failures to the incorrect tests; I'll comment here and post revised logs when I have it sorted out.
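As a rough aid for sorting out interleaved output like the dump in Jie Yu's comment above, the glog lines can be split by thread ID. This is a sketch under stated assumptions, not part of any Mesos tooling: it assumes the TeamCity prefix shown in that dump, which puts the glog thread ID in the sixth whitespace-separated field, and the sample lines and file paths are hypothetical:

```shell
#!/bin/sh
# De-interleave a glog-style CI log into one file per thread ID.
# With the '[23:32:52]W: [Step 10/10] I0915 23:32:52.350740 29532 ...'
# prefix, the thread ID (29532 here) is the 6th whitespace-separated field.
cat > /tmp/interleaved.log <<'EOF'
[23:32:52]W: [Step 10/10] I0915 23:32:52.350740 29532 recover.cpp:451] Starting replica recovery
[23:32:52]W: [Step 10/10] I0915 23:32:52.350931 29533 recover.cpp:477] Replica is in EMPTY status
[23:32:52]W: [Step 10/10] I0915 23:32:52.351387 29537 recover.cpp:568] Updating replica status to STARTING
EOF

# Redirect each line to /tmp/thread-<tid>.log based on field 6.
awk '{ print > ("/tmp/thread-" $6 ".log") }' /tmp/interleaved.log
```

Grouping by thread does not attribute lines to test cases by itself, but it makes it much easier to see where output from a second, concurrently finishing test bleeds into a log.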
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495158#comment-15495158 ] Greg Mann commented on MESOS-6180: -- Thanks [~haosd...@gmail.com]! There are logs for a few of the cases in the following tickets: MESOS-6164, MESOS-6165, and MESOS-6166.
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495137#comment-15495137 ] haosdent commented on MESOS-6180: - [~greggomann] Would you provide one log as an example, since the tests fail with similar errors?