[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497951#comment-15497951 ] Greg Mann edited comment on MESOS-6180 at 9/19/16 8:38 PM:
---
Thanks for the patch to address the mount leak, [~jieyu]! (https://reviews.apache.org/r/51963/)

I ran {{sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 GTEST_FILTER="\*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace\*" bin/mesos-tests.sh}} while stressing my machine with {{stress -c N -i N -m N -d 1}}, where {{N}} is the number of cores, and I was able to reproduce a couple of these offer-future timeout failures after a few tens of repetitions. I attached the logs above as {{flaky-containerizer-pid-namespace-forward.txt}} and {{flaky-containerizer-pid-namespace-backward.txt}}.

We can see the master beginning agent registration, but we never see the line {{Registered agent ...}} from {{Master::_registerSlave()}}, which would indicate that registration completed and the registered message was sent to the agent:

{code}
I0917 01:35:17.184216 480 master.cpp:4886] Registering agent at slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) with id fa7a42d0-5d0c-4799-b19f-2a85b43039f3-S0
I0917 01:35:17.184232 474 process.cpp:2707] Resuming __reaper__(1)@172.31.1.104:57341 at 2016-09-17 01:35:17.184222976+00:00
I0917 01:35:17.184377 474 process.cpp:2707] Resuming registrar(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184371968+00:00
I0917 01:35:17.184554 474 registrar.cpp:464] Applied 1 operations in 79217ns; attempting to update the registry
I0917 01:35:17.184953 474 process.cpp:2697] Spawned process __latch__(141)@172.31.1.104:57341
I0917 01:35:17.184990 485 process.cpp:2707] Resuming log-storage(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184982016+00:00
I0917 01:35:17.185561 485 process.cpp:2707] Resuming log-writer(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185552896+00:00
I0917 01:35:17.185609 485 log.cpp:577] Attempting to append 434 bytes to the log
I0917 01:35:17.185804 485 process.cpp:2707] Resuming log-coordinator(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185797888+00:00
I0917 01:35:17.185863 485 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 3
I0917 01:35:17.185998 485 process.cpp:2697] Spawned process log-write(29)@172.31.1.104:57341
I0917 01:35:17.186030 475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186021888+00:00
I0917 01:35:17.186189 475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186182912+00:00
I0917 01:35:17.186275 475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186267904+00:00
I0917 01:35:17.186424 475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186416896+00:00
I0917 01:35:17.186575 475 process.cpp:2697] Spawned process __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.186724 475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186717952+00:00
I0917 01:35:17.186609 485 process.cpp:2707] Resuming __req_res__(55)@172.31.1.104:57341 at 2016-09-17 01:35:17.186601984+00:00
I0917 01:35:17.186898 485 process.cpp:2707] Resuming log-replica(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186892032+00:00
I0917 01:35:17.186962 485 replica.cpp:537] Replica received write request for position 3 from __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.185014 471 process.cpp:2707] Resuming __gc__@172.31.1.104:57341 at 2016-09-17 01:35:17.185008896+00:00
I0917 01:35:17.185036 480 process.cpp:2707] Resuming __latch__(141)@172.31.1.104:57341 at 2016-09-17 01:35:17.185029120+00:00
I0917 01:35:17.196358 482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.196335104+00:00
I0917 01:35:17.196900 482 slave.cpp:1471] Will retry registration in 25.224033ms if necessary
I0917 01:35:17.197029 482 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.197024000+00:00
I0917 01:35:17.197157 482 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress
I0917 01:35:17.224309 482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.224284928+00:00
I0917 01:35:17.224845 482 slave.cpp:1471] Will retry registration in 63.510932ms if necessary
I0917 01:35:17.224900 475 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.224888064+00:00
I0917 01:35:17.225109 475 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress
{code}

was (Author: greggomann): Thanks for the patch to
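For anyone else chasing this, here is a minimal sketch of the repro loop described in the comment above; it assumes bash, {{nproc}} from coreutils, and {{stress(1)}} are available, and that it is run from the Mesos build directory:

{code}
#!/usr/bin/env bash
# Sketch of the repro recipe from the comment above (assumes bash,
# coreutils nproc, and stress(1) are installed; run from the Mesos
# build directory).
N=$(nproc)  # N = number of cores, as in the comment

# Generate background load on CPU, I/O, memory, and disk.
stress -c "$N" -i "$N" -m "$N" -d 1 &
STRESS_PID=$!

# Loop the flaky test forever (GTEST_REPEAT=-1) and stop at the
# first failure (GTEST_BREAK_ON_FAILURE=1).
sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 \
  GTEST_FILTER="*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace*" \
  bin/mesos-tests.sh

# Stop the background load once the test loop exits.
kill "$STRESS_PID"
{code}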
[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500170#comment-15500170 ] haosdent edited comment on MESOS-6180 at 9/19/16 4:13 PM:
--
Tried to reproduce with {{stress}} on an AWS instance (16 CPUs, 30 GB mem, Ubuntu 14.04, c4.4xlarge), but could not reproduce it after 5429 iterations either.

was (Author: haosd...@gmail.com): Tried to reproduce with {{stress}} on an AWS instance (16 CPUs, 32 GB mem, Ubuntu 14.04), but could not reproduce it after 5429 iterations either.

> Several tests are flaky, with futures timing out early
> ------------------------------------------------------
>
>         Key: MESOS-6180
>         URL: https://issues.apache.org/jira/browse/MESOS-6180
>     Project: Mesos
>  Issue Type: Bug
>  Components: tests
>    Reporter: Greg Mann
>    Assignee: haosdent
>      Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, flaky-containerizer-pid-namespace-backward.txt, flaky-containerizer-pid-namespace-forward.txt
>
> Following the merging of a large patch chain, it was noticed on our internal CI that several tests had become flaky, with a similar pattern in the failures: the tests fail early when a future times out. Often, this occurs when a test cluster is being spun up and one of the offer futures times out. This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple of these.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
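A quick way to check whether a repro run hit the same failure mode as the logs above (registration starting but never completing) is to compare the two registration log lines. This is a sketch only; the grep patterns come from the log excerpt above, and the file name assumes output was captured like the attached {{flaky-containerizer-pid-namespace-forward.txt}}:

{code}
# Count registration attempts vs. completions in a captured test log.
# "Registering agent ..." is logged when the master begins admitting the
# agent; the "Registered agent ..." line from Master::_registerSlave()
# appears only once registration completes. In the failing runs the
# second count is zero.
grep -c "Registering agent" flaky-containerizer-pid-namespace-forward.txt
grep -c "Registered agent"  flaky-containerizer-pid-namespace-forward.txt
{code}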
[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500170#comment-15500170 ] haosdent edited comment on MESOS-6180 at 9/18/16 3:41 AM:
--
Tried to reproduce with {{stress}} on an AWS instance (16 CPUs, 32 GB mem, Ubuntu 14.04), but could not reproduce it after 5429 iterations either.

was (Author: haosd...@gmail.com): Tried to reproduce with {{stress}} on an AWS instance (16 CPUs, 32 GB mem, Ubuntu 14.04), but could not reproduce it either.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)