[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-19 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497951#comment-15497951
 ] 

Greg Mann edited comment on MESOS-6180 at 9/19/16 8:38 PM:
---

Thanks for the patch to address the mount leak [~jieyu]! 
(https://reviews.apache.org/r/51963/)

I ran {{sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 GTEST_FILTER="\*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace\*" bin/mesos-tests.sh}} and stressed my machine with {{stress -c N -i N -m N -d 1}}, where {{N}} is the number of cores, and I was able to reproduce a couple of these offer-future timeout failures after a few tens of repetitions. I attached the logs above as {{flaky-containerizer-pid-namespace-forward.txt}} and {{flaky-containerizer-pid-namespace-backward.txt}}.
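
In shell form, that reproduction recipe looks roughly like the sketch below (a sketch only, assuming a Linux build tree with {{bin/mesos-tests.sh}}, GNU {{nproc}}, and the {{stress}} tool installed):
{code}
# Start the stressors in the background: one CPU, I/O, and memory worker per
# core, plus a single disk-write worker, as in the command above.
N=$(nproc)
stress -c "$N" -i "$N" -m "$N" -d 1 &

# Repeat the filtered test indefinitely with verbose Mesos/glog output,
# breaking out on the first failure.
sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 \
  GTEST_FILTER="*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace*" \
  bin/mesos-tests.sh

# Stop the background stressors afterwards.
kill %1
{code}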

We can see the master beginning agent registration, but we never see the line 
{{Registered agent ...}} from {{Master::_registerSlave()}}, which indicates 
that registration is complete and the registered message has been sent to the 
agent:
{code}
I0917 01:35:17.184216   480 master.cpp:4886] Registering agent at slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) with id fa7a42d0-5d0c-4799-b19f-2a85b43039f3-S0
I0917 01:35:17.184232   474 process.cpp:2707] Resuming __reaper__(1)@172.31.1.104:57341 at 2016-09-17 01:35:17.184222976+00:00
I0917 01:35:17.184377   474 process.cpp:2707] Resuming registrar(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184371968+00:00
I0917 01:35:17.184554   474 registrar.cpp:464] Applied 1 operations in 79217ns; attempting to update the registry
I0917 01:35:17.184953   474 process.cpp:2697] Spawned process __latch__(141)@172.31.1.104:57341
I0917 01:35:17.184990   485 process.cpp:2707] Resuming log-storage(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184982016+00:00
I0917 01:35:17.185561   485 process.cpp:2707] Resuming log-writer(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185552896+00:00
I0917 01:35:17.185609   485 log.cpp:577] Attempting to append 434 bytes to the log
I0917 01:35:17.185804   485 process.cpp:2707] Resuming log-coordinator(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185797888+00:00
I0917 01:35:17.185863   485 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 3
I0917 01:35:17.185998   485 process.cpp:2697] Spawned process log-write(29)@172.31.1.104:57341
I0917 01:35:17.186030   475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186021888+00:00
I0917 01:35:17.186189   475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186182912+00:00
I0917 01:35:17.186275   475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186267904+00:00
I0917 01:35:17.186424   475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186416896+00:00
I0917 01:35:17.186575   475 process.cpp:2697] Spawned process __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.186724   475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186717952+00:00
I0917 01:35:17.186609   485 process.cpp:2707] Resuming __req_res__(55)@172.31.1.104:57341 at 2016-09-17 01:35:17.186601984+00:00
I0917 01:35:17.186898   485 process.cpp:2707] Resuming log-replica(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186892032+00:00
I0917 01:35:17.186962   485 replica.cpp:537] Replica received write request for position 3 from __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.185014   471 process.cpp:2707] Resuming __gc__@172.31.1.104:57341 at 2016-09-17 01:35:17.185008896+00:00
I0917 01:35:17.185036   480 process.cpp:2707] Resuming __latch__(141)@172.31.1.104:57341 at 2016-09-17 01:35:17.185029120+00:00
I0917 01:35:17.196358   482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.196335104+00:00
I0917 01:35:17.196900   482 slave.cpp:1471] Will retry registration in 25.224033ms if necessary
I0917 01:35:17.197029   482 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.197024000+00:00
I0917 01:35:17.197157   482 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress
I0917 01:35:17.224309   482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.224284928+00:00
I0917 01:35:17.224845   482 slave.cpp:1471] Will retry registration in 63.510932ms if necessary
I0917 01:35:17.224900   475 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.224888064+00:00
I0917 01:35:17.225109   475 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress
{code}
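
A quick way to double-check this against the attached logs is to grep for both messages; a minimal sketch, assuming the attachment filenames above:
{code}
# "Registering agent ..." marks the start of admission; "Registered agent ..."
# (logged from Master::_registerSlave(), as noted above) would mark completion.
grep -E "Registering agent|Registered agent" \
  flaky-containerizer-pid-namespace-forward.txt \
  flaky-containerizer-pid-namespace-backward.txt
{code}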


was (Author: greggomann):
Thanks for the patch to 

[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-19 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500170#comment-15500170
 ] 

haosdent edited comment on MESOS-6180 at 9/19/16 4:13 PM:
--

Tried to reproduce with stress on an AWS instance (16 CPUs, 30 GB mem, Ubuntu 14.04, c4.4xlarge), but could not reproduce it after 5429 iterations either.


was (Author: haosd...@gmail.com):
Tried to reproduce with stress on an AWS instance (16 CPUs, 32 GB mem, Ubuntu 14.04), but could not reproduce it after 5429 iterations either.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.





[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500170#comment-15500170
 ] 

haosdent edited comment on MESOS-6180 at 9/18/16 3:41 AM:
--

Tried to reproduce with stress on an AWS instance (16 CPUs, 32 GB mem, Ubuntu 14.04), but could not reproduce it after 5429 iterations either.


was (Author: haosd...@gmail.com):
Tried to reproduce with stress on an AWS instance (16 CPUs, 32 GB mem, Ubuntu 14.04), but could not reproduce it either.
