[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-10-03 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544162#comment-15544162
 ] 

haosdent commented on MESOS-6180:
-

Thanks a lot for [~greggomann]'s help!

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> RoleTest.ImplicitRoleRegister.txt, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-10-03 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543283#comment-15543283
 ] 

Greg Mann commented on MESOS-6180:
--

AFAICT, these errors are due to performance issues on the AWS instances we're 
using for our CI. I'm closing this ticket and the linked tickets for now.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> RoleTest.ImplicitRoleRegister.txt, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-21 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508943#comment-15508943
 ] 

haosdent commented on MESOS-6180:
-

Awesome! Thanks a lot for your help!

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> RoleTest.ImplicitRoleRegister.txt, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-20 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508838#comment-15508838
 ] 

Greg Mann commented on MESOS-6180:
--

Another common error seen when this issue manifests is:
{code}
Recovery failed: Failed to recover registrar: Failed to perform fetch within 
1mins
{code}
See the file {{RoleTest.ImplicitRoleRegister.txt}} for the full test log.

[~haosd...@gmail.com], there is a review 
[here|https://reviews.apache.org/r/41665/] proposing the {{in_memory}} registry 
for tests. I'm currently trying to figure out whether this is a legitimate bug 
or simply the result of an unreasonable load put on the machine.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> RoleTest.ImplicitRoleRegister.txt, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507560#comment-15507560
 ] 

haosdent commented on MESOS-6180:
-

Many thanks to [~greggomann] for helping me reproduce this on my AWS instance! The 
reason I couldn't reproduce it before is that I ran {{stress}} and {{mesos-tests}} 
on a separate disk, different from the root disk, so {{stress}} didn't affect the 
root filesystem Linux was using. If I run {{stress}} on the root disk and 
{{mesos-tests}} on the separate disk, the failures reproduce within a few test 
iterations.

A workaround is to set {{flags.registry = "in_memory"}} when running the tests; I 
have not reproduced the errors since using it. However, I now think these test 
failures should be expected, because the root filesystem cannot work normally 
under that load. Do you think we should use {{flags.registry = "in_memory"}}, or 
just ignore these failures? cc [~jieyu] [~vinodkone] [~kaysoky]
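
For reference, a minimal sketch of what that workaround looks like in a test, 
assuming the standard {{MesosTest}} fixture helpers ({{CreateMasterFlags()}}, 
{{StartMaster()}}); this is illustrative only, not an actual patch:

{code}
// Start the test master with the in-memory registry instead of the
// replicated log, so registrar operations never touch leveldb on disk.
master::Flags masterFlags = CreateMasterFlags();
masterFlags.registry = "in_memory";

Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
ASSERT_SOME(master);
{code}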

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500170#comment-15500170
 ] 

haosdent commented on MESOS-6180:
-

I tried to reproduce this with {{stress}} on an AWS instance (16 CPUs, 32 GB 
memory, Ubuntu 14.04), but could not reproduce it there either.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499226#comment-15499226
 ] 

haosdent commented on MESOS-6180:
-

| Wait for the agent to finish registering in the test case. | 
https://reviews.apache.org/r/51985/ |
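
For context, the usual pattern behind a change like that (a sketch only, assuming 
the libprocess gtest helpers and the test's existing {{detector}} and 
{{slaveFlags}}; not the exact content of the review) is to wait for the agent's 
registration message before the test proceeds:

{code}
// Block the test until the agent has finished registering with the master,
// so the first offer cannot race against the rest of the test setup.
Future<SlaveRegisteredMessage> slaveRegisteredMessage =
  FUTURE_PROTOBUF(SlaveRegisteredMessage(), _, _);

Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
ASSERT_SOME(slave);

AWAIT_READY(slaveRegisteredMessage);
{code}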

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499128#comment-15499128
 ] 

haosdent commented on MESOS-6180:
-

It looks like it is blocked at {{ReplicaProcess::write}}, which needs to read from 
leveldb. I have not yet reproduced this with {{stress}} after 3000 iterations on 
either a physical machine (4 CPUs, 32 GB memory, Ubuntu 14.04) or my local 
VirtualBox VM (4 CPUs, 8 GB memory).



> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497951#comment-15497951
 ] 

Greg Mann commented on MESOS-6180:
--

Thanks for the patch to address the mount leak [~jieyu]! 
(https://reviews.apache.org/r/51963/)

I ran {{sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 
GTEST_FILTER="*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace*" 
bin/mesos-tests.sh}} and stressed my machine with {{stress -c N -i N -m N -d 
1}}, where {{N}} is the number of cores, and I was able to reproduce a couple of 
these offer future timeout failures after a few tens of repetitions. I attached 
logs above as {{flaky-containerizer-pid-namespace-forward.txt}} and 
{{flaky-containerizer-pid-namespace-backward.txt}}.

We can see the master beginning agent registration, but we never see the line 
{{Registered agent ...}} from {{Master::_registerSlave()}}, which indicates 
that registration is complete and the registered message has been sent to the 
agent:
{code}
I0917 01:35:17.184216   480 master.cpp:4886] Registering agent at 
slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) with 
id fa7a42d0-5d0c-4799-b19f-2a85b43039f3-S0
I0917 01:35:17.184232   474 process.cpp:2707] Resuming 
__reaper__(1)@172.31.1.104:57341 at 2016-09-17 01:35:17.184222976+00:00
I0917 01:35:17.184377   474 process.cpp:2707] Resuming 
registrar(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184371968+00:00
I0917 01:35:17.184554   474 registrar.cpp:464] Applied 1 operations in 79217ns; 
attempting to update the registry
I0917 01:35:17.184953   474 process.cpp:2697] Spawned process 
__latch__(141)@172.31.1.104:57341
I0917 01:35:17.184990   485 process.cpp:2707] Resuming 
log-storage(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184982016+00:00
I0917 01:35:17.185561   485 process.cpp:2707] Resuming 
log-writer(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185552896+00:00
I0917 01:35:17.185609   485 log.cpp:577] Attempting to append 434 bytes to the 
log
I0917 01:35:17.185804   485 process.cpp:2707] Resuming 
log-coordinator(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185797888+00:00
I0917 01:35:17.185863   485 coordinator.cpp:348] Coordinator attempting to 
write APPEND action at position 3
I0917 01:35:17.185998   485 process.cpp:2697] Spawned process 
log-write(29)@172.31.1.104:57341
I0917 01:35:17.186030   475 process.cpp:2707] Resuming 
log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186021888+00:00
I0917 01:35:17.186189   475 process.cpp:2707] Resuming 
log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186182912+00:00
I0917 01:35:17.186275   475 process.cpp:2707] Resuming 
log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186267904+00:00
I0917 01:35:17.186424   475 process.cpp:2707] Resuming 
log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186416896+00:00
I0917 01:35:17.186575   475 process.cpp:2697] Spawned process 
__req_res__(55)@172.31.1.104:57341
I0917 01:35:17.186724   475 process.cpp:2707] Resuming 
log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186717952+00:00
I0917 01:35:17.186609   485 process.cpp:2707] Resuming 
__req_res__(55)@172.31.1.104:57341 at 2016-09-17 01:35:17.186601984+00:00
I0917 01:35:17.186898   485 process.cpp:2707] Resuming 
log-replica(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186892032+00:00
I0917 01:35:17.186962   485 replica.cpp:537] Replica received write request for 
position 3 from __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.185014   471 process.cpp:2707] Resuming 
__gc__@172.31.1.104:57341 at 2016-09-17 01:35:17.185008896+00:00
I0917 01:35:17.185036   480 process.cpp:2707] Resuming 
__latch__(141)@172.31.1.104:57341 at 2016-09-17 01:35:17.185029120+00:00
I0917 01:35:17.196358   482 process.cpp:2707] Resuming 
slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.196335104+00:00
I0917 01:35:17.196900   482 slave.cpp:1471] Will retry registration in 
25.224033ms if necessary
I0917 01:35:17.197029   482 process.cpp:2707] Resuming 
master@172.31.1.104:57341 at 2016-09-17 01:35:17.197024000+00:00
I0917 01:35:17.197157   482 master.cpp:4874] Ignoring register agent message 
from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) 
as admission is already in progress
I0917 01:35:17.224309   482 process.cpp:2707] Resuming 
slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.224284928+00:00
I0917 01:35:17.224845   482 slave.cpp:1471] Will retry registration in 
63.510932ms if necessary
I0917 01:35:17.224900   475 process.cpp:2707] Resuming 
master@172.31.1.104:57341 at 2016-09-17 01:35:17.224888064+00:00
I0917 01:35:17.225109   475 master.cpp:4874] Ignoring register agent message 
from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) 
as admission is already in progress
{code}

> Several tests are flaky, with futures timing out early
> 

[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-16 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496966#comment-15496966
 ] 

Vinod Kone commented on MESOS-6180:
---

Looking at {{CGROUPS_ROOT_PidNamespaceForward}}, the TASK_LOST is expected because 
the test doesn't wait for the TASK_RUNNING update before terminating the agent.

{quote}
  Future<Message> registerExecutorMessage =
FUTURE_MESSAGE(Eq(RegisterExecutorMessage().GetTypeName()), _, _);

  driver.launchTasks(offers1.get()[0].id(), {task1});

  AWAIT_READY(registerExecutorMessage);

  Future<hashset<ContainerID>> containers = containerizer->containers();
  AWAIT_READY(containers);
  EXPECT_EQ(1u, containers.get().size());

  ContainerID containerId = *(containers.get().begin());

  // Stop the slave.
  slave.get()->terminate();

{quote}
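
If the test did want to avoid the TASK_LOST, the usual pattern would be to wait 
for the TASK_RUNNING update before terminating the agent. A sketch, assuming the 
test's existing {{sched}}, {{driver}}, {{offers1}}, and {{task1}} (illustrative 
only, not an actual patch):

{code}
// Wait for the TASK_RUNNING update before terminating the agent, so the task
// cannot be reported lost simply because the agent was stopped too early.
Future<TaskStatus> statusRunning;
EXPECT_CALL(sched, statusUpdate(&driver, _))
  .WillOnce(FutureArg<1>(&statusRunning));

driver.launchTasks(offers1.get()[0].id(), {task1});

AWAIT_READY(statusRunning);
EXPECT_EQ(TASK_RUNNING, statusRunning->state());

// Only now stop the agent.
slave.get()->terminate();
{code}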

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-16 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496724#comment-15496724
 ] 

haosdent commented on MESOS-6180:
-

Yep, the order of the log you mentioned is correct as well.

Let's split it into stdout and stderr.

{code:title=grep -v 'W:' (stdout)|borderStyle=solid}
[02:57:42] : [Step 10/10] [ RUN  ] 
SlaveRecoveryTest/0.ReconnectHTTPExecutor
[02:57:43] : [Step 10/10] Received SUBSCRIBED event
[02:57:43] : [Step 10/10] Subscribed executor on 
ip-172-30-2-23.mesosphere.io
[02:57:43] : [Step 10/10] Received LAUNCH event
[02:57:43] : [Step 10/10] Starting task c1ba3f0b-2f6a-46a1-b752-592394c6d726
[02:57:43] : [Step 10/10] 
/mnt/teamcity/work/4240ba9ddd0997c3/build/src/mesos-containerizer launch 
--command="{"shell":true,"value":"sleep 1000"}" --help="false" 
--unshare_namespace_mnt="false"
[02:57:43] : [Step 10/10] Forked command at 4653
[02:57:43] : [Step 10/10] Received ERROR event
[02:57:43] : [Step 10/10] Received ERROR event
[02:57:58] : [Step 10/10] ../../src/tests/slave_recovery_tests.cpp:510: 
Failure
[02:57:58] : [Step 10/10] Failed to wait 15secs for status
[02:57:58] : [Step 10/10] ../../src/tests/slave_recovery_tests.cpp:491: 
Failure
[02:57:58] : [Step 10/10] Actual function call count doesn't match 
EXPECT_CALL(sched, statusUpdate(_, _))...
[02:57:58] : [Step 10/10]  Expected: to be called at least once
[02:57:58] : [Step 10/10]Actual: never called - unsatisfied and 
active
[02:58:13] : [Step 10/10] ../../src/tests/cluster.cpp:560: Failure
[02:58:13] : [Step 10/10] Failed to wait 15secs for wait
[02:59:18] : [Step 10/10] [  FAILED  ] 
SlaveRecoveryTest/0.ReconnectHTTPExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer (95963 ms)
{code}

{code:title=grep 'W:' (stdout - 
SlaveRecoveryTest/0.RecoverStatusUpdateManager)|borderStyle=solid}
[02:59:18]W: [Step 10/10] I0915 02:57:42.726838 24222 
hierarchical.cpp:1770] No inverse offers to send out!
[02:59:18]W: [Step 10/10] I0915 02:57:42.726851 24222 
hierarchical.cpp:1271] Performed allocation for 1 agents in 80513ns
[02:59:18]W: [Step 10/10] I0915 02:57:42.929819 24218 slave.cpp:3521] 
Cleaning up un-reregistered executors
[02:59:18]W: [Step 10/10] I0915 02:57:42.929872 24218 slave.cpp:5197] 
Finished recovery
[02:59:18]W: [Step 10/10] I0915 02:57:42.930137 24218 slave.cpp:5369] 
Querying resource estimator for oversubscribable resources
[02:59:18]W: [Step 10/10] I0915 02:57:42.930229 24220 slave.cpp:5383] 
Received oversubscribable resources  from the resource estimator
[02:59:18]W: [Step 10/10] I0915 02:57:42.930289 24220 slave.cpp:911] New 
master detected at master@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930301 24220 slave.cpp:970] 
Authenticating with master master@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930315 24220 slave.cpp:981] Using 
default CRAM-MD5 authenticatee
[02:59:18]W: [Step 10/10] I0915 02:57:42.930336 24217 
status_update_manager.cpp:177] Pausing sending status updates
[02:59:18]W: [Step 10/10] I0915 02:57:42.930364 24220 slave.cpp:943] 
Detecting new master
[02:59:18]W: [Step 10/10] I0915 02:57:42.930382 24217 
authenticatee.cpp:121] Creating new client SASL connection
[02:59:18]W: [Step 10/10] I0915 02:57:42.930631 24216 master.cpp:6234] 
Authenticating slave(353)@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930697 24219 
authenticator.cpp:414] Starting authentication session for 
crammd5-authenticatee(755)@172.30.2.23:32968
[02:59:18]W: [Step 10/10] I0915 02:57:42.930804 24218 authenticator.cpp:98] 
Creating new server SASL connection
[02:59:18]W: [Step 10/10] I0915 02:57:42.930964 24218 
authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
[02:59:18]W: [Step 10/10] I0915 02:57:42.930977 24218 
authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
[02:59:18]W: [Step 10/10] I0915 02:57:42.931010 24218 
authenticator.cpp:204] Received SASL authentication start
[02:59:18]W: [Step 10/10] I0915 02:57:42.931037 24218 
authenticator.cpp:326] Authentication requires more steps
[02:59:18]W: [Step 10/10] I0915 02:57:42.931064 24218 
authenticatee.cpp:259] Received SASL authentication step
[02:59:18]W: [Step 10/10] I0915 02:57:42.931098 24218 
authenticator.cpp:232] Received SASL authentication step
[02:59:18]W: [Step 10/10] I0915 02:57:42.931109 24218 auxprop.cpp:109] 
Request to lookup properties for user: 'test-principal' realm: 
'ip-172-30-2-23.mesosphere.io' server FQDN: 'ip-172-30-2-23.mesosphere.io' 
SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
SASL_AUXPROP_AUTHZID: false
[02:59:18]W: [Step 10/10] I0915 02:57:42.931114 24218 auxprop.cpp:181] 
Looking up auxiliary property '*userPassword'

[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496662#comment-15496662
 ] 

Greg Mann commented on MESOS-6180:
--

Thanks for the patches, [~haosd...@gmail.com]!! I'll review and do some testing 
this morning.

Regarding the interleaving: for example, in the log posted in MESOS-6164 we 
find the line:
{code}
Checkpointing framework pid 
'scheduler-26d5bb2d-7233-4725-9755-169f84aee769@172.30.2.23:32968' to 
'/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_RecoverStatusUpdateManager_w0ToCt/meta/slaves/d22b6309-24c3-422f-a501-a672e7c3e046-S0/frameworks/d22b6309-24c3-422f-a501-a672e7c3e046-/framework.pid'
{code}
which indicates that this output can be attributed to 
{{SlaveRecoveryTest.RecoverStatusUpdateManager}}. I think 
{{SlaveRecoveryTest.ReconnectHTTPExecutor}} begins much later with the line: 
{{I0915 02:57:42.981866 24202 cluster.cpp:157] Creating default 'local' 
authorizer}}.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-16 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495932#comment-15495932
 ] 

haosdent commented on MESOS-6180:
-

[~greggomann] I used {{grep 'W:'}} and {{grep -v 'W:'}} to separate the 
stdout/stderr in the logs of MESOS-6164, MESOS-6165, and MESOS-6166. It looks like 
their logs are not overlapping. Do you have some examples of overlap that don't 
match this?

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-15 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495249#comment-15495249
 ] 

haosdent commented on MESOS-6180:
-

It looks like this is related to {{namespaces/pid}} leaking files; I could 
reproduce some of the failures after running the tests on repeat. Let me try to 
fix this.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-15 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495213#comment-15495213
 ] 

Jie Yu commented on MESOS-6180:
---

This test looks suspicious to me. The log interleaving starts from there. The 
TASK_LOST is not expected. 
{noformat}
[23:32:52] : [Step 10/10] [ RUN  ] 
MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
[23:32:52]W: [Step 10/10] I0915 23:32:52.347380 29518 cluster.cpp:157] 
Creating default 'local' authorizer
[23:32:52]W: [Step 10/10] I0915 23:32:52.350111 29518 leveldb.cpp:174] 
Opened db in 2.618094ms
[23:32:52]W: [Step 10/10] I0915 23:32:52.350518 29518 leveldb.cpp:181] 
Compacted db in 390273ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350536 29518 leveldb.cpp:196] 
Created db iterator in 3479ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350543 29518 leveldb.cpp:202] 
Seeked to beginning of db in 464ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350548 29518 leveldb.cpp:271] 
Iterated through 0 keys in the db in 364ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.350558 29518 replica.cpp:776] 
Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
[23:32:52]W: [Step 10/10] I0915 23:32:52.350740 29532 recover.cpp:451] 
Starting replica recovery
[23:32:52]W: [Step 10/10] I0915 23:32:52.350931 29533 recover.cpp:477] 
Replica is in EMPTY status
[23:32:52]W: [Step 10/10] I0915 23:32:52.351176 29536 replica.cpp:673] 
Replica in EMPTY status received a broadcasted recover request from 
__req_res__(4863)@172.30.2.144:39560
[23:32:52]W: [Step 10/10] I0915 23:32:52.351282 29534 recover.cpp:197] 
Received a recover response from a replica in EMPTY status
[23:32:52]W: [Step 10/10] I0915 23:32:52.351387 29537 recover.cpp:568] 
Updating replica status to STARTING
[23:32:52]W: [Step 10/10] I0915 23:32:52.351835 29535 master.cpp:380] 
Master b8554850-0e42-40dd--58d6c6f19074 
(ip-172-30-2-144.ec2.internal.mesosphere.io) started on 172.30.2.144:39560
[23:32:52]W: [Step 10/10] I0915 23:32:52.351847 29535 master.cpp:382] Flags 
at startup: --acls="" --agent_ping_timeout="15secs" 
--agent_reregister_timeout="10mins" --allocation_interval="1secs" 
--allocator="HierarchicalDRF" --authenticate_agents="true" 
--authenticate_frameworks="true" --authenticate_http_frameworks="true" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/8wMNif/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/8wMNif/master" --zk_session_timeout="10secs"
[23:32:52]W: [Step 10/10] I0915 23:32:52.351948 29535 master.cpp:432] 
Master only allowing authenticated frameworks to register
[23:32:52]W: [Step 10/10] I0915 23:32:52.351954 29535 master.cpp:446] 
Master only allowing authenticated agents to register
[23:32:52]W: [Step 10/10] I0915 23:32:52.351958 29535 master.cpp:459] 
Master only allowing authenticated HTTP frameworks to register
[23:32:52]W: [Step 10/10] I0915 23:32:52.351963 29535 credentials.hpp:37] 
Loading credentials for authentication from '/tmp/8wMNif/credentials'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352077 29535 master.cpp:504] Using 
default 'crammd5' authenticator
[23:32:52]W: [Step 10/10] I0915 23:32:52.352133 29535 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352217 29535 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352254 29535 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
[23:32:52]W: [Step 10/10] I0915 23:32:52.352289 29535 master.cpp:584] 
Authorization enabled
[23:32:52]W: [Step 10/10] I0915 23:32:52.352322 29537 leveldb.cpp:304] 
Persisting metadata (8 bytes) to leveldb took 841411ns
[23:32:52]W: [Step 10/10] I0915 23:32:52.352339 29537 replica.cpp:320] 
Persisted replica status to STARTING
[23:32:52]W: [Step 10/10] I0915 23:32:52.352345 29533 
whitelist_watcher.cpp:77] No whitelist given
[23:32:52]W: [Step 10/10] I0915 23:32:52.352377 29539 hierarchical.cpp:149] 
Initialized hierarchical allocator process
[23:32:52]W: [Step 10/10] I0915 

[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-15 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495187#comment-15495187
 ] 

Greg Mann commented on MESOS-6180:
--

[~haosd...@gmail.com]: unfortunately I think that there is some interleaving 
going on in these logs, sorry :( I need to sort through the log output and make 
sure I've matched up the logs with the correct test cases. It's possible that 
I've attributed the failures to the incorrect tests; I'll comment here and post 
revised logs when I have it sorted out.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-15 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495158#comment-15495158
 ] 

Greg Mann commented on MESOS-6180:
--

Thanks [~haosd...@gmail.com]! There are logs for a few of the cases in the 
following tickets: MESOS-6164, MESOS-6165, and MESOS-6166.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-15 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495137#comment-15495137
 ] 

haosdent commented on MESOS-6180:
-

[~greggomann] Would you provide a log as an example, since they have similar 
errors?

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)