[jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky

2018-04-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431156#comment-16431156
 ] 

Andrei Budnik commented on MESOS-8489:
--

[https://reviews.apache.org/r/66404/]
 [https://reviews.apache.org/r/66474/]

From `man 7 cgroups`:
{code:java}
freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER)
       The freezer cgroup can suspend and restore (resume) all tasks in
       a cgroup.  Freezing a cgroup /A also causes its children, for
       example, tasks in /A/B, to be frozen.
{code}
I've double-checked the assertion above by installing Ubuntu 9.04, which runs
Linux kernel 2.6.28, and creating a nested freezer cgroup manually:
!ubuntu 9.04.png!
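
Since the man page ties the freezer controller to Linux 2.6.28, a version gate could look roughly like the sketch below. The helper name `kernelHasFreezer` is hypothetical and is not part of Mesos. The check only compares version numbers, so it cannot tell whether CONFIG_CGROUP_FREEZER was enabled in a given kernel build.

```cpp
#include <cstdio>
#include <string>
#include <tuple>

// Hypothetical helper (not a Mesos API): returns true if the kernel release
// string, e.g. "2.6.28-11-generic", names a version of at least 2.6.28.
// A missing patch level is treated as 0. This is a version comparison only;
// it does not verify that CONFIG_CGROUP_FREEZER is enabled.
bool kernelHasFreezer(const std::string& release) {
  int x = 0, y = 0, z = 0;

  // Require at least "major.minor" to be parseable.
  if (std::sscanf(release.c_str(), "%d.%d.%d", &x, &y, &z) < 2) {
    return false;
  }

  // Tuples compare lexicographically: major first, then minor, then patch.
  return std::make_tuple(x, y, z) >= std::make_tuple(2, 6, 28);
}
```

The release string would typically come from the `release` field filled in by `uname(2)`.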

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --
>
> Key: MESOS-8489
> URL: https://issues.apache.org/jira/browse/MESOS-8489
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: ROOT_IsolatorFlags-badrun3.txt, ubuntu 9.04.png
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before 
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
>  "test" cgroup
>  # recompile
>  # run `sudo GLOG_v=2 ./src/mesos-tests 
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first 
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
>  calling 
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
>  leads to calling 
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
>  to create a containerizer. An attempt to create a Mesos containerizer leads to 
> calling 
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
>  Finally, we reach the point where we try to create a ["test" 
> container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
>  So, the recovery process for the second slave [might 
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
>  this "test" container as an orphaned container.
> Thus, there is a race between the recovery process for the first slave and the 
> attempt to create a containerizer for the second agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky

2018-04-04 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425431#comment-16425431
 ] 

Andrei Budnik commented on MESOS-8489:
--

We have multiple race conditions between agents that run simultaneously in
tests. By default, we launch slaves using the same cgroup hierarchy. The Linux
launcher and some isolators call `cgroups::prepare()`, which creates and then
immediately removes the `mesos/test` cgroup to check whether the kernel
supports nested cgroups.

The first race condition is between `LinuxLauncher::create()` and
`LinuxLauncher::recover()`. The former calls `cgroups::prepare()` while the
latter iterates over the cgroups hierarchy to detect orphan containers. We
also call `destroy()` for detected orphan containers, which leads to a further
race condition.

The second race condition occurs when `cgroups::prepare()` is called in
parallel.

https://reviews.apache.org/r/66449/ fixes all of the above cases for the
`LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags` test.
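
Both races can be reproduced deterministically outside of Mesos with a minimal sketch like the one below. It uses a plain directory in place of a cgroup hierarchy, and `createProbe`, `removeProbe` and `findOrphans` are hypothetical stand-ins, not the real `cgroups::prepare()` or `LinuxLauncher::recover()` code. The probe is split into two steps so the window between creation and removal becomes visible.

```cpp
#include <dirent.h>
#include <sys/stat.h>
#include <unistd.h>

#include <set>
#include <string>

// Hypothetical stand-in for the first half of the probe: create the nested
// "test" directory under `root`. Returns false if it already exists, which
// is what a second, parallel probe would run into.
bool createProbe(const std::string& root) {
  return ::mkdir((root + "/test").c_str(), 0755) == 0;
}

// Second half of the probe: remove the "test" directory again.
bool removeProbe(const std::string& root) {
  return ::rmdir((root + "/test").c_str()) == 0;
}

// Hypothetical stand-in for the recovery scan: every subdirectory of `root`
// that is not listed in `known` is reported as an orphan.
std::set<std::string> findOrphans(
    const std::string& root, const std::set<std::string>& known) {
  std::set<std::string> orphans;

  DIR* dir = ::opendir(root.c_str());
  if (dir == nullptr) {
    return orphans;
  }

  while (dirent* entry = ::readdir(dir)) {
    const std::string name = entry->d_name;
    if (name == "." || name == "..") {
      continue;
    }

    struct stat s;
    if (::stat((root + "/" + name).c_str(), &s) == 0 &&
        S_ISDIR(s.st_mode) &&
        known.count(name) == 0) {
      orphans.insert(name);
    }
  }

  ::closedir(dir);
  return orphans;
}
```

If `findOrphans()` runs between `createProbe()` and `removeProbe()`, it reports `test` as an orphan, which corresponds to the first race. If `createProbe()` is called a second time before `removeProbe()`, the second call fails because the directory already exists, which corresponds to the second race.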






[jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky

2018-03-28 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417999#comment-16417999
 ] 

Gilbert Song commented on MESOS-8489:
-

[~abudnik], thanks for triaging. However, I don't think we understand this
issue deeply enough yet:
# The race description does not seem accurate enough to me. The race is
between the destruction of the first cluster::slave and the destruction of
orphan containers in the second slave's recovery path. We should reset the
Owned pointer before the next StartSlave() call. (This would fix the flakiness
in this unit test.)
# We need to understand why the nested *test* cgroup is still there when we
create the first slave, since removing it is just a simple os::rmdir(). This
is the trigger of the flakiness. The *test* cgroup is supposed to be created
and removed immediately. There might be a bug in cgroups::remove().
https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L485
# The nested *test* cgroup may no longer be needed, since it was a workaround
for old kernel versions. Could you investigate whether this is supported by
kernel versions later than 2.6? We may be able to remove this code and
document it (we still need to understand #2, though).
https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L461~#L488
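
The ordering issue behind point 1 can be illustrated with a small sketch. It uses `std::unique_ptr` as a stand-in for the Owned pointer and a hypothetical `FakeSlave` type that only logs its construction and destruction, so it shows the general lifetime pattern rather than the actual test code.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an agent: records when it starts and stops.
struct FakeSlave {
  FakeSlave(std::string name, std::vector<std::string>* log)
    : name_(std::move(name)), log_(log) {
    log_->push_back("start " + name_);
  }

  ~FakeSlave() { log_->push_back("stop " + name_); }

  std::string name_;
  std::vector<std::string>* log_;
};

// Assigning over a live pointer: the second agent is constructed while the
// first one still exists, so their lifetimes overlap.
std::vector<std::string> restartWithoutReset() {
  std::vector<std::string> log;
  std::unique_ptr<FakeSlave> slave(new FakeSlave("first", &log));
  slave = std::unique_ptr<FakeSlave>(new FakeSlave("second", &log));
  slave.reset();
  return log;
}

// Resetting first: the first agent is fully destroyed before the second one
// is constructed, so there is no overlap.
std::vector<std::string> restartWithReset() {
  std::vector<std::string> log;
  std::unique_ptr<FakeSlave> slave(new FakeSlave("first", &log));
  slave.reset();
  slave.reset(new FakeSlave("second", &log));
  slave.reset();
  return log;
}
```

Without the explicit reset, "start second" is logged before "stop first". With the reset, the first agent stops before the second one starts.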



