[jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
[ https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431156#comment-16431156 ]

Andrei Budnik commented on MESOS-8489:
--

[https://reviews.apache.org/r/66404/]
[https://reviews.apache.org/r/66474/]

From `man 7 cgroups`:
{code:java}
freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER)
The freezer cgroup can suspend and restore (resume) all tasks in a cgroup. Freezing a cgroup /A also causes its children, for example, tasks in /A/B, to be frozen.
{code}
I've double-checked this ^^ assertion by installing Ubuntu 9.04 which runs on Linux kernel 2.6.28 and creating a nested freezer cgroup manually:

!ubuntu 9.04.png!

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --
>
> Key: MESOS-8489
> URL: https://issues.apache.org/jira/browse/MESOS-8489
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Andrei Budnik
> Assignee: Andrei Budnik
> Priority: Major
> Labels: containerizer, flaky-test, mesosphere
> Attachments: ROOT_IsolatorFlags-badrun3.txt, ubuntu 9.04.png
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
> Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
> # Add {{::sleep(1);}} before [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483] "test" cgroup
> # recompile
> # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
> calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
> leads to calling [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer.
> An attempt to create a mesos c'zer, leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
> Finally, we get to the point, where we try to create a ["test" container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
> So, the recovery process for the second slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container.
> Thus, there is the race between recovery process for the first slave and an attempt to create a c'zer for the second agent.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
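
To make the quoted race description concrete, here is a minimal in-memory sketch. It is a model under stated assumptions, not Mesos code: `Hierarchy`, `probeCreate`, `probeRemove`, `findOrphans`, `scanDuringProbe` and `scanAfterProbe` are hypothetical names, and a `std::set` stands in for the shared cgroup hierarchy. It replays the interleaving in which one agent's recovery scan runs between the creation and the removal of the probe cgroup "test".

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// ASSUMPTION: in-memory stand-in for the children of the shared cgroup root.
struct Hierarchy {
  std::set<std::string> cgroups;
};

// ASSUMPTION: the probe is modelled as two separate steps so that a scan
// can be placed between them.
void probeCreate(Hierarchy& h) { h.cgroups.insert("test"); }
void probeRemove(Hierarchy& h) { h.cgroups.erase("test"); }

// ASSUMPTION: recovery is modelled as "every cgroup that is not a known
// container is an orphan".
std::vector<std::string> findOrphans(
    const Hierarchy& h, const std::set<std::string>& known) {
  std::vector<std::string> orphans;
  for (const std::string& name : h.cgroups) {
    if (known.count(name) == 0) {
      orphans.push_back(name);
    }
  }
  return orphans;
}

// Bad interleaving: the scan runs while the probe cgroup still exists.
std::vector<std::string> scanDuringProbe() {
  Hierarchy h;
  std::set<std::string> known;  // no known containers in this model
  probeCreate(h);               // second agent: containerizer creation
  std::vector<std::string> orphans = findOrphans(h, known);  // recovery scan
  probeRemove(h);               // second agent: probe cleanup
  assert(h.cgroups.empty());    // the probe cleaned up after itself
  return orphans;
}

// Benign interleaving: the scan runs after the probe has finished.
std::vector<std::string> scanAfterProbe() {
  Hierarchy h;
  std::set<std::string> known;
  probeCreate(h);
  probeRemove(h);
  return findOrphans(h, known);
}
```

In this model `scanDuringProbe()` reports "test" as an orphan while `scanAfterProbe()` reports nothing, which would be consistent with the `Failed to destroy containers: { test }` failure appearing only on some runs.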
[jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
[ https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425431#comment-16425431 ]

Andrei Budnik commented on MESOS-8489:
--

We have multiple race conditions between simultaneously running agents in tests. By default, we launch slaves using the same cgroup hierarchy. The Linux launcher and some isolators call `cgroups::prepare()`, which creates and then immediately removes the `mesos/test` cgroup to check whether the kernel supports nested cgroups.

The first race condition is between `LinuxLauncher::create()` and `LinuxLauncher::recover()`. The former calls `cgroups::prepare()` while the latter iterates over the cgroups hierarchy to detect orphan containers. We also call `destroy()` for the detected orphan containers, which leads to a race condition as well.

The second race condition happens when `cgroups::prepare()` is called in parallel.

https://reviews.apache.org/r/66449/ fixes all of the above cases for the `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags` test.
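
The second race mentioned in the comment above (parallel calls to `cgroups::prepare()`) can be illustrated with a small model. The comment does not spell out the exact failure mode, so this is only one plausible reading: two probes that share the name "test" interfere with each other's create and remove steps. All names here (`Hierarchy`, `makeCgroup`, `removeCgroup`, `ProbeResult`, `interleavedProbe`) are hypothetical, the hierarchy is an in-memory `std::set`, and giving each probe its own name is shown purely for contrast, not as what the linked review does.

```cpp
#include <cassert>
#include <set>
#include <string>

// ASSUMPTION: in-memory stand-in for the shared cgroup hierarchy.
struct Hierarchy {
  std::set<std::string> cgroups;
};

// Models mkdir semantics: fails (returns false) if the cgroup already exists.
bool makeCgroup(Hierarchy& h, const std::string& name) {
  return h.cgroups.insert(name).second;
}

// Models rmdir semantics: fails (returns false) if the cgroup is missing.
bool removeCgroup(Hierarchy& h, const std::string& name) {
  return h.cgroups.erase(name) == 1;
}

// What the second prober (B) observes.
struct ProbeResult {
  bool created;
  bool removed;
};

// Replays one fixed interleaving of two probes A and B:
// A creates, B creates, A removes, B removes.
ProbeResult interleavedProbe(
    const std::string& nameA, const std::string& nameB) {
  Hierarchy h;
  makeCgroup(h, nameA);
  ProbeResult b;
  b.created = makeCgroup(h, nameB);
  removeCgroup(h, nameA);
  b.removed = removeCgroup(h, nameB);
  assert(h.cgroups.empty());  // nothing is left behind in either case
  return b;
}
```

With a shared name, `interleavedProbe("test", "test")` makes both of B's steps fail even though nested cgroups work fine in the model. With distinct names, for example `interleavedProbe("test-a", "test-b")`, both steps succeed.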
[jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
[ https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417999#comment-16417999 ]

Gilbert Song commented on MESOS-8489:
--

[~abudnik], thanks for the triage. However, I think we did not understand this issue deeply enough:
# The race description does not seem accurate enough to me. The race is between the destruction of the first cluster::slave and the orphan container destroy in the second slave's recovery path. We should reset the Owned pointer first, before we call the next StartSlave(). (This would fix the flakiness in this unit test.)
# We need to understand why the nested *test* cgroup is still there when we create the first slave, since it is just a simple os::rmdir(). This is the trigger of the flakiness. The *test* cgroup is supposed to be created and removed immediately. There might be a bug in cgroup::remove(). https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L485
# The nested *test* cgroup may no longer be needed, since it was a workaround for old kernel versions. Could you investigate whether this is supported by kernel versions later than 2.6? We may be able to remove this code and document it (we still need to understand #2 though). https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L461~#L488

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --
>
> Key: MESOS-8489
> URL: https://issues.apache.org/jira/browse/MESOS-8489
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Andrei Budnik
> Assignee: Andrei Budnik
> Priority: Major
> Labels: containerizer, flaky-test, mesosphere
> Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
> Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
> # Add {{::sleep(1);}} before [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483] "test" cgroup
> # recompile
> # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
> calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
> leads to calling [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer.
> An attempt to create a mesos c'zer, leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
> Finally, we get to the point, where we try to create a ["test" container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
> So, the recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container.
> Thus, there is the race between recovery process for the first slave and an attempt to create a c'zer for the second agent.
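
The first point in the comment above (reset the owning pointer before starting the next agent) rests on an ordering detail that a short sketch can show. This is an illustration under assumptions, not the test's actual code: `std::unique_ptr` stands in for the `Owned` pointer, and `Agent`, `restartByAssignment` and `restartWithReset` are hypothetical names. Plain assignment constructs the second object before the first is destroyed, so the two lifetimes overlap, whereas an explicit `reset()` ends the first lifetime before the second begins.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a test agent; it records when it starts and stops.
class Agent {
public:
  Agent(const std::string& name, std::vector<std::string>* log)
    : name_(name), log_(log) {
    log_->push_back("start " + name_);
  }

  ~Agent() { log_->push_back("stop " + name_); }

private:
  std::string name_;
  std::vector<std::string>* log_;
};

// Plain assignment: the second agent is constructed while the first one is
// still alive, and only then is the first one destroyed.
std::vector<std::string> restartByAssignment() {
  std::vector<std::string> log;
  {
    std::unique_ptr<Agent> agent(new Agent("first", &log));
    agent = std::unique_ptr<Agent>(new Agent("second", &log));
  }
  return log;
}

// Explicit reset: the first agent is destroyed before the second one is
// constructed, so the two lifetimes never overlap.
std::vector<std::string> restartWithReset() {
  std::vector<std::string> log;
  {
    std::unique_ptr<Agent> agent(new Agent("first", &log));
    agent.reset();
    assert(!agent);  // nothing is owned between the two agents
    agent.reset(new Agent("second", &log));
  }
  return log;
}
```

`restartByAssignment()` logs "start first", "start second", "stop first", "stop second", while `restartWithReset()` logs "start first", "stop first", "start second", "stop second". Only the second ordering guarantees that teardown of the first agent is complete before anything belonging to the second agent begins.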