[jira] [Comment Edited] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos
[ https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585253#comment-17585253 ] Qian Zhang edited comment on MESOS-10234 at 8/26/22 9:09 AM: - According to [https://blogs.apache.org/security/entry/cve-2021-44228], it seems ZooKeeper is not affected by [CVE-2021-44228|https://www.cve.org/CVERecord?id=CVE-2021-44228]. was (Author: qianzhang): According to [https://blogs.apache.org/security/entry/cve-2021-44228], it seems ZooKeeper is not affected by [CVE-2021-44228|https://www.cve.org/CVERecord?id=CVE-2021-44228]. > CVE-2021-44228 Log4j vulnerability for apache mesos > --- > > Key: MESOS-10234 > URL: https://issues.apache.org/jira/browse/MESOS-10234 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.11.0 >Reporter: Sangita Nalkar >Priority: Critical > > Hi, > Wanted to know if CVE-2021-44228 Log4j vulnerability is affecting Apache > mesos. > We see that log4j v1.2.17 is used while building apache mesos from source. > Snippet from build logs: > std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF > jvm/org/apache/.deps/libjava_la-log4j.Tpo -c > ../../src/jvm/org/apache/log4j.cpp -fPIC -DPIC -o > jvm/org/apache/.libs/libjava_la-log4j.o > Thanks, > Sangita -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos
[ https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585253#comment-17585253 ] Qian Zhang commented on MESOS-10234: According to [https://blogs.apache.org/security/entry/cve-2021-44228], it seems ZooKeeper is not affected by [CVE-2021-44228|https://www.cve.org/CVERecord?id=CVE-2021-44228]. > CVE-2021-44228 Log4j vulnerability for apache mesos > --- > > Key: MESOS-10234 > URL: https://issues.apache.org/jira/browse/MESOS-10234 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.11.0 >Reporter: Sangita Nalkar >Priority: Critical > > Hi, > Wanted to know if CVE-2021-44228 Log4j vulnerability is affecting Apache > mesos. > We see that log4j v1.2.17 is used while building apache mesos from source. > Snippet from build logs: > std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF > jvm/org/apache/.deps/libjava_la-log4j.Tpo -c > ../../src/jvm/org/apache/log4j.cpp -fPIC -DPIC -o > jvm/org/apache/.libs/libjava_la-log4j.o > Thanks, > Sangita -- This message was sent by Atlassian Jira (v8.20.10#820010)
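For context on the version distinction drawn above: CVE-2021-44228 exploits the `JndiLookup` class that ships only with Log4j 2.x, while the log4j v1.2.17 seen in the Mesos build logs predates that code path entirely (1.x has its own, separate advisories). As an illustrative first-pass triage helper (not part of Mesos; the function name is invented here), one can scan a JAR for the vulnerable class:

```python
import zipfile

# CVE-2021-44228 lives in Log4j 2's JNDI lookup; Log4j 1.2 does not ship
# this class. This helper only checks for the class's presence, which is
# a common triage step -- presence alone does not prove exploitability,
# since patched 2.x releases may still bundle a hardened copy.
VULNERABLE_CLASS = "org/apache/logging/log4j/core/lookup/JndiLookup.class"


def jar_has_jndi_lookup(jar_path):
    """Return True if the JAR bundles Log4j 2's JndiLookup class."""
    with zipfile.ZipFile(jar_path) as jar:
        return VULNERABLE_CLASS in jar.namelist()
```

Running this over the JARs on a deployment's classpath flags any bundled Log4j 2 copies worth inspecting further; version checking is deliberately omitted from the sketch.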
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367313#comment-17367313 ] Qian Zhang commented on MESOS-10224: [~surahman] I think it has been fixed in this PR ([https://github.com/apache/mesos/pull/384]) by [~cf.natali] recently. Can you please get the latest Mesos code and try again? > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any prerequisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > set up correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366893#comment-17366893 ] Qian Zhang edited comment on MESOS-10224 at 6/21/21, 11:34 PM: --- [~surahman] Thanks for reporting the issue! Can you please run the following command to get the detailed error messages for the failed tests? {code:java} sudo ./bin/mesos-tests.sh --gtest_filter="" --verbose{code} was (Author: qianzhang): [~surahman] Thanks for reporting the issue! Can you please run the following command to get the detailed error messages for the failed test? {code:java} sudo ./bin/mesos-tests.sh --gtest_filter="" --verbose{code} > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366893#comment-17366893 ] Qian Zhang commented on MESOS-10224: [~surahman] Thanks for reporting the issue! Can you please run the following command to get the detailed error messages for the failed test? {code:java} sudo ./bin/mesos-tests.sh --gtest_filter="" --verbose{code} > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
[ https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10222: -- Assignee: Charles Natali Resolution: Fixed > Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses > --- > > Key: MESOS-10222 > URL: https://issues.apache.org/jira/browse/MESOS-10222 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Martin Tzvetanov Grigorov >Assignee: Charles Natali >Priority: Minor > Attachments: config.log > > > I am trying to build Mesos master but it fails with: > > {code:java} > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38, > from > ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12, > from ../../3rdparty/stout/include/stout/uuid.hpp:21, > from ../../include/mesos/type_utils.hpp:36, > from ../../src/master/flags.cpp:18: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27, > from 
../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30, > from ../../3rdparty/stout/include/stout/numify.hpp:19, > from ../../3rdparty/stout/include/stout/duration.hpp:29, > from ../../3rdparty/libprocess/include/process/time.hpp:18, > from ../../3rdparty/libprocess/include/process/clock.hpp:18, > from ../../3rdparty/libprocess/include/process/future.hpp:29, > from > ../../include/mesos/authentication/secret_generator.hpp:22, > from ../../src/local/local.cpp:24: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11, > from ../../include/mesos/resources.hpp:27, > from ../../src/master/master.hpp:31, > from ../../src/master/framework.cpp:17: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from
[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
[ https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360932#comment-17360932 ] Qian Zhang commented on MESOS-10222: Not yet, we are still working on [https://github.com/apache/mesos/pull/392]. > Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses > --- > > Key: MESOS-10222 > URL: https://issues.apache.org/jira/browse/MESOS-10222 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Martin Tzvetanov Grigorov >Priority: Minor > Attachments: config.log > > > I am trying to build Mesos master but it fails with: > > {code:java} > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38, > from > ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12, > from ../../3rdparty/stout/include/stout/uuid.hpp:21, > from ../../include/mesos/type_utils.hpp:36, > from ../../src/master/flags.cpp:18: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from > 
../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27, > from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30, > from ../../3rdparty/stout/include/stout/numify.hpp:19, > from ../../3rdparty/stout/include/stout/duration.hpp:29, > from ../../3rdparty/libprocess/include/process/time.hpp:18, > from ../../3rdparty/libprocess/include/process/clock.hpp:18, > from ../../3rdparty/libprocess/include/process/future.hpp:29, > from > ../../include/mesos/authentication/secret_generator.hpp:22, > from ../../src/local/local.cpp:24: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11, > from ../../include/mesos/resources.hpp:27, > from ../../src/master/master.hpp:31, > from ../../src/master/framework.cpp:17: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from
[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
[ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360928#comment-17360928 ] Qian Zhang commented on MESOS-8400: --- I see there are still two patches not merged yet: [https://reviews.apache.org/r/71384] [https://reviews.apache.org/r/71385] [~bbannier] Can you please comment? Do we still need these two patches? > Handle plugin crashes gracefully in SLRP recovery. > -- > > Key: MESOS-8400 > URL: https://issues.apache.org/jira/browse/MESOS-8400 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Priority: Blocker > Labels: mesosphere, mesosphere-dss-post-ga, storage > > When a CSI plugin crashes, the container daemon in SLRP will reset its > corresponding {{csi::Client}} service future. However, if a CSI call races > with a plugin crash, the call may be issued before the service future is > reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses > this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP > recovery path, e.g., {{ListVolume}}, {{GetCapacity}}, {{Probe}}, could make > the SLRP unrecoverable. > There are two main issues: > 1. For {{Probe}}, we should investigate if it is needed to make a few retry > attempts, then after that, we should recover from failed attempts (e.g., kill > the plugin container), then make the container daemon relaunch the plugin > instead of failing the daemon. > 2. For other calls in the recovery path, we should either retry the call, or > make the local resource provider daemon be able to restart the SLRP after it > fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cache
[ https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354466#comment-17354466 ] Qian Zhang commented on MESOS-10220: Resolved by https://github.com/apache/mesos/pull/384. > ldcache::parse failed to parse newer ld.so.cache > > > Key: MESOS-10220 > URL: https://issues.apache.org/jira/browse/MESOS-10220 > Project: Mesos > Issue Type: Bug >Reporter: Minh H.G. >Assignee: Charles Natali >Priority: Minor > > In glibc 2.31, the ld.so.cache file no longer supports the old format (the one > starting with "ld.so-1.7.0"). > That causes ldcache::parse to fail, and Mesos cannot start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
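The format change behind MESOS-10220 comes down to the cache header: old-format caches begin with the magic string `ld.so-1.7.0`, while caches written by newer glibc begin with `glibc-ld.so.cache` followed by a version such as `1.1` (magic strings per glibc's `dl-cache.h`). A minimal sketch of the detection a fixed parser needs to perform; the real code is C++ in stout, and `ldcache_format` is a name invented here:

```python
# Magic strings from glibc's dl-cache.h. Newer glibc stopped emitting the
# old-format preamble, which is what broke the original ldcache::parse.
OLD_MAGIC = b"ld.so-1.7.0"
NEW_MAGIC = b"glibc-ld.so.cache"  # followed by a version string, e.g. "1.1"


def ldcache_format(data):
    """Classify the header of an /etc/ld.so.cache blob."""
    if data.startswith(NEW_MAGIC):
        return "new"
    if data.startswith(OLD_MAGIC):
        return "old"
    return "unknown"
```

A parser that branches on this classification (rather than assuming the old preamble) keeps working on both older and newer distributions.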
[jira] [Commented] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support
[ https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212793#comment-17212793 ] Qian Zhang commented on MESOS-10192: commit 301902be4f1332799cf3b3242cd29b4907c21c09 Author: Qian Zhang Date: Sat Oct 10 15:04:57 2020 +0800 Ignored the directory `/dev/nvidia-caps` when globbing Nvidia GPU devices. The directory `/dev/nvidia-caps` was introduced in CUDA 11.0, just ignore it since we only care about the Nvidia GPU device files. Review: https://reviews.apache.org/r/72945 > Recent Nvidia CUDA changes break Mesos GPU support > -- > > Key: MESOS-10192 > URL: https://issues.apache.org/jira/browse/MESOS-10192 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, gpu >Reporter: Greg Mann >Assignee: Qian Zhang >Priority: Major > Labels: GPU, containerization, containerizer, gpu > > Recently it seems that the layout of the Nvidia device files has changed: > https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ > This prevents GPU tasks from launching: > {noformat} > W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container > c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: > Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a > special file: /dev/nvidia-caps > {noformat} > due to this code, which detects the Nvidia device files: > https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454 -- This message was sent by Atlassian Jira (v8.3.4#803005)
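The fix described in the commit above boils down to skipping directories while globbing `/dev/nvidia*`, so that the `/dev/nvidia-caps` directory added in CUDA 11.0 no longer trips the "Not a special file" check. A minimal sketch; the real isolator is C++ and reads `/dev` directly, while the `devdir` parameter here is added purely for testability:

```python
import os
from glob import glob


def nvidia_device_files(devdir="/dev"):
    """Glob Nvidia GPU device files, skipping directories.

    CUDA 11.0 introduced the /dev/nvidia-caps *directory*, which matches
    the nvidia* pattern but is not a device file; filtering out
    directories mirrors the fix merged for MESOS-10192.
    """
    return [path for path in glob(os.path.join(devdir, "nvidia*"))
            if not os.path.isdir(path)]
```

With this filter in place, only character devices such as `/dev/nvidia0` and `/dev/nvidiactl` are handed to the device-copying code.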
[jira] [Commented] (MESOS-10157) Add documentation for the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212784#comment-17212784 ] Qian Zhang commented on MESOS-10157: commit 3e1e0b37d6a30a2c98d1227b4ac754b1d26686f3 Author: Qian Zhang Date: Wed Sep 9 10:26:52 2020 +0800 Added doc for the `volume/csi` isolator. Review: https://reviews.apache.org/r/72845 > Add documentation for the `volume/csi` isolator > --- > > Key: MESOS-10157 > URL: https://issues.apache.org/jira/browse/MESOS-10157 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Labels: docs, documentation > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`
[ https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212783#comment-17212783 ] Qian Zhang commented on MESOS-10151: commit 90e5434544da9886cd6f2d87b73e3246292af107 Author: Qian Zhang Date: Tue Oct 13 09:58:44 2020 +0800 Corrected the example of the managed CSI plugin. Review: https://reviews.apache.org/r/72846 > Introduce a new agent flag `--csi_plugin_config_dir` > > > Key: MESOS-10151 > URL: https://issues.apache.org/jira/browse/MESOS-10151 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support
[ https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10192: -- Assignee: Qian Zhang RR: https://reviews.apache.org/r/72945/ > Recent Nvidia CUDA changes break Mesos GPU support > -- > > Key: MESOS-10192 > URL: https://issues.apache.org/jira/browse/MESOS-10192 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, gpu >Reporter: Greg Mann >Assignee: Qian Zhang >Priority: Major > Labels: GPU, containerization, containerizer, gpu > > Recently it seems that the layout of the Nvidia device files has changed: > https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ > This prevents GPU tasks from launching: > {noformat} > W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container > c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: > Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a > special file: /dev/nvidia-caps > {noformat} > due to this code, which detects the nvidia device files: > https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203728#comment-17203728 ] Qian Zhang commented on MESOS-10153: commit 8700dd8d5ece658804d7b7a40863800dcc5c72bc Author: Qian Zhang Date: Sat Sep 19 11:11:04 2020 +0800 Inferred CSI volume's `readonly` field from volume mode. Review: https://reviews.apache.org/r/72888 > Implement the `prepare` method of the `volume/csi` isolator > --- > > Key: MESOS-10153 > URL: https://issues.apache.org/jira/browse/MESOS-10153 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202779#comment-17202779 ] Qian Zhang edited comment on MESOS-10190 at 9/27/20, 8:46 AM: -- [~acecile555] Yes, we set container's hostname to its container ID (in UUID format) by writing the container ID into the `/etc/hostname` file in container's mount namespace and also write the line `container-IP container-ID` into container's `/etc/hosts`, so usually libprocess should be able to get the container's IP. I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are correctly written by Mesos for your containers, you can use gdb to start or attach Mesos agent and step into [this method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997] to check if those files are correctly updated. was (Author: qianzhang): [~acecile555] Yes, we set container's hostname to its container ID (in UUID format) by writing the container ID into the `/etc/hostname` file in container's mount namespace and also write `container-IP container-ID` into container's `/etc/hosts`, so usually libprocess should be able to get the container's IP. I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are correctly written by Mesos for your containers, you can use gdb to start or attach Mesos agent and step into [this method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997] to check if those files are correctly updated. > libprocess fails with "Failed to obtain the IP address for " when using > CNI on some hosts > --- > > Key: MESOS-10190 > URL: https://issues.apache.org/jira/browse/MESOS-10190 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.9.0 >Reporter: acecile555 >Priority: Major > > Hello, > > We deployed CNI support and 3 of our hosts (all the same) are failing to > start container with CNI enabled. 
The log file is: > {noformat} > E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to > obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS > service may not be able to resolve it: Name or service not known{noformat} > So I tried enforcing LIBPROCESS_IP using env variable but I saw Mesos > overwrites it. So I rebuilt Mesos with additionnal debugging and here is the > log: > {noformat} > Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to > '0.0.0.0' > E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to > obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS > service may not be able to resolve it: Name or service not known{noformat} > According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I > tried to understand why libprocess attempts to resolve a container run uuid > instead of the hostname, here is libprocess code: > > {noformat} > // Resolve the hostname if ip is 0.0.0.0 in case we actually have > // a valid external IP address. Note that we need only one IP > // address, so that other processes can send and receive and > // don't get confused as to whom they are sending to. > if (__address__.ip.isAny()) { > char hostname[512]; > if (gethostname(hostname, sizeof(hostname)) < 0) { > PLOG(FATAL) << "Failed to initialize, gethostname"; > } > // Lookup an IP address of local hostname, taking the first result. > Try ip = net::getIP(hostname, __address__.ip.family()); > if (ip.isError()) { > EXIT(EXIT_FAILURE) > << "Failed to obtain the IP address for '" << hostname << "';" > << " the DNS service may not be able to resolve it: " << ip.error(); > } > __address__.ip = ip.get(); > } > {noformat} > > Well actually this is perfectly fine, except "gethostname" returns the > container UUID instead of an valid host IP address. How is that even possible > ? > > Any help would be greatly appreciated. > Regards, Adam. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202779#comment-17202779 ] Qian Zhang edited comment on MESOS-10190 at 9/27/20, 8:45 AM: -- [~acecile555] Yes, we set container's hostname to its container ID (in UUID format) by writing the container ID into the `/etc/hostname` file in container's mount namespace and also write `container-IP container-ID` into container's `/etc/hosts`, so usually libprocess should be able to get the container's IP. I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are correctly written by Mesos for your containers, you can use gdb to start or attach Mesos agent and step into [this method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997] to check if those files are correctly updated. was (Author: qianzhang): [~acecile555] Yes, we will set container's hostname to its container ID (in UUID format) by writing the container ID into the `/etc/hostname` file in container's mount namespace and also write `container-IP container-ID` into container's `/etc/hosts`, so usually libprocess should be able to get the container's IP. I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are correctly written by Mesos for your containers, you can use gdb to start or attach Mesos agent and step into [this method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997] to check if those files are correctly updated. > libprocess fails with "Failed to obtain the IP address for " when using > CNI on some hosts > --- > > Key: MESOS-10190 > URL: https://issues.apache.org/jira/browse/MESOS-10190 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.9.0 >Reporter: acecile555 >Priority: Major > > Hello, > > We deployed CNI support and 3 of our hosts (all the same) are failing to > start container with CNI enabled. 
The log file is: > {noformat} > E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to > obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS > service may not be able to resolve it: Name or service not known{noformat} > So I tried enforcing LIBPROCESS_IP using env variable but I saw Mesos > overwrites it. So I rebuilt Mesos with additionnal debugging and here is the > log: > {noformat} > Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to > '0.0.0.0' > E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to > obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS > service may not be able to resolve it: Name or service not known{noformat} > According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I > tried to understand why libprocess attempts to resolve a container run uuid > instead of the hostname, here is libprocess code: > > {noformat} > // Resolve the hostname if ip is 0.0.0.0 in case we actually have > // a valid external IP address. Note that we need only one IP > // address, so that other processes can send and receive and > // don't get confused as to whom they are sending to. > if (__address__.ip.isAny()) { > char hostname[512]; > if (gethostname(hostname, sizeof(hostname)) < 0) { > PLOG(FATAL) << "Failed to initialize, gethostname"; > } > // Lookup an IP address of local hostname, taking the first result. > Try ip = net::getIP(hostname, __address__.ip.family()); > if (ip.isError()) { > EXIT(EXIT_FAILURE) > << "Failed to obtain the IP address for '" << hostname << "';" > << " the DNS service may not be able to resolve it: " << ip.error(); > } > __address__.ip = ip.get(); > } > {noformat} > > Well actually this is perfectly fine, except "gethostname" returns the > container UUID instead of an valid host IP address. How is that even possible > ? > > Any help would be greatly appreciated. > Regards, Adam. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202779#comment-17202779 ] Qian Zhang commented on MESOS-10190: [~acecile555] Yes, we will set the container's hostname to its container ID (in UUID format) by writing the container ID into the `/etc/hostname` file in the container's mount namespace, and we will also write `container-IP container-ID` into the container's `/etc/hosts`, so libprocess should normally be able to get the container's IP. I'd suggest checking whether the `/etc/hostname` and `/etc/hosts` files are correctly written by Mesos for your containers; you can use gdb to start or attach to the Mesos agent and step into [this method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997] to check whether those files are correctly updated. > libprocess fails with "Failed to obtain the IP address for " when using > CNI on some hosts > --- > > Key: MESOS-10190 > URL: https://issues.apache.org/jira/browse/MESOS-10190 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.9.0 >Reporter: acecile555 >Priority: Major > > Hello, > > We deployed CNI support and 3 of our hosts (all the same) are failing to > start containers with CNI enabled. The log file is: > {noformat} > E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to > obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS > service may not be able to resolve it: Name or service not known{noformat} > So I tried enforcing LIBPROCESS_IP using an env variable but I saw Mesos > overwrites it. 
So I rebuilt Mesos with additional debugging and here is the > log: > {noformat} > Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to > '0.0.0.0' > E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to > obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS > service may not be able to resolve it: Name or service not known{noformat} > According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I > tried to understand why libprocess attempts to resolve a container run UUID > instead of the hostname; here is the libprocess code: > > {noformat} > // Resolve the hostname if ip is 0.0.0.0 in case we actually have > // a valid external IP address. Note that we need only one IP > // address, so that other processes can send and receive and > // don't get confused as to whom they are sending to. > if (__address__.ip.isAny()) { > char hostname[512]; > if (gethostname(hostname, sizeof(hostname)) < 0) { > PLOG(FATAL) << "Failed to initialize, gethostname"; > } > // Lookup an IP address of local hostname, taking the first result. > Try<net::IP> ip = net::getIP(hostname, __address__.ip.family()); > if (ip.isError()) { > EXIT(EXIT_FAILURE) > << "Failed to obtain the IP address for '" << hostname << "';" > << " the DNS service may not be able to resolve it: " << ip.error(); > } > __address__.ip = ip.get(); > } > {noformat} > > Well actually this is perfectly fine, except that "gethostname" returns the > container UUID instead of a resolvable hostname. How is that even possible? > > Any help would be greatly appreciated. > Regards, Adam. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10157) Add documentation for the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10157: -- Assignee: Qian Zhang (was: Greg Mann) RR: https://reviews.apache.org/r/72845/ > Add documentation for the `volume/csi` isolator > --- > > Key: MESOS-10157 > URL: https://issues.apache.org/jira/browse/MESOS-10157 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Labels: docs, documentation > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188949#comment-17188949 ] Qian Zhang commented on MESOS-10153: commit a16f3439dca13982bb4a2b9190c24aaf4eb73b0e Author: Qian Zhang Date: Tue Sep 1 20:58:35 2020 +0800 Moved the `volume/csi` isolator's root dir under work dir. The `volume/csi` isolator needs to checkpoint CSI volume state under work dir rather than runtime dir to be consistent with what volume manager does. Otherwise after agent host is rebooted, volume manager may publish some volumes during recovery, and those volumes will never get chance to be unpublished since the `volume/csi` isolator does not know those volumes at all (the contents in runtime dir will be gone after reboot). Review: https://reviews.apache.org/r/72829 > Implement the `prepare` method of the `volume/csi` isolator > --- > > Key: MESOS-10153 > URL: https://issues.apache.org/jira/browse/MESOS-10153 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC
[ https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188157#comment-17188157 ] Qian Zhang commented on MESOS-10182: [~QuellaZhang] Do you still see that build failure after editing the file `src/CMakeLists.txt` as I suggested? Or has the failure disappeared, so that you can now build the Mesos code on Windows successfully? > Mesos failed to build due to error C1083: Cannot open include file: > 'csi/state.pb.h': No such file or directory on windows with MSVC > > > Key: MESOS-10182 > URL: https://issues.apache.org/jira/browse/MESOS-10182 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master >Reporter: QuellaZhang >Priority: Major > Attachments: build.log > > > Hi All, > I tried to build Mesos on Windows with VS2019. It failed to build due to > error C1083: Cannot open include file: 'csi/state.pb.h': No such file or > directory on Windows using MSVC. It can be reproduced on the latest revision > d4634f4 on the master branch. Could you please take a look at this issue? Thanks > a lot! > > *Reproduce steps:* > 1. git clone -c core.autocrlf=true https://github.com/apache/mesos > F:\gitP\apache\mesos > 2. Open a VS 2019 x64 command prompt as admin and browse to > F:\gitP\apache\mesos > 3. mkdir build_amd64 && pushd build_amd64 > 4. cmake -G "Visual Studio 16 2019" -A x64 > -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 .. > 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL% > 6. 
msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln > /t:Rebuild > *ErrorMessage:* > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or 
directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188074#comment-17188074 ] Qian Zhang commented on MESOS-10153: commit 17f28563488ddaeb2daa60b53bd8dc19e25cddef Author: Qian Zhang Date: Wed Aug 26 10:33:26 2020 +0800 Enabled CSI volume access for non-root users. Review: https://reviews.apache.org/r/72804 > Implement the `prepare` method of the `volume/csi` isolator > --- > > Key: MESOS-10153 > URL: https://issues.apache.org/jira/browse/MESOS-10153 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes
[ https://issues.apache.org/jira/browse/MESOS-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188072#comment-17188072 ] Qian Zhang commented on MESOS-10150: commit ea4099028cfe93e1e2fd80e4d30e03057ec27df1 Author: Qian Zhang Date: Sun Aug 30 10:23:06 2020 +0800 Relaxed unknown volume check when unpublishing volumes. Review: https://reviews.apache.org/r/72820 > Refactor CSI volume manager to support pre-provisioned CSI volumes > -- > > Key: MESOS-10150 > URL: https://issues.apache.org/jira/browse/MESOS-10150 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > Fix For: 1.11.0 > > > The existing > [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138] > is like a wrapper for various CSI gRPC calls; we could consider leveraging > it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` > isolator. But there is a problem: the lifecycle of the volumes managed by > VolumeManager starts from the > `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]` > CSI call, but what we plan to support in the MVP is pre-provisioned volumes, so > we need to refactor VolumeManager to make it support pre-provisioned > volumes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC
[ https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187407#comment-17187407 ] Qian Zhang edited comment on MESOS-10182 at 8/31/20, 3:18 AM: -- [~QuellaZhang] Can you please check out the latest code of the Mesos master branch and manually edit the file `src/CMakeLists.txt` by moving the line `slave/csi_server.cpp` from line 154 to line 212 (i.e., under the line `slave/containerizer/mesos/provisioner/utils.cpp`) and then try again? was (Author: qianzhang): [~QuellaZhang] Can you please check out the latest code of the Mesos master branch and manually move the line `slave/csi_server.cpp` from line 154 to line 212 (i.e., under the line `slave/containerizer/mesos/provisioner/utils.cpp`) and then try again? > Mesos failed to build due to error C1083: Cannot open include file: > 'csi/state.pb.h': No such file or directory on windows with MSVC > > > Key: MESOS-10182 > URL: https://issues.apache.org/jira/browse/MESOS-10182 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master >Reporter: QuellaZhang >Priority: Major > Attachments: build.log > > > Hi All, > I tried to build Mesos on Windows with VS2019. It failed to build due to > error C1083: Cannot open include file: 'csi/state.pb.h': No such file or > directory on Windows using MSVC. It can be reproduced on the latest revision > d4634f4 on the master branch. Could you please take a look at this issue? Thanks > a lot! > > *Reproduce steps:* > 1. git clone -c core.autocrlf=true https://github.com/apache/mesos > F:\gitP\apache\mesos > 2. Open a VS 2019 x64 command prompt as admin and browse to > F:\gitP\apache\mesos > 3. mkdir build_amd64 && pushd build_amd64 > 4. cmake -G "Visual Studio 16 2019" -A x64 > -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 .. > 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL% > 6. 
msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln > /t:Rebuild > *ErrorMessage:* > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or 
directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC
[ https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187407#comment-17187407 ] Qian Zhang commented on MESOS-10182: [~QuellaZhang] Can you please check out the latest code of the Mesos master branch and manually move the line `slave/csi_server.cpp` from line 154 to line 212 (i.e., under the line `slave/containerizer/mesos/provisioner/utils.cpp`) and then try again? > Mesos failed to build due to error C1083: Cannot open include file: > 'csi/state.pb.h': No such file or directory on windows with MSVC > > > Key: MESOS-10182 > URL: https://issues.apache.org/jira/browse/MESOS-10182 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master >Reporter: QuellaZhang >Priority: Major > Attachments: build.log > > > Hi All, > I tried to build Mesos on Windows with VS2019. It failed to build due to > error C1083: Cannot open include file: 'csi/state.pb.h': No such file or > directory on Windows using MSVC. It can be reproduced on the latest revision > d4634f4 on the master branch. Could you please take a look at this issue? Thanks > a lot! > > *Reproduce steps:* > 1. git clone -c core.autocrlf=true https://github.com/apache/mesos > F:\gitP\apache\mesos > 2. Open a VS 2019 x64 command prompt as admin and browse to > F:\gitP\apache\mesos > 3. mkdir build_amd64 && pushd build_amd64 > 4. cmake -G "Visual Studio 16 2019" -A x64 > -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 .. > 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL% > 6. 
msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln > /t:Rebuild > *ErrorMessage:* > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open > include file: 'csi/state.pb.h': No such file or 
directory > (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) > (compiling source file > F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) > [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10148) Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins
[ https://issues.apache.org/jira/browse/MESOS-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183662#comment-17183662 ] Qian Zhang commented on MESOS-10148: commit 2d2265de7df7801612fc2f104f9c8f455a97a1fd Author: Qian Zhang Date: Thu Aug 20 17:08:32 2020 +0800 Introduced the `CSIPluginInfo.target_path_exists` field. Review: https://reviews.apache.org/r/72788 > Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI > plugins > > > Key: MESOS-10148 > URL: https://issues.apache.org/jira/browse/MESOS-10148 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > > See > [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#bookmark=id.x6m8mytigrg7] > for the detailed design. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183661#comment-17183661 ] Qian Zhang commented on MESOS-10155: commit d8647b018fbcfc38ccf0e39bfeae9118e275068f Author: Qian Zhang Date: Thu Aug 20 17:09:36 2020 +0800 Refactored state recovery in `volume/csi` isolator. Read the checkpointed CSI volume state directly in protobuf message way. Review: https://reviews.apache.org/r/72789 > Implement the `recover` method of the `volume/csi` isolator > --- > > Key: MESOS-10155 > URL: https://issues.apache.org/jira/browse/MESOS-10155 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes
[ https://issues.apache.org/jira/browse/MESOS-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179313#comment-17179313 ] Qian Zhang commented on MESOS-10150: commit 014431e3c1b98e514e327318b52e5c54cc6174df Author: Qian Zhang Date: Mon Aug 17 19:22:48 2020 +0800 Updated volume manager to support user specified target path root. Review: https://reviews.apache.org/r/72781 > Refactor CSI volume manager to support pre-provisioned CSI volumes > -- > > Key: MESOS-10150 > URL: https://issues.apache.org/jira/browse/MESOS-10150 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > Fix For: 1.11.0 > > > The existing > [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138] > is like a wrapper for various CSI gRPC calls; we could consider leveraging > it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` > isolator. But there is a problem: the lifecycle of the volumes managed by > VolumeManager starts from the > `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]` > CSI call, but what we plan to support in the MVP is pre-provisioned volumes, so > we need to refactor VolumeManager to make it support pre-provisioned > volumes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`
[ https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177405#comment-17177405 ] Qian Zhang commented on MESOS-10151: commit 831f172de7908ad8e40d14905cacb3a9c053e832 Author: Qian Zhang Date: Thu Aug 13 16:37:48 2020 +0800 Updated the help message of the agent flag `--csi_plugin_config_dir`. This is to make the help message of the agent flag `--csi_plugin_config_dir` aligned with the latest protobuf message `CSIPluginInfo`. Review: https://reviews.apache.org/r/72770 > Introduce a new agent flag `--csi_plugin_config_dir` > > > Key: MESOS-10151 > URL: https://issues.apache.org/jira/browse/MESOS-10151 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10175) Improve CSI service manager to set node ID for managed CSI plugins
Qian Zhang created MESOS-10175: -- Summary: Improve CSI service manager to set node ID for managed CSI plugins Key: MESOS-10175 URL: https://issues.apache.org/jira/browse/MESOS-10175 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Qian Zhang For some CSI plugins (like the NFS CSI plugin), the node service needs a node ID specified by the container orchestrator (see [here|https://github.com/kubernetes-csi/csi-driver-nfs/blob/d94b64bbb3171a45dd91f8686611a062c0dd6219/deploy/kubernetes/csi-nodeplugin-nfsplugin.yaml#L49] for an example), so we need to improve our CSI service manager to set it when launching managed CSI plugins. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10156) Enable the `volume/csi` isolator in UCR
[ https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10156: -- Assignee: Qian Zhang > Enable the `volume/csi` isolator in UCR > --- > > Key: MESOS-10156 > URL: https://issues.apache.org/jira/browse/MESOS-10156 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10155: -- Assignee: Qian Zhang > Implement the `recover` method of the `volume/csi` isolator > --- > > Key: MESOS-10155 > URL: https://issues.apache.org/jira/browse/MESOS-10155 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10154) Implement the `cleanup` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10154: -- Assignee: Qian Zhang > Implement the `cleanup` method of the `volume/csi` isolator > --- > > Key: MESOS-10154 > URL: https://issues.apache.org/jira/browse/MESOS-10154 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
Qian Zhang created MESOS-10163: -- Summary: Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls Key: MESOS-10163 URL: https://issues.apache.org/jira/browse/MESOS-10163 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Greg Mann *Background:* Originally we wanted the `volume/csi` isolator to leverage the existing [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51] to launch CSI plugins as standalone containers, and currently the service manager needs to call the following agent HTTP APIs: # `GET_CONTAINERS` to get all standalone containers in its `recover` method. # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone containers in its `recover` method. # `LAUNCH_CONTAINER` via the existing [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46] to launch a CSI plugin as a standalone container when its `getEndpoint` method is called. The problem with the above design is that the `volume/csi` isolator may need to clean up orphan containers during agent recovery, which is triggered by the containerizer (see [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275] for details). To clean up an orphan container which is using a CSI volume, the `volume/csi` isolator needs to instantiate and recover the service manager and get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` method will be called by the `volume/csi` isolator during agent recovery). And as I mentioned above, the service manager’s `getEndpoint` may need to call `LAUNCH_CONTAINER` to launch the CSI plugin as a standalone container; since the agent is still in the recovering state, such an agent HTTP call will simply be rejected by the agent. So we have to instantiate and recover the service manager *after agent recovery is done*, but in the `volume/csi` isolator we do not have that information (i.e., the signal that agent recovery is done). 
*Solution* We need to implement a new component (like `CSIVolumeManager`, or a better name?) in the Mesos agent which is responsible for launching CSI plugins as standalone containers (via the existing [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]) and making CSI gRPC calls (via the existing [volume manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]). * We can instantiate this new component in the `main` method of the agent and pass it to both the containerizer and the agent (i.e., it will be a member of the `Slave` object), and the containerizer will in turn pass it to the `volume/csi` isolator. * Since this new component relies on the service manager, which will call agent HTTP APIs, we need to pass the agent URL to it, like `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + "/api/v1")`; see [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471] for an example. * When the agent registers/reregisters with the master (`Slave::registered` and `Slave::reregistered`), we should call this new component’s `start` method (see [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742] and [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827] as examples), which will scan the directory specified by `--csi_plugin_config_dir` and create the `service manager - volume manager` pair for each CSI plugin loaded from that directory. * The `volume/csi` isolator needs to call this new component’s `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup` methods. In the case of cleaning up orphan containers during agent recovery, the `volume/csi` isolator will just call this new component’s `unpublishVolume` method as usual, and it is this new component’s responsibility to only make the actual CSI gRPC call after agent recovery is done and the agent has registered with the master (i.e., when this new component’s `start` method is called). 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10152) Implement the `create` method of the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10152: -- Assignee: Qian Zhang > Implement the `create` method of the `volume/csi` isolator > -- > > Key: MESOS-10152 > URL: https://issues.apache.org/jira/browse/MESOS-10152 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`
[ https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10151: -- Assignee: Qian Zhang > Introduce a new agent flag `--csi_plugin_config_dir` > > > Key: MESOS-10151 > URL: https://issues.apache.org/jira/browse/MESOS-10151 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10149) Refactor CSI service manager to support unmanaged CSI plugins
[ https://issues.apache.org/jira/browse/MESOS-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10149: -- Assignee: Qian Zhang > Refactor CSI service manager to support unmanaged CSI plugins > - > > Key: MESOS-10149 > URL: https://issues.apache.org/jira/browse/MESOS-10149 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > Refactor the [CSI service > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L81] > to support unmanaged plugins (i.e. plugins deployed outside of Mesos) and > make its `getServiceEndpoint` method able to also return unmanaged > plugins' endpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10148) Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins
[ https://issues.apache.org/jira/browse/MESOS-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10148: -- Assignee: Qian Zhang RR: [https://reviews.apache.org/r/72661/] > Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI > plugins > > > Key: MESOS-10148 > URL: https://issues.apache.org/jira/browse/MESOS-10148 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > See > [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#bookmark=id.x6m8mytigrg7] > for the detailed design. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10147) Introduce a new volume type `CSI` into the `Volume` protobuf message
[ https://issues.apache.org/jira/browse/MESOS-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10147: -- Assignee: Qian Zhang RR: [https://reviews.apache.org/r/72660/] > Introduce a new volume type `CSI` into the `Volume` protobuf message > > > Key: MESOS-10147 > URL: https://issues.apache.org/jira/browse/MESOS-10147 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > See > [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.l7wa1w8789pg] > for the detailed design. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10157) Add document for the `volume/csi` isolator
Qian Zhang created MESOS-10157: -- Summary: Add document for the `volume/csi` isolator Key: MESOS-10157 URL: https://issues.apache.org/jira/browse/MESOS-10157 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10156) Enable the `volume/csi` isolator in UCR
Qian Zhang created MESOS-10156: -- Summary: Enable the `volume/csi` isolator in UCR Key: MESOS-10156 URL: https://issues.apache.org/jira/browse/MESOS-10156 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator
Qian Zhang created MESOS-10155: -- Summary: Implement the `recover` method of the `volume/csi` isolator Key: MESOS-10155 URL: https://issues.apache.org/jira/browse/MESOS-10155 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10154) Implement the `cleanup` method of the `volume/csi` isolator
Qian Zhang created MESOS-10154: -- Summary: Implement the `cleanup` method of the `volume/csi` isolator Key: MESOS-10154 URL: https://issues.apache.org/jira/browse/MESOS-10154 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator
Qian Zhang created MESOS-10153: -- Summary: Implement the `prepare` method of the `volume/csi` isolator Key: MESOS-10153 URL: https://issues.apache.org/jira/browse/MESOS-10153 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10152) Implement the `create` method of the `volume/csi` isolator
Qian Zhang created MESOS-10152: -- Summary: Implement the `create` method of the `volume/csi` isolator Key: MESOS-10152 URL: https://issues.apache.org/jira/browse/MESOS-10152 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`
[ https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152554#comment-17152554 ] Qian Zhang commented on MESOS-10151: See [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.iobmmefa9bop] for the detailed design. > Introduce a new agent flag `--csi_plugin_config_dir` > > > Key: MESOS-10151 > URL: https://issues.apache.org/jira/browse/MESOS-10151 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10151) Implement the `create` method of the `volume/csi` isolator
Qian Zhang created MESOS-10151: -- Summary: Implement the `create` method of the `volume/csi` isolator Key: MESOS-10151 URL: https://issues.apache.org/jira/browse/MESOS-10151 Project: Mesos Issue Type: Task Reporter: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes
Qian Zhang created MESOS-10150: -- Summary: Refactor CSI volume manager to support pre-provisioned CSI volumes Key: MESOS-10150 URL: https://issues.apache.org/jira/browse/MESOS-10150 Project: Mesos Issue Type: Task Reporter: Qian Zhang The existing [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138] is essentially a wrapper around the various CSI gRPC calls, so we could consider leveraging it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` isolator. But there is a problem: the lifecycle of the volumes managed by `VolumeManager` starts from the [createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329] CSI call, whereas what we plan to support in the MVP is pre-provisioned volumes. So we need to refactor `VolumeManager` to make it support pre-provisioned volumes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
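The refactoring idea above can be sketched as follows. This is a hedged Python illustration with hypothetical names (`register_preprovisioned` is invented for the sketch; it is not an actual `VolumeManager` method): a pre-provisioned volume skips the `createVolume` call and enters the manager's tracking directly, after which publish/unpublish work the same either way.

```python
# Hypothetical sketch of supporting pre-provisioned volumes in a
# VolumeManager-like component. States are simplified.
VOL_UNKNOWN, VOL_CREATED, VOL_PUBLISHED = "UNKNOWN", "CREATED", "PUBLISHED"

class VolumeManager:
    def __init__(self):
        self.states = {}  # volume_id -> lifecycle state

    def create_volume(self, name):
        # Managed lifecycle: begins with the CreateVolume CSI call.
        volume_id = "managed-" + name
        self.states[volume_id] = VOL_CREATED
        return volume_id

    def register_preprovisioned(self, volume_id):
        # New entry point: the volume already exists outside Mesos, so
        # we only start tracking it; no CreateVolume call is made.
        self.states[volume_id] = VOL_CREATED

    def publish(self, volume_id):
        if self.states.get(volume_id, VOL_UNKNOWN) != VOL_CREATED:
            raise ValueError("volume not tracked: " + volume_id)
        # NodeStageVolume / NodePublishVolume CSI calls would go here.
        self.states[volume_id] = VOL_PUBLISHED


vm = VolumeManager()
vm.register_preprovisioned("ebs-vol-123")  # no CreateVolume needed
vm.publish("ebs-vol-123")
assert vm.states["ebs-vol-123"] == VOL_PUBLISHED
```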
[jira] [Created] (MESOS-10149) Refactor CSI service manager to support unmanaged CSI plugins
Qian Zhang created MESOS-10149: -- Summary: Refactor CSI service manager to support unmanaged CSI plugins Key: MESOS-10149 URL: https://issues.apache.org/jira/browse/MESOS-10149 Project: Mesos Issue Type: Task Reporter: Qian Zhang Refactor the [CSI service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L81] to support unmanaged plugins (i.e. plugins deployed outside of Mesos) and make its `getServiceEndpoint` method able to also return unmanaged plugins' endpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005)
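The managed/unmanaged distinction can be sketched like this. A minimal Python illustration with hypothetical names and config fields (the real service manager is C++ and launches standalone containers via the agent API; the `managed`/`endpoint` keys here are invented for the sketch):

```python
# Hypothetical sketch: for a managed plugin the service manager launches
# it as a standalone container and returns the resulting endpoint; for
# an unmanaged plugin (deployed outside Mesos) it simply returns the
# endpoint configured by the operator.
class ServiceManager:
    def __init__(self, plugin_config):
        # plugin_config example:
        #   {"name": "my-csi", "managed": False,
        #    "endpoint": "unix:///var/run/my-csi/csi.sock"}
        self.config = plugin_config
        self.launched = False

    def get_service_endpoint(self):
        if self.config["managed"]:
            if not self.launched:
                # The real implementation would launch the plugin as a
                # standalone container via the agent HTTP API here.
                self.launched = True
            return "unix:///run/mesos/csi/" + self.config["name"] + ".sock"
        # Unmanaged: no container to launch; just hand back the
        # operator-provided endpoint.
        return self.config["endpoint"]
```

Callers (e.g. the volume manager) would then be oblivious to whether a plugin is managed or not.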
[jira] [Created] (MESOS-10148) Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins
Qian Zhang created MESOS-10148: -- Summary: Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins Key: MESOS-10148 URL: https://issues.apache.org/jira/browse/MESOS-10148 Project: Mesos Issue Type: Task Reporter: Qian Zhang See [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#bookmark=id.x6m8mytigrg7] for the detailed design. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10147) Introduce a new volume type `CSI` into the `Volume` protobuf message
Qian Zhang created MESOS-10147: -- Summary: Introduce a new volume type `CSI` into the `Volume` protobuf message Key: MESOS-10147 URL: https://issues.apache.org/jira/browse/MESOS-10147 Project: Mesos Issue Type: Task Reporter: Qian Zhang See [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.l7wa1w8789pg] for detailed design. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10142) CSI External Volumes MVP Design Doc
[ https://issues.apache.org/jira/browse/MESOS-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152443#comment-17152443 ] Qian Zhang commented on MESOS-10142: Design doc: https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit?usp=sharing > CSI External Volumes MVP Design Doc > --- > > Key: MESOS-10142 > URL: https://issues.apache.org/jira/browse/MESOS-10142 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Qian Zhang >Priority: Major > Labels: csi, external-volumes, storage > > This ticket tracks the design doc for our initial implementation of external > volume support in Mesos using the CSI standard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure
[ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923 ] Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:37 AM: -- When this issue happens, via the `top` command I see `wa` is high which should be caused by `kswapd0` {code:java} top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05 Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31211.2 total,208.8 free, 30836.6 used,165.8 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0 ... {code} Please note swap is NOT enabled in the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and OOM killer is not triggered at all, that means we may hit [this issue|https://askubuntu.com/questions/432809/why-is-kswapd0-running-on-a-computer-with-no-swap/1134491#1134491]: {quote}It is a well known problem that when Linux runs out of memory it can enter swap loops instead of doing what it should be doing, killing processes to free up ram. There are an OOM (Out of Memory) killer that does this but only if Swap and RAM are full. However this should not really be a problem. If there are a bunch of offending processes, for example Firefox and Chrome, each with tabs that are both using and grabbing memory, then these processes will cause swap read back. Linux then enters a loop where the same memory are being moved back and forth between memory and hard drive. This in turn causes priority inversion where swapping a few processes back and forth makes the system unresponsive. 
{quote} was (Author: qianzhang): When this issue happens, via the `top` command I see `wa` is high which should be caused by `kswapd0` {code:java} top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05 Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31211.2 total,208.8 free, 30836.6 used,165.8 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0 ... {code} Please note swap is NOT enabled in the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and OOM killer is not triggered at all. > Mesos agent host may become unresponsive when it is under low memory pressure > - > > Key: MESOS-10139 > URL: https://issues.apache.org/jira/browse/MESOS-10139 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Priority: Major > > When user launches a task to use a large number of memory on an agent host > (e.g., launch a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on > an agent host which have 32GB memory), the whole agent host will become > unresponsive (no commands can be executed anymore, but still pingable). A few > minutes later Mesos master will mark this agent as unreachable and update all > its task’s state to `TASK_UNREACHABLE`. 
> {code:java} > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling > transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE > because of health check timeout > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > … > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating > the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 > of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) >
[jira] [Commented] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure
[ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129929#comment-17129929 ] Qian Zhang commented on MESOS-10139: I asked a [question|https://unix.stackexchange.com/questions/591566/why-does-linux-become-unresponsive-when-a-large-number-of-memory-is-used-oom-ca] on StackExchange about this issue and found that it has actually been discussed in the Linux community for a long time. The solution is to run a daemon that monitors memory pressure and kills a memory-hog process (or triggers the OOM killer to do so) when the system is in a low-memory condition. [~greggomann] also suggests that we could fix this issue by setting `/sys/fs/cgroups/memory/mesos/memory.limit_in_bytes` to the allocatable memory of the agent (rather than leaving it at the default value) and also ensuring that `memory.use_hierarchy` is enabled. The [current logic|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/containerizer.cpp#L145:L158] to determine the allocatable memory of an agent node may also need to be changed: currently, in most cases we just leave 1GB for system services and all other memory can be offered to frameworks, but for agent nodes with relatively large memory this may not be enough. For example, on an agent node with 32GB of memory, the node may become unresponsive once tasks have used 29GB. So instead of an absolute value (1GB), we may want to adopt a relative ratio, e.g. leave 10% of memory for system services and offer the other 90% to frameworks. But we need to figure out a reasonable and safe ratio. 
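For concreteness, here is the arithmetic behind the absolute-vs-ratio headroom comparison for the 32GB node discussed above. The 10% figure is the example ratio from the comment, not a tuned value:

```python
# Comparing the current absolute headroom (leave 1 GB for system
# services) with a proposed relative one (leave 10%), in MB.
GB = 1024  # MB per GB

def allocatable_absolute(total_mb, headroom_mb=1 * GB):
    # Current logic (simplified): offer everything but a fixed 1 GB.
    return total_mb - headroom_mb

def allocatable_ratio(total_mb, ratio=0.10):
    # Proposed logic: offer (1 - ratio) of total memory.
    return int(total_mb * (1 - ratio))

total = 32 * GB  # 32768 MB
assert allocatable_absolute(total) == 31744  # 31 GB offered, only 1 GB headroom
assert allocatable_ratio(total) == 29491     # ~28.8 GB offered, ~3.2 GB headroom
```

On a 32GB node the ratio approach keeps roughly 3.2GB free instead of 1GB, which would have avoided the 29GB scenario described in this ticket.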
> Mesos agent host may become unresponsive when it is under low memory pressure > - > > Key: MESOS-10139 > URL: https://issues.apache.org/jira/browse/MESOS-10139 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Priority: Major > > When user launches a task to use a large number of memory on an agent host > (e.g., launch a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on > an agent host which have 32GB memory), the whole agent host will become > unresponsive (no commands can be executed anymore, but still pingable). A few > minutes later Mesos master will mark this agent as unreachable and update all > its task’s state to `TASK_UNREACHABLE`. > {code:java} > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling > transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE > because of health check timeout > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > … > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating > the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 > of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating > the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 > of framework 
89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > ...{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure
[ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923 ] Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:29 AM: -- When this issue happens, via the `top` command I see `wa` is high which should be caused by `kswapd0` {code:java} top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05 Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31211.2 total,208.8 free, 30836.6 used,165.8 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0 ... {code} Please note swap is NOT enabled in the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and OOM killer is not triggered at all. was (Author: qianzhang): When this issue happens, via the `top` command I see `wa` is high which should be caused by `kswapd0` {code:java} top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05 Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31211.2 total,208.8 free, 30836.6 used,165.8 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0 ... {code} Please note the swap is NOT enabled in the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and OOM killer is not triggered at all. 
> Mesos agent host may become unresponsive when it is under low memory pressure > - > > Key: MESOS-10139 > URL: https://issues.apache.org/jira/browse/MESOS-10139 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Priority: Major > > When user launches a task to use a large number of memory on an agent host > (e.g., launch a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on > an agent host which have 32GB memory), the whole agent host will become > unresponsive (no commands can be executed anymore, but still pingable). A few > minutes later Mesos master will mark this agent as unreachable and update all > its task’s state to `TASK_UNREACHABLE`. > {code:java} > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling > transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE > because of health check timeout > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > … > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating > the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 > of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating > the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 > of framework 
89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > ...{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure
[ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923 ] Qian Zhang commented on MESOS-10139: When this issue happens, via the `top` command I see that `wa` is high, which is likely caused by `kswapd0`:
{code:java}
top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05
Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie
%Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31211.2 total, 208.8 free, 30836.6 used, 165.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
  103 root  20   0     0    0    0 R 100.0  0.0 2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and the OOM killer is not triggered at all. > Mesos agent host may become unresponsive when it is under low memory pressure > - > > Key: MESOS-10139 > URL: https://issues.apache.org/jira/browse/MESOS-10139 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Priority: Major > > When user launches a task to use a large number of memory on an agent host > (e.g., launch a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on > an agent host which have 32GB memory), the whole agent host will become > unresponsive (no commands can be executed anymore, but still pingable). A few > minutes later Mesos master will mark this agent as unreachable and update all > its task’s state to `TASK_UNREACHABLE`. 
> {code:java} > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling > transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE > because of health check timeout > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > … > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating > the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 > of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating > the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 > of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > ...{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure
[ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923 ] Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:28 AM: -- When this issue happens, via the `top` command I see `wa` is high which should be caused by `kswapd0` {code:java} top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05 Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31211.2 total,208.8 free, 30836.6 used,165.8 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0 ... {code} Please note the swap is NOT enabled in the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and OOM killer is not triggered at all. was (Author: qianzhang): When this issue happens, via the `top` command I see `wa` is high which should be caused by `kswapd0` {code:java} top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05 Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31211.2 total,208.8 free, 30836.6 used,165.8 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0 ... {code} Please note the swap is NOT enabled in the agent host, so it seems `kswapd0` tries to page out the executable code of some processes and OOM killer is not triggered at all. 
> Mesos agent host may become unresponsive when it is under low memory pressure > - > > Key: MESOS-10139 > URL: https://issues.apache.org/jira/browse/MESOS-10139 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Priority: Major > > When user launches a task to use a large number of memory on an agent host > (e.g., launch a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on > an agent host which have 32GB memory), the whole agent host will become > unresponsive (no commands can be executed anymore, but still pingable). A few > minutes later Mesos master will mark this agent as unreachable and update all > its task’s state to `TASK_UNREACHABLE`. > {code:java} > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling > transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE > because of health check timeout > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked > agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: > health check timed out > … > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating > the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 > of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal > mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating > the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 > of framework 
89d2d679-fa08-49be-94c3-880ebb595212- (latest state: > TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) > ...{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure
Qian Zhang created MESOS-10139: -- Summary: Mesos agent host may become unresponsive when it is under low memory pressure Key: MESOS-10139 URL: https://issues.apache.org/jira/browse/MESOS-10139 Project: Mesos Issue Type: Bug Reporter: Qian Zhang When a user launches a task that uses a large amount of memory on an agent host (e.g., a task running `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an agent host which has 32GB of memory), the whole agent host will become unresponsive (no commands can be executed anymore, but it is still pingable). A few minutes later the Mesos master will mark this agent as unreachable and update all its tasks’ states to `TASK_UNREACHABLE`. {code:java} May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE because of health check timeout May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out … May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
TASK_UNREACHABLE, status update state: TASK_UNREACHABLE) ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-7884) Support containerd on Mesos.
[ https://issues.apache.org/jira/browse/MESOS-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127810#comment-17127810 ] Qian Zhang commented on MESOS-7884: --- Understood; however, to support containerd we may need to implement another containerizer to integrate with it, and that is not our long-term plan. We would rather keep improving UCR with new features than maintain multiple containerizers. > Support containerd on Mesos. > > > Key: MESOS-7884 > URL: https://issues.apache.org/jira/browse/MESOS-7884 > Project: Mesos > Issue Type: Epic > Components: containerization >Reporter: Gilbert Song >Priority: Major > Labels: containerd, containerizer > > containerd v1.0 is very close (v1.0.0 alpha 4 now) to the formal release. We > should consider support containerd on Mesos, either by refactoring the docker > containerizer or introduce a new containerd containerizer. Design and > suggestions are definitely welcome. > https://github.com/containerd/containerd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation
[ https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118250#comment-17118250 ] Qian Zhang edited comment on MESOS-10126 at 5/29/20, 12:09 PM: --- Master branch: commit 2845330fbd78a80fb7e71c6101724655fa254392 Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the volume is not being used by any containers. That means we may never get a chance to unmount this volume on this agent, furthermore if it is an EBS volume, it cannot be used by any tasks launched on any other agents since a EBS volume can only be attached to one node at a time. The only workaround would manually unmount the volume. So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase container's `Info` struct before unmounting volumes. Review: [https://reviews.apache.org/r/72516] commit b7c3da5a28fb46b4517d52872aec504fff098967 Author: Qian Zhang Date: Sun May 17 23:30:38 2020 +0800 Added a test `ROOT_CommandTaskNoRootfsWithUnmountVolumeFailure`. Review: [https://reviews.apache.org/r/72523] was (Author: qianzhang): commit 2845330fbd78a80fb7e71c6101724655fa254392 Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. 
This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the volume is not being used by any containers. That means we may never get a chance to unmount this volume on this agent, furthermore if it is an EBS volume, it cannot be used by any tasks launched on any other agents since a EBS volume can only be attached to one node at a time. The only workaround would manually unmount the volume. So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase container's `Info` struct before unmounting volumes. Review: [https://reviews.apache.org/r/72516] commit b7c3da5a28fb46b4517d52872aec504fff098967 Author: Qian Zhang Date: Sun May 17 23:30:38 2020 +0800 Added a test `ROOT_CommandTaskNoRootfsWithUnmountVolumeFailure`. Review: [https://reviews.apache.org/r/72523] > Docker volume isolator needs to clean up the `info` struct regardless the > result of unmount operation > - > > Key: MESOS-10126 > URL: https://issues.apache.org/jira/browse/MESOS-10126 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Critical > Fix For: 1.4.4, 1.5.4, 1.6.3, 1.8.2, 1.9.1, 1.7.4, 1.11.0, 1.10.1 > > > Currently when > [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610] > is called, we will unmount the volume first, but if the unmount operation > fails we will not remove the container's checkpoint directory and NOT erase > the container's `info` struct from `infos`. This is problematic, because the > remaining `info` in the `infos` will cause the reference count of the volume > is larger than 0, but actually the volume is not being used by any > containers. 
And next time when another container using this volume is > destroyed, we will NOT unmount the volume since its reference count will be > larger than 1 (see > [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651] > for details) which should be 2, so we will never have chance to unmount this > volume. > We have this issue since Mesos 1.0.0 release when Docker volume isolator was > introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005)
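The reference-count trap described in this issue can be illustrated with a toy model (a Python sketch, not the actual C++ isolator; `VolumeIsolator`, `infos`, and `cleanup` here only mimic the structures named above). Keeping the container's `Info` entry when unmount fails pins the volume's reference count above zero forever; erasing the entry first keeps the count truthful even when the unmount itself fails:

```python
class VolumeIsolator:
    """Toy model of the Docker volume isolator's cleanup path."""
    def __init__(self):
        self.infos = {}          # container_id -> set of volume names

    def ref_count(self, volume):
        # A volume's reference count is how many containers' Info
        # entries still mention it.
        return sum(volume in vols for vols in self.infos.values())

    def cleanup(self, container_id, unmount):
        # The fix: drop the container's Info *before* unmounting, so a
        # failed unmount can no longer inflate the reference count.
        volumes = self.infos.pop(container_id)
        for v in volumes:
            if self.ref_count(v) == 0:   # last user gone; try to unmount
                try:
                    unmount(v)
                except OSError:
                    pass  # volume stays mounted, but the count is correct

iso = VolumeIsolator()
iso.infos = {"c1": {"vol"}, "c2": {"vol"}}

def failing_unmount(v):
    raise OSError("device busy")

iso.cleanup("c1", failing_unmount)
assert iso.ref_count("vol") == 1     # c2 genuinely still uses the volume
iso.cleanup("c2", failing_unmount)
assert iso.ref_count("vol") == 0     # no stale Info pins the volume
```

With the pre-fix ordering (erase only after a successful unmount), the first failed cleanup would leave the count at 2 forever, which is the bug the commit describes.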
[jira] [Commented] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation
[ https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119535#comment-17119535 ] Qian Zhang commented on MESOS-10126: 1.10.x: commit 97251a90d3336bd628c82becca00f545d95b01aa Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the volume is not being used by any containers. That means we may never get a chance to unmount this volume on this agent, furthermore if it is an EBS volume, it cannot be used by any tasks launched on any other agents since a EBS volume can only be attached to one node at a time. The only workaround would manually unmount the volume. So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase container's `Info` struct before unmounting volumes. Review: [https://reviews.apache.org/r/72516] 1.9.x: commit dcce73d57b4d8866fedb3f287d978a135616afb3 Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the volume is not being used by any containers. 
That means we may never get a chance to unmount this volume on this agent, furthermore if it is an EBS volume, it cannot be used by any tasks launched on any other agents since a EBS volume can only be attached to one node at a time. The only workaround would manually unmount the volume. So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase container's `Info` struct before unmounting volumes. Review: [https://reviews.apache.org/r/72516] 1.8.x: commit cdd3e2924596eecf605eeb73e9c57f23f6643936 Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the volume is not being used by any containers. That means we may never get a chance to unmount this volume on this agent, furthermore if it is an EBS volume, it cannot be used by any tasks launched on any other agents since a EBS volume can only be attached to one node at a time. The only workaround would manually unmount the volume. So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase container's `Info` struct before unmounting volumes. Review: [https://reviews.apache.org/r/72516] 1.7.x: commit 819b9d8345e701321067f3b14ad2bb78b60d285c Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. 
This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the volume is not being used by any containers. That means we may never get a chance to unmount this volume on this agent, furthermore if it is an EBS volume, it cannot be used by any tasks launched on any other agents since a EBS volume can only be attached to one node at a time. The only workaround would manually unmount the volume. So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase container's `Info` struct before unmounting volumes. Review: [https://reviews.apache.org/r/72516] 1.6.x: commit b0a57116c6794f5d0036ed9c3668f27f29155bd7 Author: Qian Zhang Date: Fri May 15 10:23:51 2020 +0800 Erased `Info` struct before unmouting volumes in Docker volume isolator. Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will unmount the volume first, and if the unmount operation fails we will NOT erase the container's `Info` struct from `infos`. This is problematic because the remaining `Info` in `infos` will cause the reference count of the volume is greater than 0, but actually the
[jira] [Commented] (MESOS-10130) Docker Manifest list support
[ https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111231#comment-17111231 ] Qian Zhang commented on MESOS-10130: {quote}Btw, should manifest lists be supported by Mesos ? IMO It makes sense because it run on multiple architectures. {quote} Yes, I agree. > Docker Manifest list support > > > Key: MESOS-10130 > URL: https://issues.apache.org/jira/browse/MESOS-10130 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Stéphane Cottin >Priority: Major > Labels: containerization > > Sonatype Nexus 3.22+, and probably other docker registry solutions, now > serves manifest lists. > [https://issues.sonatype.org/browse/NEXUS-18546] > Apache Mesos does not support yet this part of the Image Manifest V2S2 spec. > https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list > This is not a critical issue as Sonatype Nexus is not a dependency of Apache > Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security > issues. > [https://support.sonatype.com/hc/en-us/articles/360046233714] > Apache Mesos should support the whole Image Manifest V2S2 specification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10130) Docker Manifest list support
[ https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111207#comment-17111207 ] Qian Zhang commented on MESOS-10130: [~kaalh] I think for multi-arch images, [https://registry-1.docker.io|https://registry-1.docker.io/] supports both manifest lists and manifests. For example, I can get the manifest list and the manifest for the image `alpine:latest`: {code:java} $ DH_TOKEN=$(curl -fsSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" | jq -er '.token') # Get manifest list $ curl -s -S -L -i --raw -H "Authorization: Bearer ${DH_TOKEN}" -H "Accept: application/vnd.docker.distribution.manifest.list.v2+json" -y 60 https://registry-1.docker.io:443/v2/library/alpine/manifests/latest HTTP/1.1 200 Connection establishedHTTP/1.1 200 OK Content-Length: 1638 Content-Type: application/vnd.docker.distribution.manifest.list.v2+json Docker-Content-Digest: sha256:9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54 Docker-Distribution-Api-Version: registry/2.0 Etag: "sha256:9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54" Date: Tue, 19 May 2020 12:58:33 GMT Strict-Transport-Security: 
max-age=31536000{"manifests":[{"digest":"sha256:39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"amd64","os":"linux"},"size":528},{"digest":"sha256:0ff8a9dffabb5ed8dcba4ee898f62683305b75b4086f433ee722db99138f4f53","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"arm","os":"linux","variant":"v6"},"size":528},{"digest":"sha256:19c4e520fa84832d6deab48cd911067e6d8b0a9fa73fc054c7b9031f1d89e4cf","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"arm","os":"linux","variant":"v7"},"size":528},{"digest":"sha256:ad295e950e71627e9d0d14cdc533f4031d42edae31ab57a841c5b9588eacc280","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"arm64","os":"linux","variant":"v8"},"size":528},{"digest":"sha256:b28e271d721b3f6377cb5bae6cd4506d2736e77ef6f70ed9b0c4716da8bdf17c","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"386","os":"linux"},"size":528},{"digest":"sha256:e095eb9ac24e21bf2621f4d243274197ef12b91c67cde023092301b2db1e073c","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"ppc64le","os":"linux"},"size":528},{"digest":"sha256:41ba0806c6113064dd4cff12212eea3088f40ae23f182763ccc07f430b3a52f8","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"s390x","os":"linux"},"size":528}],"mediaType":"application\/vnd.docker.distribution.manifest.list.v2+json","schemaVersion":2} # Get manifest $ curl -s -S -L -i --raw -H "Authorization: Bearer ${DH_TOKEN}" -H "Accept: application/vnd.docker.distribution.manifest.v2+json" -y 60 https://registry-1.docker.io:443/v2/library/alpine/manifests/latest HTTP/1.1 200 Connection establishedHTTP/1.1 200 OK Content-Length: 528 Content-Type: 
application/vnd.docker.distribution.manifest.v2+json Docker-Content-Digest: sha256:39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01 Docker-Distribution-Api-Version: registry/2.0 Etag: "sha256:39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01" Date: Tue, 19 May 2020 12:56:23 GMT Strict-Transport-Security: max-age=31536000{ "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "config": { "mediaType": "application/vnd.docker.container.image.v1+json", "size": 1507, "digest": "sha256:f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a" }, "layers": [ { "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", "size": 2813316, "digest": "sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08" } ] }{code} > Docker Manifest list support > > > Key: MESOS-10130 > URL: https://issues.apache.org/jira/browse/MESOS-10130 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Stéphane Cottin >Priority: Major > Labels: containerization > > Sonatype Nexus 3.22+, and probably other docker registry solutions, now > serves manifest lists. > [https://issues.sonatype.org/browse/NEXUS-18546] > Apache Mesos does not support yet this part of the Image Manifest V2S2 spec. > https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list > This is not a critical issue as Sonatype Nexus is not a dependency of Apache > Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security > issues. >
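For reference, this is how a client consumes a manifest list like the one returned above: pick the child manifest whose `platform` matches the local architecture, then fetch that digest. A minimal Python sketch (digests trimmed to placeholders; field names follow the V2S2 response shown):

```python
import json

def select_manifest(manifest_list, architecture, os="linux", variant=None):
    """Return the digest of the child manifest matching the platform,
    or None if the list has no entry for it."""
    for entry in manifest_list["manifests"]:
        p = entry["platform"]
        if (p["architecture"] == architecture and p["os"] == os
                and p.get("variant") == variant):
            return entry["digest"]
    return None

# Trimmed excerpt of the manifest list response above.
doc = json.loads('''{"manifests": [
  {"digest": "sha256:39eda93d...",
   "platform": {"architecture": "amd64", "os": "linux"}},
  {"digest": "sha256:ad295e95...",
   "platform": {"architecture": "arm64", "os": "linux", "variant": "v8"}}]}''')

print(select_manifest(doc, "amd64"))                 # amd64/linux digest
print(select_manifest(doc, "arm64", variant="v8"))   # arm64/v8 digest
```

The second `curl` in the comment then uses such a digest (or the tag with the plain `manifest.v2+json` Accept header) to retrieve the architecture-specific manifest.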
[jira] [Comment Edited] (MESOS-10130) Docker Manifest list support
[ https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111010#comment-17111010 ] Qian Zhang edited comment on MESOS-10130 at 5/19/20, 9:27 AM: -- [~kaalh] Is the use of manifest list in Nexus 3.22+ optional or required? I guess it should be optional (for backward compatibility) so Mesos can still work with it via manifest, right? was (Author: qianzhang): [~kaalh] Is the use of manifest list in Nexus 3.22+ optional or required? I guess it should be optional so Mesos can still work with it via manifest, right? > Docker Manifest list support > > > Key: MESOS-10130 > URL: https://issues.apache.org/jira/browse/MESOS-10130 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Stéphane Cottin >Priority: Major > Labels: containerization > > Sonatype Nexus 3.22+, and probably other docker registry solutions, now > serves manifest lists. > [https://issues.sonatype.org/browse/NEXUS-18546] > Apache Mesos does not support yet this part of the Image Manifest V2S2 spec. > https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list > This is not a critical issue as Sonatype Nexus is not a dependency of Apache > Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security > issues. > [https://support.sonatype.com/hc/en-us/articles/360046233714] > Apache Mesos should support the whole Image Manifest V2S2 specification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10130) Docker Manifest list support
[ https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111010#comment-17111010 ] Qian Zhang commented on MESOS-10130: [~kaalh] Is the use of manifest list in Nexus 3.22+ optional or required? I guess it should be optional so Mesos can still work with it via manifest, right? > Docker Manifest list support > > > Key: MESOS-10130 > URL: https://issues.apache.org/jira/browse/MESOS-10130 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Stéphane Cottin >Priority: Major > Labels: containerization > > Sonatype Nexus 3.22+, and probably other docker registry solutions, now > serves manifest lists. > [https://issues.sonatype.org/browse/NEXUS-18546] > Apache Mesos does not support yet this part of the Image Manifest V2S2 spec. > https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list > This is not a critical issue as Sonatype Nexus is not a dependency of Apache > Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security > issues. > [https://support.sonatype.com/hc/en-us/articles/360046233714] > Apache Mesos should support the whole Image Manifest V2S2 specification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-7884) Support containerd on Mesos.
[ https://issues.apache.org/jira/browse/MESOS-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110980#comment-17110980 ] Qian Zhang commented on MESOS-7884: --- [~xiaowei-cuc] I do not think we have an immediate plan for this. Could you let us know your specific use cases, i.e., why do you need containerd support? What features are missing from the Docker support in our current Docker containerizer? > Support containerd on Mesos. > > > Key: MESOS-7884 > URL: https://issues.apache.org/jira/browse/MESOS-7884 > Project: Mesos > Issue Type: Epic > Components: containerization >Reporter: Gilbert Song >Priority: Major > Labels: containerd, containerizer > > containerd v1.0 is very close (v1.0.0 alpha 4 now) to the formal release. We > should consider support containerd on Mesos, either by refactoring the docker > containerizer or introduce a new containerd containerizer. Design and > suggestions are definitely welcome. > https://github.com/containerd/containerd -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10127) The sequences used in Docker volume isolator are never erased
[ https://issues.apache.org/jira/browse/MESOS-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110321#comment-17110321 ] Qian Zhang commented on MESOS-10127: It seems there is no proper place in the Docker volume isolator's code to erase the sequence. If we erase the sequence right after the unmount operation is invoked (like right after [this line|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L658]), and another container tries to use the same volume at the same time (so we need to mount the volume), then the unmount and mount operations could happen simultaneously for the same volume, which is exactly what the sequence is meant to avoid. If we erase the sequence after the unmount operation is complete (like in [DockerVolumeIsolatorProcess::_cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L670]), and another container tries to use the same volume before the sequence is erased but after the unmount operation is complete, then we could erase the sequence while the mount operation is still ongoing, which could cause the mount operation to be discarded. > The sequences used in Docker volume isolator are never erased > - > > Key: MESOS-10127 > URL: https://issues.apache.org/jira/browse/MESOS-10127 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Priority: Major > > In Docker volume isolator, we use > [sequence|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp#L119:L122] > to make sure the mount and unmount operations for a single volume are issued > serially, but the sequence is never erased which could be a memory leak. > We have this issue since Mesos 1.0.0 release when Docker volume isolator was > introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005)
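The constraint discussed in this comment can be modeled as a per-volume FIFO of operations (a plain Python sketch, not the libprocess `Sequence` API): the queue may only be erased at a moment when it is provably empty and no new operation can be racing in, which is exactly the window that is hard to find in the isolator.

```python
from collections import deque

class VolumeSequences:
    """Per-volume FIFO: mounts and unmounts for one volume run serially."""
    def __init__(self):
        self.sequences = {}    # volume -> deque of pending operations

    def add(self, volume, op):
        self.sequences.setdefault(volume, deque()).append(op)

    def run_one(self, volume):
        op = self.sequences[volume].popleft()
        op()
        # Erasing unconditionally here would be unsafe: another container
        # may have queued a mount behind us, and dropping the deque would
        # discard it; that is the race described in the comment. Only an
        # empty deque can be erased, and in the real isolator even that
        # check must be atomic with respect to concurrent add() calls.
        if not self.sequences[volume]:
            del self.sequences[volume]

seqs = VolumeSequences()
log = []
seqs.add("vol", lambda: log.append("unmount"))
seqs.add("vol", lambda: log.append("mount"))   # queued by another container
seqs.run_one("vol")
assert "vol" in seqs.sequences      # queue survives: a mount is still pending
seqs.run_one("vol")
assert "vol" not in seqs.sequences  # empty queue can finally be erased
```

In single-threaded Python the empty check is trivially safe; in the actor-based isolator the equivalent check has no such atomicity guarantee, which is why the ticket calls this a leak without an obvious fix point.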
[jira] [Comment Edited] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation
[ https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108245#comment-17108245 ] Qian Zhang edited comment on MESOS-10126 at 5/18/20, 2:49 AM: -- RR: [https://reviews.apache.org/r/72516/] [https://reviews.apache.org/r/72523/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72516/] > Docker volume isolator needs to clean up the `info` struct regardless the > result of unmount operation > - > > Key: MESOS-10126 > URL: https://issues.apache.org/jira/browse/MESOS-10126 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Critical > > Currently when > [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610] > is called, we will unmount the volume first, but if the unmount operation > fails we will not remove the container's checkpoint directory and NOT erase > the container's `info` struct from `infos`. This is problematic, because the > remaining `info` in the `infos` will cause the reference count of the volume > is larger than 0, but actually the volume is not being used by any > containers. And next time when another container using this volume is > destroyed, we will NOT unmount the volume since its reference count will be > larger than 1 (see > [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651] > for details) which should be 2, so we will never have chance to unmount this > volume. > We have this issue since Mesos 1.0.0 release when Docker volume isolator was > introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation
[ https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10126: -- Sprint: Studio 1: RI-23 68 Story Points: 3 Assignee: Qian Zhang RR: [https://reviews.apache.org/r/72516/] > Docker volume isolator needs to clean up the `info` struct regardless the > result of unmount operation > - > > Key: MESOS-10126 > URL: https://issues.apache.org/jira/browse/MESOS-10126 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Critical > > Currently when > [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610] > is called, we will unmount the volume first, but if the unmount operation > fails we will not remove the container's checkpoint directory and NOT erase > the container's `info` struct from `infos`. This is problematic, because the > remaining `info` in the `infos` will cause the reference count of the volume > is larger than 0, but actually the volume is not being used by any > containers. And next time when another container using this volume is > destroyed, we will NOT unmount the volume since its reference count will be > larger than 1 (see > [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651] > for details) which should be 2, so we will never have chance to unmount this > volume. > We have this issue since Mesos 1.0.0 release when Docker volume isolator was > introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10127) The sequences used in Docker volume isolator are never erased
Qian Zhang created MESOS-10127: -- Summary: The sequences used in Docker volume isolator are never erased Key: MESOS-10127 URL: https://issues.apache.org/jira/browse/MESOS-10127 Project: Mesos Issue Type: Task Components: containerization Reporter: Qian Zhang In Docker volume isolator, we use [sequence|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp#L119:L122] to make sure the mount and unmount operations for a single volume are issued serially, but the sequence is never erased which could be a memory leak. We have this issue since Mesos 1.0.0 release when Docker volume isolator was introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation
Qian Zhang created MESOS-10126: -- Summary: Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation Key: MESOS-10126 URL: https://issues.apache.org/jira/browse/MESOS-10126 Project: Mesos Issue Type: Task Components: containerization Reporter: Qian Zhang Currently when [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610] is called, we will unmount the volume first, but if the unmount operation fails we will not remove the container's checkpoint directory and NOT erase the container's `info` struct from `infos`. This is problematic, because the remaining `info` in the `infos` will cause the reference count of the volume is larger than 0, but actually the volume is not being used by any containers. And next time when another container using this volume is destroyed, we will NOT unmount the volume since its reference count will be larger than 1 (see [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651] for details) which should be 2, so we will never have chance to unmount this volume. We have this issue since Mesos 1.0.0 release when Docker volume isolator was introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099654#comment-17099654 ] Qian Zhang commented on MESOS-10054: [~mzhu] mentioned that they are using the Docker containerizer to launch custom executors in Docker containers, so we still need the above patch; I have reopened and committed it. > Update Docker containerizer to set Docker container’s resource limits and > `oom_score_adj` > - > > Key: MESOS-10054 > URL: https://issues.apache.org/jira/browse/MESOS-10054 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.10.0 > > > This is to set resource limits for executor which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request
[ https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099630#comment-17099630 ] Qian Zhang commented on MESOS-10049: All the code changes about the newly introduced task status reason `REASON_CONTAINER_MEMORY_REQUEST_EXCEEDED` (i.e., the above two patches) have been reverted; see the patch below for details. commit 6bb60a4869394f663a09370016127ae8688cbe06 Author: Qian Zhang Date: Mon Apr 27 22:34:51 2020 +0800 Reverted the changes about `REASON_CONTAINER_MEMORY_REQUEST_EXCEEDED`. The method `MemorySubsystemProcess::oomWaited()` will only be invoked when the container is OOM killed because it uses more memory than its hard memory limit (i.e., the task status reason `REASON_CONTAINER_LIMITATION_MEMORY`), it will NOT be invoked when a burstable container is OOM killed because the agent host is running out of memory, i.e., we will NOT receive OOM killing notification via cgroups notification API for this case. So it is not possible for Mesos to provide a task status reason `REASON_CONTAINER_MEMORY_REQUEST_EXCEEDED` for this case. Review: https://reviews.apache.org/r/72442 > Add a new reason in `TaskStatus::Reason` for the case that a task is > OOM-killed due to exceeding its memory request > --- > > Key: MESOS-10049 > URL: https://issues.apache.org/jira/browse/MESOS-10049 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request
[ https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099626#comment-17099626 ] Qian Zhang commented on MESOS-10049: commit be90edd31a1833c5ed706b39f3a5547ae8153dd2 Author: Greg Mann g...@mesosphere.io Date: Mon Apr 6 15:16:45 2020 -0700 Sent appropriate task status reason when task over memory request. Review: https://reviews.apache.org/r/72305/ > Add a new reason in `TaskStatus::Reason` for the case that a task is > OOM-killed due to exceeding its memory request > --- > > Key: MESOS-10049 > URL: https://issues.apache.org/jira/browse/MESOS-10049 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094262#comment-17094262 ] Qian Zhang commented on MESOS-10053: commit 68ce1476aebe10db7107c0f3dc813af78ec20cef Author: Qian Zhang Date: Mon Apr 27 14:14:15 2020 +0800 Set OOM score adj when Docker container's memory limit is infinite. Review: https://reviews.apache.org/r/72435 > Update Docker executor to set Docker container’s resource limits and > `oom_score_adj` > > > Key: MESOS-10053 > URL: https://issues.apache.org/jira/browse/MESOS-10053 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.10.0 > > > This is to set resource limits for command task which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087845#comment-17087845 ] Qian Zhang edited comment on MESOS-10054 at 4/22/20, 8:32 AM: -- RR: [https://reviews.apache.org/r/72391/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72401/] [https://reviews.apache.org/r/72391/] > Update Docker containerizer to set Docker container’s resource limits and > `oom_score_adj` > - > > Key: MESOS-10054 > URL: https://issues.apache.org/jira/browse/MESOS-10054 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for executor which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-8877) Docker container's resources will be wrongly enlarged in cgroups after agent recovery
[ https://issues.apache.org/jira/browse/MESOS-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-8877: - Story Points: 3 (was: 5) Assignee: Qian Zhang RR: [https://reviews.apache.org/r/72401/] > Docker container's resources will be wrongly enlarged in cgroups after agent > recovery > - > > Key: MESOS-8877 > URL: https://issues.apache.org/jira/browse/MESOS-8877 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.6.1, 1.6.0, 1.5.1, 1.5.0, 1.4.2, 1.4.1, 1.4.0 >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Critical > Labels: containerization > > Reproduce steps: > 1. Run `mesos-execute --master=10.0.49.2:5050 > --task=[file:///home/qzhang/workspace/config/task_docker.json] > --checkpoint=true` to launch a Docker container. > {code:java} > # cat task_docker.json > { > "name": "test", > "task_id": {"value" : "test"}, > "agent_id": {"value" : ""}, > "resources": [ > {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}}, > {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}} > ], > "command": { > "value": "sleep 5" > }, > "container": { > "type": "DOCKER", > "docker": { > "image": "alpine" > } > } > } > {code} > 2. When the Docker container is running, we can see its resources in cgroups > are correctly set, so far so good. > {code:java} > # cat > /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/cpu.cfs_quota_us > > 1 > # cat > /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/memory.limit_in_bytes > > 33554432 > {code} > 3. Restart Mesos agent, and then we will see the resources of the Docker > container will be wrongly enlarged. 
> {code} > I0503 02:06:17.268340 29512 docker.cpp:1855] Updated 'cpu.shares' to 204 at > /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106 > for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88 > I0503 02:06:17.271390 29512 docker.cpp:1882] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container > 1b21295b-2f49-4d08-84c7-43b9ae15ad88 > I0503 02:06:17.273082 29512 docker.cpp:1924] Updated > 'memory.soft_limit_in_bytes' to 64MB for container > 1b21295b-2f49-4d08-84c7-43b9ae15ad88 > I0503 02:06:17.275908 29512 docker.cpp:1950] Updated 'memory.limit_in_bytes' > to 64MB at > /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106 > for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88 > # cat > /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/cpu.cfs_quota_us > 2 > # cat > /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/memory.limit_in_bytes > 67108864 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
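The before/after cgroup values in the reproduction logs follow directly from how cpus and memory map onto the cgroup knobs; a minimal sketch of that arithmetic (assuming the default 100 ms CFS period, which matches the `cpu.cfs_period_us` shown in the log; the helper name is illustrative, not Mesos code):

```python
# Default CFS bandwidth period in microseconds (100 ms).
CFS_PERIOD_US = 100_000

def cgroup_values(cpus, mem_mb):
    """Map task resources onto the cgroup knobs quoted in the logs above."""
    return {
        "cpu.cfs_quota_us": round(cpus * CFS_PERIOD_US),
        "memory.limit_in_bytes": mem_mb * 1024 * 1024,
    }

# The task requested 0.1 cpus and 32 MB, which is what step 2 observes:
#   cpu.cfs_quota_us      = 10000 (10 ms)
#   memory.limit_in_bytes = 33554432
# After agent recovery the log shows the values for 0.2 cpus and 64 MB
# (20 ms quota, 67108864 bytes), i.e., the container's resources doubled.
```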
[jira] [Comment Edited] (MESOS-10117) Update the `usage()` method of containerizer to set resource limits in the `ResourceStatistics` protobuf message
[ https://issues.apache.org/jira/browse/MESOS-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088682#comment-17088682 ] Qian Zhang edited comment on MESOS-10117 at 4/22/20, 8:15 AM: -- RR: [https://reviews.apache.org/r/72398/] [https://reviews.apache.org/r/72399/] [https://reviews.apache.org/r/72402/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72398/] [https://reviews.apache.org/r/72399/] [https://reviews.apache.org/r/72400/] [https://reviews.apache.org/r/72402/] > Update the `usage()` method of containerizer to set resource limits in the > `ResourceStatistics` protobuf message > > > Key: MESOS-10117 > URL: https://issues.apache.org/jira/browse/MESOS-10117 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > In the `ResourceStatistics` protobuf message, there are a couple of issues: > # There are already `cpu_limit` and `mem_limit_bytes` fields, but they are > actually CPU & memory requests when resource limits are specified for a task. > # There is already a `mem_soft_limit_bytes` field, but this field does not > seem to be set anywhere. > So we need to update this protobuf message and also the related containerizer > code which sets the fields of this protobuf message. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087845#comment-17087845 ] Qian Zhang edited comment on MESOS-10054 at 4/21/20, 1:22 PM: -- RR: [https://reviews.apache.org/r/72401/] [https://reviews.apache.org/r/72391/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72391/] > Update Docker containerizer to set Docker container’s resource limits and > `oom_score_adj` > - > > Key: MESOS-10054 > URL: https://issues.apache.org/jira/browse/MESOS-10054 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for executor which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10054: -- Assignee: Qian Zhang > Update Docker containerizer to set Docker container’s resource limits and > `oom_score_adj` > - > > Key: MESOS-10054 > URL: https://issues.apache.org/jira/browse/MESOS-10054 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for executor which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10117) Update the `usage()` method of containerizer to set resource limits in `ResourceStatistics`
Qian Zhang created MESOS-10117: -- Summary: Update the `usage()` method of containerizer to set resource limits in `ResourceStatistics` Key: MESOS-10117 URL: https://issues.apache.org/jira/browse/MESOS-10117 Project: Mesos Issue Type: Task Components: containerization Reporter: Qian Zhang Assignee: Qian Zhang In the `ResourceStatistics` protobuf message, there are a couple of issues: # There are already `cpu_limit` and `mem_limit_bytes` fields, but they are actually CPU & memory requests when resource limits are specified for a task. # There is already a `mem_soft_limit_bytes` field, but this field does not seem to be set anywhere. So we need to update this protobuf message and also the related containerizer code which sets the fields of this protobuf message. -- This message was sent by Atlassian Jira (v8.3.4#803005)
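The request-vs-limit confusion the ticket describes can be illustrated with a hypothetical record that keeps the two apart (field names here are purely illustrative, not the final protobuf schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContainerMemStats:
    """Illustrative split of the fields MESOS-10117 wants disentangled."""
    # What the task asked for — the value the existing (misleadingly
    # named) `mem_limit_bytes` field actually carries today.
    mem_request_bytes: int
    # The actual hard limit, when one is specified; None means unlimited.
    mem_limit_bytes: Optional[int] = None
    # Soft limit (cgroup `memory.soft_limit_in_bytes`).
    mem_soft_limit_bytes: Optional[int] = None
```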
[jira] [Assigned] (MESOS-10115) Add document for task resource limits
[ https://issues.apache.org/jira/browse/MESOS-10115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10115: -- Assignee: Greg Mann (was: Qian Zhang) > Add document for task resource limits > - > > Key: MESOS-10115 > URL: https://issues.apache.org/jira/browse/MESOS-10115 > Project: Mesos > Issue Type: Task > Components: documentation >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10115) Add document for task resource limits
Qian Zhang created MESOS-10115: -- Summary: Add document for task resource limits Key: MESOS-10115 URL: https://issues.apache.org/jira/browse/MESOS-10115 Project: Mesos Issue Type: Task Components: documentation Reporter: Qian Zhang Assignee: Qian Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10048) Update the memory subsystem in the cgroup isolator to set container’s memory resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065426#comment-17065426 ] Qian Zhang commented on MESOS-10048: https://reviews.apache.org/r/72263/ > Update the memory subsystem in the cgroup isolator to set container’s memory > resource limits and `oom_score_adj` > > > Key: MESOS-10048 > URL: https://issues.apache.org/jira/browse/MESOS-10048 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.10.0 > > > Update the memory subsystem in the cgroup isolator to set container’s memory > resource limits and `oom_score_adj` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10046) Launch executor container with resource limits
[ https://issues.apache.org/jira/browse/MESOS-10046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986931#comment-16986931 ] Qian Zhang edited comment on MESOS-10046 at 3/20/20, 9:13 AM: -- RR: [https://reviews.apache.org/r/71856/] [https://reviews.apache.org/r/71858/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/71856/] > Launch executor container with resource limits > -- > > Key: MESOS-10046 > URL: https://issues.apache.org/jira/browse/MESOS-10046 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.10.0 > > > We need to add resource limits into `ContainerConfig` first, and then set the > resources limits in it according to the executor/task resource limits when > launching executor container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10064) Accommodate the "Infinity" value in JSON
[ https://issues.apache.org/jira/browse/MESOS-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063209#comment-17063209 ] Qian Zhang edited comment on MESOS-10064 at 3/20/20, 9:11 AM: -- commit 0b47b43d290494fc1c6a6f6241ddfbceeb686997 Author: Qian Zhang Date: Sun Feb 23 09:53:32 2020 +0800 Added patch for RapidJSON. This commit updates the writer of RapidJSON to write infinite floating point numbers as "Infinity" and "-Infinity" (i.e., with double quotes) rather than Infinity and -Infinity. This is to ensure the strings converted from JSON objects conform to the rule defined by Protobuf: [https://developers.google.com/protocol-buffers/docs/proto3#json] Review: [https://reviews.apache.org/r/72161] commit ec82a516918ebd663816cb110f73bdee6e5268be Author: Qian Zhang Date: Sun Feb 23 10:09:48 2020 +0800 Accommodated the "Infinity" value in the JSON <-> Protobuf conversion. Review: [https://reviews.apache.org/r/72162] was (Author: qianzhang): commit 0b47b43d290494fc1c6a6f6241ddfbceeb686997 Author: Qian Zhang Date: Sun Feb 23 09:53:32 2020 +0800 Added patch for RapidJSON. This commit updates the writer of RapidJSON to write infinite floating point numbers as "Infinity" and "-Infinity" (i.e., with double quotes) rather than Infinity and -Infinity. This is to ensure the strings converted from JSON objects conform to the rule defined by Protobuf: https://developers.google.com/protocol-buffers/docs/proto3#json Review: [https://reviews.apache.org/r/72161] commit ec82a516918ebd663816cb110f73bdee6e5268be Author: Qian Zhang Date: Sun Feb 23 10:09:48 2020 +0800 Accommodated the "Infinity" value in the JSON <-> Protobuf conversion. 
Review: [https://reviews.apache.org/r/72162] > Accommodate the "Infinity" value in JSON > > > Key: MESOS-10064 > URL: https://issues.apache.org/jira/browse/MESOS-10064 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.10.0 > > > See > [here|https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?ts=5de78977#heading=h.ejuvxat6x3eb] > for what need to be done for this ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005)
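The proto3 JSON mapping referenced in the commit message requires non-finite floating point values to be encoded as the quoted strings "NaN", "Infinity", and "-Infinity", since bare Infinity is not valid JSON. A minimal sketch of that writer-side rule (illustrative Python, not the RapidJSON patch itself):

```python
import json
import math

def proto3_number(value):
    """Encode a double per the proto3 JSON mapping: non-finite values
    become the strings "NaN", "Infinity", or "-Infinity"."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value

# Substituting the quoted string keeps the output valid, parseable JSON.
print(json.dumps({"cpu_limit": proto3_number(float("inf"))}))
# -> {"cpu_limit": "Infinity"}
```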
[jira] [Comment Edited] (MESOS-10064) Accommodate the "Infinity" value in JSON
[ https://issues.apache.org/jira/browse/MESOS-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044015#comment-17044015 ] Qian Zhang edited comment on MESOS-10064 at 3/20/20, 9:09 AM: -- RR: [https://reviews.apache.org/r/72161/] [https://reviews.apache.org/r/72161/] was (Author: qianzhang): RR: https://reviews.apache.org/r/72161/ > Accommodate the "Infinity" value in JSON > > > Key: MESOS-10064 > URL: https://issues.apache.org/jira/browse/MESOS-10064 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > See > [here|https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?ts=5de78977#heading=h.ejuvxat6x3eb] > for what need to be done for this ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017766#comment-17017766 ] Qian Zhang edited comment on MESOS-10053 at 3/7/20, 9:04 AM: - RR: [https://reviews.apache.org/r/72022/] [https://reviews.apache.org/r/72027/] [https://reviews.apache.org/r/72211/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72022/] [https://reviews.apache.org/r/72027/] > Update Docker executor to set Docker container’s resource limits and > `oom_score_adj` > > > Key: MESOS-10053 > URL: https://issues.apache.org/jira/browse/MESOS-10053 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for command task which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits
[ https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989493#comment-16989493 ] Qian Zhang edited comment on MESOS-10047 at 3/7/20, 9:00 AM: - RR: [https://reviews.apache.org/r/71886/] [https://reviews.apache.org/r/71953/] [https://reviews.apache.org/r/71955/] [https://reviews.apache.org/r/71956/] [https://reviews.apache.org/r/72210/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/71886/] [https://reviews.apache.org/r/71953/] [https://reviews.apache.org/r/71955/] [https://reviews.apache.org/r/71956/] > Update the CPU subsystem in the cgroup isolator to set container's CPU > resource limits > -- > > Key: MESOS-10047 > URL: https://issues.apache.org/jira/browse/MESOS-10047 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10064) Accommodate the "Infinity" value in JSON
[ https://issues.apache.org/jira/browse/MESOS-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10064: -- Assignee: Qian Zhang > Accommodate the "Infinity" value in JSON > > > Key: MESOS-10064 > URL: https://issues.apache.org/jira/browse/MESOS-10064 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > See > [here|https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?ts=5de78977#heading=h.ejuvxat6x3eb] > for what need to be done for this ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10051) Update the `LaunchContainer` agent API to support container resource limits
[ https://issues.apache.org/jira/browse/MESOS-10051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10051: -- Sprint: Studio 1: RI-23 64 Story Points: 2 Assignee: Qian Zhang RR: https://reviews.apache.org/r/72040/ > Update the `LaunchContainer` agent API to support container resource limits > --- > > Key: MESOS-10051 > URL: https://issues.apache.org/jira/browse/MESOS-10051 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10063) Update default executor to call `LAUNCH_CONTAINER` to launch nested containers
[ https://issues.apache.org/jira/browse/MESOS-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10063: -- Sprint: Studio 1: RI-23 64 Story Points: 2 Assignee: Qian Zhang RR: [https://reviews.apache.org/r/72041/] > Update default executor to call `LAUNCH_CONTAINER` to launch nested containers > -- > > Key: MESOS-10063 > URL: https://issues.apache.org/jira/browse/MESOS-10063 > Project: Mesos > Issue Type: Task > Components: executor >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > The default executor will be updated to use the LAUNCH_CONTAINER call instead > of the LAUNCH_NESTED_CONTAINER call when launching nested containers. This > will allow the default executor to set task limits when launching its task > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017766#comment-17017766 ] Qian Zhang edited comment on MESOS-10053 at 1/20/20 8:59 AM: - RR: [https://reviews.apache.org/r/72022/] [https://reviews.apache.org/r/72027/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/72022/] > Update Docker executor to set Docker container’s resource limits and > `oom_score_adj` > > > Key: MESOS-10053 > URL: https://issues.apache.org/jira/browse/MESOS-10053 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for command task which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10053: -- Sprint: Studio 1: RI-23 64 Story Points: 3 Assignee: Qian Zhang RR: [https://reviews.apache.org/r/72022/] > Update Docker executor to set Docker container’s resource limits and > `oom_score_adj` > > > Key: MESOS-10053 > URL: https://issues.apache.org/jira/browse/MESOS-10053 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This is to set resource limits for command task which will run as a Docker > container. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10087) Updated master & agent's HTTP endpoints for showing resource limits
Qian Zhang created MESOS-10087: -- Summary: Updated master & agent's HTTP endpoints for showing resource limits Key: MESOS-10087 URL: https://issues.apache.org/jira/browse/MESOS-10087 Project: Mesos Issue Type: Task Reporter: Qian Zhang We need to update Mesos master's `/state`, `/frameworks`, `/tasks` endpoints and agent's `/state` endpoint to show task's resource limits in their outputs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10050) Update the `update()` method of containerizer to handle container resource limits
[ https://issues.apache.org/jira/browse/MESOS-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008336#comment-17008336 ] Qian Zhang edited comment on MESOS-10050 at 1/11/20 2:59 PM: - RR: [https://reviews.apache.org/r/71950/] [https://reviews.apache.org/r/71951/] [https://reviews.apache.org/r/71952/] [https://reviews.apache.org/r/71983/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/71950/] [https://reviews.apache.org/r/71951/] [https://reviews.apache.org/r/71952/] > Update the `update()` method of containerizer to handle container resource > limits > - > > Key: MESOS-10050 > URL: https://issues.apache.org/jira/browse/MESOS-10050 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits
[ https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989493#comment-16989493 ] Qian Zhang edited comment on MESOS-10047 at 1/6/20 8:50 AM: RR: [https://reviews.apache.org/r/71886/] [https://reviews.apache.org/r/71953/] [https://reviews.apache.org/r/71955/] [https://reviews.apache.org/r/71956/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/71886/] [https://reviews.apache.org/r/71953/] > Update the CPU subsystem in the cgroup isolator to set container's CPU > resource limits > -- > > Key: MESOS-10047 > URL: https://issues.apache.org/jira/browse/MESOS-10047 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits
[ https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989493#comment-16989493 ] Qian Zhang edited comment on MESOS-10047 at 1/6/20 6:22 AM: RR: [https://reviews.apache.org/r/71886/] [https://reviews.apache.org/r/71953/] was (Author: qianzhang): RR: [https://reviews.apache.org/r/71886/] > Update the CPU subsystem in the cgroup isolator to set container's CPU > resource limits > -- > > Key: MESOS-10047 > URL: https://issues.apache.org/jira/browse/MESOS-10047 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10050) Update the `update()` method of containerizer to handle container resource limits
[ https://issues.apache.org/jira/browse/MESOS-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-10050: -- Sprint: Studio 1: RI-22 62 Story Points: 5 Assignee: Qian Zhang > Update the `update()` method of containerizer to handle container resource > limits > - > > Key: MESOS-10050 > URL: https://issues.apache.org/jira/browse/MESOS-10050 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10050) Update the `update()` method of containerizer to handle container resource limits
[ https://issues.apache.org/jira/browse/MESOS-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008336#comment-17008336 ] Qian Zhang commented on MESOS-10050: RR: [https://reviews.apache.org/r/71950/] [https://reviews.apache.org/r/71951/] [https://reviews.apache.org/r/71952/] > Update the `update()` method of containerizer to handle container resource > limits > - > > Key: MESOS-10050 > URL: https://issues.apache.org/jira/browse/MESOS-10050 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10048) Update the memory subsystem in the cgroup isolator to set container’s memory resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006525#comment-17006525 ] Qian Zhang commented on MESOS-10048: RR: [https://reviews.apache.org/r/71943/] [https://reviews.apache.org/r/71944/] > Update the memory subsystem in the cgroup isolator to set container’s memory > resource limits and `oom_score_adj` > > > Key: MESOS-10048 > URL: https://issues.apache.org/jira/browse/MESOS-10048 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > [|https://reviews.apache.org/r/71944/] -- This message was sent by Atlassian Jira (v8.3.4#803005)