[jira] [Comment Edited] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos

2022-08-26 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585253#comment-17585253
 ] 

Qian Zhang edited comment on MESOS-10234 at 8/26/22 9:09 AM:
-

According to 
[https://blogs.apache.org/security/entry/cve-2021-44228], it seems ZooKeeper is 
not affected by 
[CVE-2021-44228|https://www.cve.org/CVERecord?id=CVE-2021-44228].


was (Author: qianzhang):
According to [https://blogs.apache.org/security/entry/cve-2021-44228,] it seems 
ZooKeeper is not affected by 
[CVE-2021-44228|https://www.cve.org/CVERecord?id=CVE-2021-44228].

> CVE-2021-44228 Log4j vulnerability for apache mesos
> ---
>
> Key: MESOS-10234
> URL: https://issues.apache.org/jira/browse/MESOS-10234
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11.0
>Reporter: Sangita Nalkar
>Priority: Critical
>
> Hi,
> Wanted to know if the CVE-2021-44228 Log4j vulnerability affects Apache 
> Mesos.
> We see that log4j v1.2.17 is used while building Apache Mesos from source.
> Snippet from build logs:
> std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF 
> jvm/org/apache/.deps/libjava_la-log4j.Tpo -c 
> ../../src/jvm/org/apache/log4j.cpp  -fPIC -DPIC -o 
> jvm/org/apache/.libs/libjava_la-log4j.o
> Thanks,
> Sangita
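For context: CVE-2021-44228 ("Log4Shell") affects Log4j 2.x, specifically log4j-core 2.0-beta9 through 2.14.1; the log4j v1.2.17 seen in the Mesos build logs is the older 1.x line, which does not contain the vulnerable JNDI lookup (though 1.x has separate, unrelated CVEs). A minimal, illustrative Python sketch for flagging jar names in the vulnerable range — the function name and the filename heuristic are ours, not anything from Mesos:

```python
import re

# Log4Shell (CVE-2021-44228) affects log4j-core 2.0-beta9 through 2.14.1.
# Log4j 1.x (e.g. the 1.2.17 bundled by Mesos) is a separate code line and
# is not affected by this particular CVE.
VULNERABLE = re.compile(r"log4j-core-2\.(\d+)(?:\.(\d+))?.*\.jar$")

def is_log4shell_vulnerable(jar_name):
    """Heuristic check of a jar file name against the CVE-2021-44228 range."""
    m = VULNERABLE.search(jar_name)
    if not m:
        return False  # log4j 1.x, non-core jars, or not log4j at all
    minor = int(m.group(1))
    return minor <= 14  # fixed in 2.15.0 (and hardened further in 2.16/2.17)
```

This only inspects file names; a real audit would also check shaded/uber jars and class fingerprints.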



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos

2022-08-26 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585253#comment-17585253
 ] 

Qian Zhang commented on MESOS-10234:


According to [https://blogs.apache.org/security/entry/cve-2021-44228], it seems 
ZooKeeper is not affected by 
[CVE-2021-44228|https://www.cve.org/CVERecord?id=CVE-2021-44228].






[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-22 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367313#comment-17367313
 ] 

Qian Zhang commented on MESOS-10224:


[~surahman] I think it has been fixed in 
[https://github.com/apache/mesos/pull/384] by [~cf.natali] recently. Can you 
please pull the latest Mesos code and try again?

> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
> -
>
> Key: MESOS-10224
> URL: https://issues.apache.org/jira/browse/MESOS-10224
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.11.0
>Reporter: Saad Ur Rahman
>Priority: Major
>
> *OS:* Ubuntu 21.04
> *Command:*
> {code:java}
> make -j 6 V=0 check{code}
> Fails during the build and test suite run on two different machines with the 
> same OS.
> {code:java}
> 3: [   OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms)
> 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest 
> (27011 ms total)
> 3: 
> 3: [--] Global test environment tear-down
> 3: [==] 575 tests from 178 test cases ran. (202572 ms total)
> 3: [  PASSED  ] 573 tests.
> 3: [  FAILED  ] 2 tests, listed below:
> 3: [  FAILED  ] LdcacheTest.Parse
> 3: [  FAILED  ] 
> CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where 
> GetParam() = "v0"
> 3: 
> 3:  2 FAILED TESTS
> 3:   YOU HAVE 34 DISABLED TESTS
> 3: 
> 3: 
> 3: 
> 3: [FAIL]: 4 shard(s) have failed tests
> 3/3 Test #3: MesosTests ...***Failed  1173.43 sec
> {code}
> Are there any prerequisites required to get the build/tests to pass? I am 
> trying to get all the tests to pass to make sure my build environment is 
> set up correctly for development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-21 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366893#comment-17366893
 ] 

Qian Zhang edited comment on MESOS-10224 at 6/21/21, 11:34 PM:
---

[~surahman] Thanks for reporting the issue!

Can you please run the following command to get the detailed error messages for 
the failed tests?
{code:java}
sudo ./bin/mesos-tests.sh --gtest_filter="" --verbose{code}


was (Author: qianzhang):
[~surahman] Thanks for reporting the issue!

Can you please run the following command to get the detailed error messages for 
the failed test?
{code:java}
sudo ./bin/mesos-tests.sh --gtest_filter="" --verbose{code}






[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-21 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366893#comment-17366893
 ] 

Qian Zhang commented on MESOS-10224:


[~surahman] Thanks for reporting the issue!

Can you please run the following command to get the detailed error messages for 
the failed test?
{code:java}
sudo ./bin/mesos-tests.sh --gtest_filter="" --verbose{code}






[jira] [Assigned] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses

2021-06-13 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10222:
--

  Assignee: Charles Natali
Resolution: Fixed

> Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
> ---
>
> Key: MESOS-10222
> URL: https://issues.apache.org/jira/browse/MESOS-10222
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Charles Natali
>Priority: Minor
> Attachments: config.log
>
>
> I am trying to build Mesos master but it fails with:
>  
> {code:java}
>  In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38,
>  from 
> ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12,
>  from ../../3rdparty/stout/include/stout/uuid.hpp:21,
>  from ../../include/mesos/type_utils.hpp:36,
>  from ../../src/master/flags.cpp:18:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27,
>  from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30,
>  from ../../3rdparty/stout/include/stout/numify.hpp:19,
>  from ../../3rdparty/stout/include/stout/duration.hpp:29,
>  from ../../3rdparty/libprocess/include/process/time.hpp:18,
>  from ../../3rdparty/libprocess/include/process/clock.hpp:18,
>  from ../../3rdparty/libprocess/include/process/future.hpp:29,
>  from 
> ../../include/mesos/authentication/secret_generator.hpp:22,
>  from ../../src/local/local.cpp:24:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11,
>  from ../../include/mesos/resources.hpp:27,
>  from ../../src/master/master.hpp:31,
>  from ../../src/master/framework.cpp:17:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from 

[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses

2021-06-10 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360932#comment-17360932
 ] 

Qian Zhang commented on MESOS-10222:


Not yet, we are still working on [https://github.com/apache/mesos/pull/392].


[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2021-06-10 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360928#comment-17360928
 ] 

Qian Zhang commented on MESOS-8400:
---

I see there are still two patches not merged yet:

[https://reviews.apache.org/r/71384]
[https://reviews.apache.org/r/71385]

[~bbannier] Can you please comment? Do we still need these two patches?

> Handle plugin crashes gracefully in SLRP recovery.
> --
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its 
> corresponding {{csi::Client}} service future. However, if a CSI call races 
> with a plugin crash, the call may be issued before the service future is 
> reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses 
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP 
> recovery path, e.g., {{ListVolume}}, {{GetCapacity}}, {{Probe}}, could make 
> the SLRP unrecoverable.
> There are two main issues:
> 1. For {{Probe}}, we should investigate whether a few retry attempts are 
> needed; after they are exhausted, we should recover from the failed attempts 
> (e.g., kill the plugin container) and have the container daemon relaunch the 
> plugin instead of failing the daemon.
> 2. For other calls in the recovery path, we should either retry the call, or 
> make the local resource provider daemon able to restart the SLRP after it 
> fails.
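The retry-then-recover behavior proposed for {{Probe}} above can be sketched as a small helper. This is an illustrative Python sketch only; the names (`call_with_retries`, `on_give_up`) are invented here and are not Mesos APIs:

```python
import time

def call_with_retries(call, attempts=3, backoff=1.0, on_give_up=None):
    """Illustrative retry loop: try a call a few times with exponential
    backoff; if all attempts fail, run a recovery action (e.g. kill the
    plugin container so the container daemon relaunches it) instead of
    failing the whole daemon, then re-raise the last error."""
    last_error = None
    for i in range(attempts):
        try:
            return call()
        except Exception as e:  # a real SLRP would catch specific RPC errors
            last_error = e
            time.sleep(backoff * (2 ** i))  # exponential backoff between tries
    if on_give_up is not None:
        on_give_up(last_error)
    raise last_error
```

A caller would pass the CSI `Probe` invocation as `call` and the container-kill action as `on_give_up`.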





[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cache

2021-05-31 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354466#comment-17354466
 ] 

Qian Zhang commented on MESOS-10220:


Resolved by https://github.com/apache/mesos/pull/384.

> ldcache::parse failed to parse newer ld.so.cache
> 
>
> Key: MESOS-10220
> URL: https://issues.apache.org/jira/browse/MESOS-10220
> Project: Mesos
>  Issue Type: Bug
>Reporter: Minh H.G.
>Assignee: Charles Natali
>Priority: Minor
>
> In glibc 2.31, the ld.so.cache file no longer supports the old format (the 
> one starting with "ld.so-1.7.0").
> That causes ldcache::parse to fail, so Mesos cannot start.
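For reference, the legacy cache begins with the magic string `ld.so-1.7.0`, while the new-format cache emitted by glibc >= 2.31 begins with `glibc-ld.so.cache`. A hedged Python sketch of the header check a parser needs before committing to a layout — the function name is ours, and real parsing involves far more than the magic bytes:

```python
OLD_MAGIC = b"ld.so-1.7.0"        # header of the legacy ld.so.cache format
NEW_MAGIC = b"glibc-ld.so.cache"  # header of the new-format cache (glibc >= 2.31 emits only this)

def cache_format(header_bytes):
    """Classify an ld.so.cache header. A parser that only understands
    OLD_MAGIC (as the failing ldcache::parse did) cannot handle caches
    written by glibc >= 2.31, which drop the old format entirely."""
    if header_bytes.startswith(OLD_MAGIC):
        return "old"
    if header_bytes.startswith(NEW_MAGIC):
        return "new"
    return "unknown"
```

The fix in pull/384 amounts to teaching the parser to recognize and walk the new-format header as well.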





[jira] [Commented] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support

2020-10-12 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212793#comment-17212793
 ] 

Qian Zhang commented on MESOS-10192:


commit 301902be4f1332799cf3b3242cd29b4907c21c09
Author: Qian Zhang 
Date: Sat Oct 10 15:04:57 2020 +0800

Ignored the directory `/dev/nvidia-caps` when globbing Nvidia GPU devices.
 
 The directory `/dev/nvidia-caps` was introduced in CUDA 11.0, just
 ignore it since we only care about the Nvidia GPU device files.
 
 Review: https://reviews.apache.org/r/72945
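A rough Python sketch of the globbing fix the commit describes: glob `/dev/nvidia*` but skip any directory such as `/dev/nvidia-caps`, keeping only device files. The function name is invented for illustration; the real fix lives in the C++ GPU isolator:

```python
import glob
import os

def nvidia_device_files(dev_dir="/dev"):
    """Glob Nvidia GPU device entries but keep only files. CUDA 11.0
    introduced the directory /dev/nvidia-caps, which must be skipped
    because only the device files (nvidia0, nvidiactl, ...) matter."""
    entries = glob.glob(os.path.join(dev_dir, "nvidia*"))
    return sorted(e for e in entries if not os.path.isdir(e))
```

Without the `isdir` filter, code that later stats each match as a device file fails on `/dev/nvidia-caps`, which is exactly the "Not a special file" error in the issue.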

> Recent Nvidia CUDA changes break Mesos GPU support
> --
>
> Key: MESOS-10192
> URL: https://issues.apache.org/jira/browse/MESOS-10192
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, gpu
>Reporter: Greg Mann
>Assignee: Qian Zhang
>Priority: Major
>  Labels: GPU, containerization, containerizer, gpu
>
> Recently it seems that the layout of the Nvidia device files has changed:  
> https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
> This prevents GPU tasks from launching:
> {noformat}
> W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container 
> c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: 
> Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a 
> special file: /dev/nvidia-caps
> {noformat}
> due to this code, which detects the nvidia device files: 
> https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454





[jira] [Commented] (MESOS-10157) Add documentation for the `volume/csi` isolator

2020-10-12 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212784#comment-17212784
 ] 

Qian Zhang commented on MESOS-10157:


commit 3e1e0b37d6a30a2c98d1227b4ac754b1d26686f3
Author: Qian Zhang 
Date: Wed Sep 9 10:26:52 2020 +0800

Added doc for the `volume/csi` isolator.
 
 Review: https://reviews.apache.org/r/72845

> Add documentation for the `volume/csi` isolator
> ---
>
> Key: MESOS-10157
> URL: https://issues.apache.org/jira/browse/MESOS-10157
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: docs, documentation
>






[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`

2020-10-12 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212783#comment-17212783
 ] 

Qian Zhang commented on MESOS-10151:


commit 90e5434544da9886cd6f2d87b73e3246292af107
Author: Qian Zhang 
Date: Tue Oct 13 09:58:44 2020 +0800

Corrected the example of the managed CSI plugin.
 
 Review: https://reviews.apache.org/r/72846

> Introduce a new agent flag `--csi_plugin_config_dir`
> 
>
> Key: MESOS-10151
> URL: https://issues.apache.org/jira/browse/MESOS-10151
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>






[jira] [Assigned] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support

2020-10-10 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10192:
--

Assignee: Qian Zhang

RR: https://reviews.apache.org/r/72945/






[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator

2020-09-29 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203728#comment-17203728
 ] 

Qian Zhang commented on MESOS-10153:


commit 8700dd8d5ece658804d7b7a40863800dcc5c72bc
Author: Qian Zhang 
Date: Sat Sep 19 11:11:04 2020 +0800

Inferred CSI volume's `readonly` field from volume mode.
 
 Review: https://reviews.apache.org/r/72888

> Implement the `prepare` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10153
> URL: https://issues.apache.org/jira/browse/MESOS-10153
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>






[jira] [Comment Edited] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-27 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202779#comment-17202779
 ] 

Qian Zhang edited comment on MESOS-10190 at 9/27/20, 8:46 AM:
--

[~acecile555] Yes, we set the container's hostname to its container ID (in 
UUID format) by writing the container ID into the `/etc/hostname` file in the 
container's mount namespace, and we also write the line `container-IP    
container-ID` into the container's `/etc/hosts`, so libprocess should usually 
be able to get the container's IP.

I'd suggest checking whether the `/etc/hostname` and `/etc/hosts` files are 
correctly written by Mesos for your containers; you can use gdb to start or 
attach to the Mesos agent and step into [this 
method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997]
 to check whether those files are correctly updated.
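To illustrate the lookup path described above: once Mesos has written the `container-IP container-ID` line, resolving the container-ID hostname amounts to finding that line in the container's `/etc/hosts`. A simplified Python sketch — the helper name and the sample data in the comments are ours, not Mesos code:

```python
def ip_from_hosts(hosts_content, hostname):
    """Find the IP an /etc/hosts file maps a hostname to, mimicking what
    a resolver does for the container-ID hostname Mesos writes, e.g. a
    line like '10.0.1.5  7c4beac7-5385-4dfa-845a-beb01e13c77c'."""
    for line in hosts_content.splitlines():
        line = line.split("#", 1)[0]  # strip trailing comments
        fields = line.split()
        # fields[0] is the IP; any later field is a name/alias for it.
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[0]
    return None  # the failure mode in this ticket: no matching line
```

If this returns `None` for the container ID, the DNS fallback in libprocess is what produces the "Failed to obtain the IP address" exit seen in the logs.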


was (Author: qianzhang):
[~acecile555] Yes, we set container's hostname to its container ID (in UUID 
format) by writing the container ID into the `/etc/hostname` file in 
container's mount namespace and also write `container-IP    container-ID` into 
container's `/etc/hosts`, so usually libprocess should be able to get the 
container's IP.

I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are 
correctly written by Mesos for your containers, you can use gdb to start or 
attach Mesos agent and step into [this 
method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997]
 to check if those files are correctly updated.

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Priority: Major
>
> Hello,
>  
> We deployed CNI support, and 3 of our hosts (all identical) are failing to 
> start containers with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried enforcing LIBPROCESS_IP using an env variable, but I saw Mesos 
> overwrites it. So I rebuilt Mesos with additional debugging, and here is the 
> log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run UUID 
> instead of the hostname; here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well actually this is perfectly fine, except "gethostname" returns the 
> container UUID instead of a valid host IP address. How is that even 
> possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.





[jira] [Comment Edited] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-27 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202779#comment-17202779
 ] 

Qian Zhang edited comment on MESOS-10190 at 9/27/20, 8:45 AM:
--

[~acecile555] Yes, we set container's hostname to its container ID (in UUID 
format) by writing the container ID into the `/etc/hostname` file in 
container's mount namespace and also write `container-IP    container-ID` into 
container's `/etc/hosts`, so usually libprocess should be able to get the 
container's IP.

I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are 
correctly written by Mesos for your containers, you can use gdb to start or 
attach Mesos agent and step into [this 
method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997]
 to check if those files are correctly updated.


was (Author: qianzhang):
[~acecile555] Yes, we will set container's hostname to its container ID (in 
UUID format) by writing the container ID into the `/etc/hostname` file in 
container's mount namespace and also write `container-IP    container-ID` into 
container's `/etc/hosts`, so usually libprocess should be able to get the 
container's IP.

I'd suggest to check if the `/etc/hostname` and `/etc/hosts` files are 
correctly written by Mesos for your containers, you can use gdb to start or 
attach Mesos agent and step into [this 
method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997]
 to check if those files are correctly updated.

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Priority: Major
>
> Hello,
>  
> We deployed CNI support and 3 of our hosts (all the same) are failing to 
> start container with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried enforcing LIBPROCESS_IP using an env variable, but I saw that Mesos 
> overwrites it. So I rebuilt Mesos with additional debugging, and here is the 
> log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run UUID 
> instead of the hostname; here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well, actually this is perfectly fine, except that "gethostname" returns the 
> container UUID instead of a valid host IP address. How is that even possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-27 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202779#comment-17202779
 ] 

Qian Zhang commented on MESOS-10190:


[~acecile555] Yes, we will set the container's hostname to its container ID (in 
UUID format) by writing the container ID into the `/etc/hostname` file in the 
container's mount namespace, and we also write a `container-IP    container-ID` 
entry into the container's `/etc/hosts`, so libprocess should usually be able 
to get the container's IP.

I'd suggest checking whether the `/etc/hostname` and `/etc/hosts` files are 
correctly written by Mesos for your containers. You can use gdb to start or 
attach to the Mesos agent and step into [this 
method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997]
 to check whether those files are correctly updated.
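The mechanism described above can be sketched as follows. This is a minimal illustration in Python (not Mesos code); the function name is invented, and a temporary directory stands in for the container's real `/etc` in its mount namespace:

```python
import os
import tempfile

def write_container_network_files(etc_dir, container_id, container_ip):
    # Hypothetical sketch of what the CNI isolator writes: the container
    # ID becomes the hostname, and /etc/hosts maps the container IP back
    # to that ID, so that the gethostname() output resolves locally.
    with open(os.path.join(etc_dir, "hostname"), "w") as f:
        f.write(container_id + "\n")
    with open(os.path.join(etc_dir, "hosts"), "w") as f:
        f.write("127.0.0.1\tlocalhost\n")
        f.write(container_ip + "\t" + container_id + "\n")

# Quick self-check in a temp dir standing in for the container's /etc:
etc = tempfile.mkdtemp()
write_container_network_files(
    etc, "7c4beac7-5385-4dfa-845a-beb01e13c77c", "10.99.50.3")
hostname = open(os.path.join(etc, "hostname")).read().strip()
assert hostname in open(os.path.join(etc, "hosts")).read()
```

If the hostname value is missing from `hosts` (or the files were never written), name resolution fails exactly as in the reported "Failed to obtain the IP address" error.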

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Priority: Major
>
> Hello,
>  
> We deployed CNI support, and 3 of our hosts (all identical) are failing to 
> start containers with CNI enabled. The log file shows:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried enforcing LIBPROCESS_IP using an env variable, but I saw that Mesos 
> overwrites it. So I rebuilt Mesos with additional debugging, and here is the 
> log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run UUID 
> instead of the hostname; here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well, actually this is perfectly fine, except that "gethostname" returns the 
> container UUID instead of a valid host IP address. How is that even possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10157) Add documentation for the `volume/csi` isolator

2020-09-08 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10157:
--

Assignee: Qian Zhang  (was: Greg Mann)

RR: https://reviews.apache.org/r/72845/

> Add documentation for the `volume/csi` isolator
> ---
>
> Key: MESOS-10157
> URL: https://issues.apache.org/jira/browse/MESOS-10157
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: docs, documentation
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator

2020-09-01 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188949#comment-17188949
 ] 

Qian Zhang commented on MESOS-10153:


commit a16f3439dca13982bb4a2b9190c24aaf4eb73b0e
Author: Qian Zhang 
Date: Tue Sep 1 20:58:35 2020 +0800

Moved the `volume/csi` isolator's root dir under work dir.
 
 The `volume/csi` isolator needs to checkpoint CSI volume state under
 the work dir rather than the runtime dir, to be consistent with what
 the volume manager does. Otherwise, after the agent host is rebooted,
 the volume manager may publish some volumes during recovery, and those
 volumes will never get a chance to be unpublished, since the
 `volume/csi` isolator does not know about those volumes at all (the
 contents of the runtime dir will be gone after the reboot).
 
 Review: https://reviews.apache.org/r/72829
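The rationale above amounts to a simple path-selection rule: state that must outlive a reboot goes under the work dir, ephemeral state may live in the runtime dir. A sketch follows; the directory names are illustrative assumptions, not Mesos's actual on-disk layout:

```python
import os

def csi_volume_state_path(work_dir, volume_id):
    # State that must survive a host reboot: checkpoint under the agent
    # work dir, which lives on persistent storage.
    return os.path.join(work_dir, "csi", "volumes", volume_id, "volume.state")

def csi_plugin_endpoint_path(runtime_dir, plugin_name):
    # Ephemeral state (e.g. a plugin's socket): the runtime dir is fine,
    # since it is recreated after reboot anyway (it is often on tmpfs).
    return os.path.join(runtime_dir, "csi", plugin_name, "endpoint.sock")
```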

> Implement the `prepare` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10153
> URL: https://issues.apache.org/jira/browse/MESOS-10153
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC

2020-08-31 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188157#comment-17188157
 ] 

Qian Zhang commented on MESOS-10182:


[~QuellaZhang] Do you still see that build failure after editing the file 
`src/CMakeLists.txt` as I suggested? Or has the failure disappeared, so that 
you can now build the Mesos code on Windows successfully?

> Mesos failed to build due to error C1083: Cannot open include file: 
> 'csi/state.pb.h': No such file or directory on windows with MSVC
> 
>
> Key: MESOS-10182
> URL: https://issues.apache.org/jira/browse/MESOS-10182
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
>Reporter: QuellaZhang
>Priority: Major
> Attachments: build.log
>
>
> Hi All,
> I tried to build Mesos on Windows with VS2019. The build failed with 
> error C1083: Cannot open include file: 'csi/state.pb.h': No such file or 
> directory on Windows using MSVC. It can be reproduced at the latest revision 
> d4634f4 on the master branch. Could you please take a look at this issue? Thanks 
> a lot!
>  
> *Reproduce steps:*
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\gitP\apache\mesos
> 2. Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
> 3. mkdir build_amd64 && pushd build_amd64
> 4. cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL%
> 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
> *ErrorMessage:*
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator

2020-08-31 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188074#comment-17188074
 ] 

Qian Zhang commented on MESOS-10153:


commit 17f28563488ddaeb2daa60b53bd8dc19e25cddef
Author: Qian Zhang 
Date: Wed Aug 26 10:33:26 2020 +0800

Enabled CSI volume access for non-root users.
 
 Review: https://reviews.apache.org/r/72804

> Implement the `prepare` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10153
> URL: https://issues.apache.org/jira/browse/MESOS-10153
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes

2020-08-31 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188072#comment-17188072
 ] 

Qian Zhang commented on MESOS-10150:


commit ea4099028cfe93e1e2fd80e4d30e03057ec27df1
Author: Qian Zhang 
Date: Sun Aug 30 10:23:06 2020 +0800

Relaxed unknown volume check when unpublishing volumes.
 
 Review: https://reviews.apache.org/r/72820

> Refactor CSI volume manager to support pre-provisioned CSI volumes
> --
>
> Key: MESOS-10150
> URL: https://issues.apache.org/jira/browse/MESOS-10150
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
> Fix For: 1.11.0
>
>
> The existing 
> [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138]
>  is like a wrapper around various CSI gRPC calls, so we could consider leveraging 
> it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` 
> isolator. But there is a problem: the lifecycle of the volumes managed by 
> VolumeManager starts from the 
> `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]`
>  CSI call, while what we plan to support in the MVP is pre-provisioned volumes, so 
> we need to refactor VolumeManager to make it support pre-provisioned 
> volumes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC

2020-08-30 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187407#comment-17187407
 ] 

Qian Zhang edited comment on MESOS-10182 at 8/31/20, 3:18 AM:
--

[~QuellaZhang] Can you please check out the latest code of the Mesos master branch 
and manually edit the file `src/CMakeLists.txt` by moving the line 
`slave/csi_server.cpp` from line 154 to line 212 (i.e., under the line 
`slave/containerizer/mesos/provisioner/utils.cpp`), and then try again?


was (Author: qianzhang):
[~QuellaZhang] Can you please check out the latest code of the Mesos master branch 
and manually move the line `slave/csi_server.cpp` from line 154 to line 212 
(i.e., under the line `slave/containerizer/mesos/provisioner/utils.cpp`), and then 
try again?

> Mesos failed to build due to error C1083: Cannot open include file: 
> 'csi/state.pb.h': No such file or directory on windows with MSVC
> 
>
> Key: MESOS-10182
> URL: https://issues.apache.org/jira/browse/MESOS-10182
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
>Reporter: QuellaZhang
>Priority: Major
> Attachments: build.log
>
>
> Hi All,
> I tried to build Mesos on Windows with VS2019. The build failed with 
> error C1083: Cannot open include file: 'csi/state.pb.h': No such file or 
> directory on Windows using MSVC. It can be reproduced at the latest revision 
> d4634f4 on the master branch. Could you please take a look at this issue? Thanks 
> a lot!
>  
> *Reproduce steps:*
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\gitP\apache\mesos
> 2. Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
> 3. mkdir build_amd64 && pushd build_amd64
> 4. cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL%
> 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
> *ErrorMessage:*
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10182) Mesos failed to build due to error C1083: Cannot open include file: 'csi/state.pb.h': No such file or directory on windows with MSVC

2020-08-30 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187407#comment-17187407
 ] 

Qian Zhang commented on MESOS-10182:


[~QuellaZhang] Can you please check out the latest code of the Mesos master branch 
and manually move the line `slave/csi_server.cpp` from line 154 to line 212 
(i.e., under the line `slave/containerizer/mesos/provisioner/utils.cpp`), and then 
try again?

> Mesos failed to build due to error C1083: Cannot open include file: 
> 'csi/state.pb.h': No such file or directory on windows with MSVC
> 
>
> Key: MESOS-10182
> URL: https://issues.apache.org/jira/browse/MESOS-10182
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
>Reporter: QuellaZhang
>Priority: Major
> Attachments: build.log
>
>
> Hi All,
> I tried to build Mesos on Windows with VS2019. The build failed with 
> error C1083: Cannot open include file: 'csi/state.pb.h': No such file or 
> directory on Windows using MSVC. It can be reproduced at the latest revision 
> d4634f4 on the master branch. Could you please take a look at this issue? Thanks 
> a lot!
>  
> *Reproduce steps:*
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\gitP\apache\mesos
> 2. Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
> 3. mkdir build_amd64 && pushd build_amd64
> 4. cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5. set CL=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING %CL%
> 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
> *ErrorMessage:*
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\csi_server.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher_tracker.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\mesos\launcher.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\containerizer\composing.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file F:\gitP\apache\mesos\src\slave\slave.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
> F:\gitP\apache\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot open 
> include file: 'csi/state.pb.h': No such file or directory 
> (d:\agent\_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\p0prepro.c:1969) 
> (compiling source file 
> F:\gitP\apache\mesos\src\slave\task_status_update_manager.cpp) 
> [F:\gitP\apache\mesos\build_amd64\src\mesos.vcxproj]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10148) Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins

2020-08-24 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183662#comment-17183662
 ] 

Qian Zhang commented on MESOS-10148:


commit 2d2265de7df7801612fc2f104f9c8f455a97a1fd
Author: Qian Zhang 
Date: Thu Aug 20 17:08:32 2020 +0800

Introduced the `CSIPluginInfo.target_path_exists` field.
 
 Review: https://reviews.apache.org/r/72788

> Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI 
> plugins
> 
>
> Key: MESOS-10148
> URL: https://issues.apache.org/jira/browse/MESOS-10148
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>
> See 
> [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#bookmark=id.x6m8mytigrg7]
>  for the detailed design.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator

2020-08-24 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183661#comment-17183661
 ] 

Qian Zhang commented on MESOS-10155:


commit d8647b018fbcfc38ccf0e39bfeae9118e275068f
Author: Qian Zhang 
Date: Thu Aug 20 17:09:36 2020 +0800

Refactored state recovery in `volume/csi` isolator.
 
 Read the checkpointed CSI volume state directly as a protobuf message.
 
 Review: https://reviews.apache.org/r/72789

> Implement the `recover` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10155
> URL: https://issues.apache.org/jira/browse/MESOS-10155
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes

2020-08-17 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179313#comment-17179313
 ] 

Qian Zhang commented on MESOS-10150:


commit 014431e3c1b98e514e327318b52e5c54cc6174df
Author: Qian Zhang 
Date: Mon Aug 17 19:22:48 2020 +0800

Updated volume manager to support user specified target path root.
 
 Review: https://reviews.apache.org/r/72781

> Refactor CSI volume manager to support pre-provisioned CSI volumes
> --
>
> Key: MESOS-10150
> URL: https://issues.apache.org/jira/browse/MESOS-10150
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
> Fix For: 1.11.0
>
>
> The existing 
> [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138]
>  is like a wrapper around various CSI gRPC calls, so we could consider leveraging 
> it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` 
> isolator. But there is a problem: the lifecycle of the volumes managed by 
> VolumeManager starts from the 
> `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]`
>  CSI call, while what we plan to support in the MVP is pre-provisioned volumes, so 
> we need to refactor VolumeManager to make it support pre-provisioned 
> volumes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`

2020-08-13 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177405#comment-17177405
 ] 

Qian Zhang commented on MESOS-10151:


commit 831f172de7908ad8e40d14905cacb3a9c053e832
Author: Qian Zhang 
Date: Thu Aug 13 16:37:48 2020 +0800

Updated the help message of the agent flag `--csi_plugin_config_dir`.
 
 This is to make the help message of the agent flag `--csi_plugin_config_dir`
 aligned with the latest protobuf message `CSIPluginInfo`.
 
 Review: https://reviews.apache.org/r/72770

> Introduce a new agent flag `--csi_plugin_config_dir`
> 
>
> Key: MESOS-10151
> URL: https://issues.apache.org/jira/browse/MESOS-10151
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10175) Improve CSI service manager to set node ID for managed CSI plugins

2020-08-12 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10175:
--

 Summary: Improve CSI service manager to set node ID for managed 
CSI plugins
 Key: MESOS-10175
 URL: https://issues.apache.org/jira/browse/MESOS-10175
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang
Assignee: Qian Zhang


For some CSI plugins (like the NFS CSI plugin), the node service needs a node ID 
specified by the container orchestrator (see 
[here|https://github.com/kubernetes-csi/csi-driver-nfs/blob/d94b64bbb3171a45dd91f8686611a062c0dd6219/deploy/kubernetes/csi-nodeplugin-nfsplugin.yaml#L49]
 for an example), so we need to improve our CSI service manager to set it when 
launching managed CSI plugins.
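As a rough illustration of what "setting the node ID" could mean when the service manager builds the plugin container's command, a sketch follows. The binary path and flag names mirror the linked NFS plugin example and are assumptions about that plugin, not a Mesos API:

```python
def csi_node_plugin_command(node_id, endpoint="unix:///csi/csi.sock"):
    # Hypothetical argv for a managed CSI node plugin: the container
    # orchestrator (here, the Mesos agent) injects the node ID it chose.
    return [
        "/nfsplugin",             # assumed plugin binary inside the container
        "--nodeid=" + node_id,    # node ID assigned by the orchestrator
        "--endpoint=" + endpoint, # CSI gRPC endpoint the plugin will serve
    ]

# E.g. use the agent ID as the node ID:
argv = csi_node_plugin_command("c7f29ed2-agent-S1")
assert "--nodeid=c7f29ed2-agent-S1" in argv
```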



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-08-09 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10156:
--

Assignee: Qian Zhang

> Enable the `volume/csi` isolator in UCR
> ---
>
> Key: MESOS-10156
> URL: https://issues.apache.org/jira/browse/MESOS-10156
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator

2020-08-09 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10155:
--

Assignee: Qian Zhang

> Implement the `recover` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10155
> URL: https://issues.apache.org/jira/browse/MESOS-10155
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10154) Implement the `cleanup` method of the `volume/csi` isolator

2020-08-05 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10154:
--

Assignee: Qian Zhang

> Implement the `cleanup` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10154
> URL: https://issues.apache.org/jira/browse/MESOS-10154
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls

2020-07-23 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10163:
--

 Summary: Implement a new component to launch CSI plugins as 
standalone containers and make CSI gRPC calls
 Key: MESOS-10163
 URL: https://issues.apache.org/jira/browse/MESOS-10163
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang
Assignee: Greg Mann


*Background:*

Originally we wanted the `volume/csi` isolator to leverage the existing [service 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
 to launch CSI plugins as standalone containers. Currently the service manager 
needs to call the following agent HTTP APIs:
 # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
 # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone 
containers in its `recover` method.
 # `LAUNCH_CONTAINER` via the existing 
[ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
 to launch a CSI plugin as a standalone container when its `getEndpoint` method 
is called.

The problem with the above design is that the `volume/csi` isolator may need to 
clean up orphan containers during agent recovery, which is triggered by the 
containerizer (see 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
 for details). To clean up an orphan container which is using a CSI volume, the 
`volume/csi` isolator needs to instantiate and recover the service manager and 
get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` 
method will be called by the `volume/csi` isolator during agent recovery). And 
as mentioned above, the service manager’s `getEndpoint` may need to call 
`LAUNCH_CONTAINER` to launch the CSI plugin as a standalone container; since the 
agent is still in the recovering state, such an agent HTTP call will simply be 
rejected by the agent. So we have to instantiate and recover the service manager 
*after agent recovery is done*, but in the `volume/csi` isolator we do not have 
that information (i.e. the signal that agent recovery is done).

 

*Solution*

We need to implement a new component (like `CSIVolumeManager` or a better 
name?) in Mesos agent which is responsible for launching CSI plugins as 
standalone containers (via the existing [service 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
 and making CSI gRPC calls (via the existing [volume 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
 * We can instantiate this new component in the `main` method of the agent and 
pass it to both the containerizer and the agent (i.e. it will be a member of the 
`Slave` object); the containerizer will in turn pass it to the `volume/csi` 
isolator.
 * Since this new component relies on the service manager, which calls agent 
HTTP APIs, we need to pass the agent URL to it, e.g. `process::http::URL(scheme, 
agentIP, agentPort, agentLibprocessId + "/api/v1")`; see 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
 for an example.
 * When the agent registers/reregisters with the master (`Slave::registered` and 
`Slave::reregistered`), we should call this new component’s `start` method (see 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
 and 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
 as examples), which will scan the directory specified by 
`--csi_plugin_config_dir` and create a `service manager - volume manager` pair 
for each CSI plugin loaded from that directory.
 * The `volume/csi` isolator needs to call this new component’s `publishVolume` 
and `unpublishVolume` methods in its `prepare` and `cleanup` methods.

In the case of cleaning up orphan containers during agent recovery, the 
`volume/csi` isolator will just call this new component’s `unpublishVolume` 
method as usual, and it is this new component’s responsibility to make the 
actual CSI gRPC call only after agent recovery is done and the agent has 
registered with the master (i.e., after this new component’s `start` method has 
been called).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10152) Implement the `create` method of the `volume/csi` isolator

2020-07-17 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10152:
--

Assignee: Qian Zhang

> Implement the `create` method of the `volume/csi` isolator
> --
>
> Key: MESOS-10152
> URL: https://issues.apache.org/jira/browse/MESOS-10152
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>






[jira] [Assigned] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`

2020-07-13 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10151:
--

Assignee: Qian Zhang

> Introduce a new agent flag `--csi_plugin_config_dir`
> 
>
> Key: MESOS-10151
> URL: https://issues.apache.org/jira/browse/MESOS-10151
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>






[jira] [Assigned] (MESOS-10149) Refactor CSI service manager to support unmanaged CSI plugins

2020-07-09 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10149:
--

Assignee: Qian Zhang

> Refactor CSI service manager to support unmanaged CSI plugins
> -
>
> Key: MESOS-10149
> URL: https://issues.apache.org/jira/browse/MESOS-10149
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> Refactor the [CSI service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L81]
>  to support unmanaged plugins (i.e. plugins deployed outside of Mesos) and 
> make its `getServiceEndpoint` method also able to return unmanaged plugins' 
> endpoints.





[jira] [Assigned] (MESOS-10148) Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins

2020-07-08 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10148:
--

Assignee: Qian Zhang

RR: [https://reviews.apache.org/r/72661/]

> Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI 
> plugins
> 
>
> Key: MESOS-10148
> URL: https://issues.apache.org/jira/browse/MESOS-10148
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> See 
> [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#bookmark=id.x6m8mytigrg7]
>  for the detailed design.





[jira] [Assigned] (MESOS-10147) Introduce a new volume type `CSI` into the `Volume` protobuf message

2020-07-08 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10147:
--

Assignee: Qian Zhang

RR: [https://reviews.apache.org/r/72660/]

> Introduce a new volume type `CSI` into the `Volume` protobuf message
> 
>
> Key: MESOS-10147
> URL: https://issues.apache.org/jira/browse/MESOS-10147
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> See 
> [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.l7wa1w8789pg]
>  for the detailed design.





[jira] [Created] (MESOS-10157) Add document for the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10157:
--

 Summary: Add document for the `volume/csi` isolator
 Key: MESOS-10157
 URL: https://issues.apache.org/jira/browse/MESOS-10157
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10156:
--

 Summary: Enable the `volume/csi` isolator in UCR
 Key: MESOS-10156
 URL: https://issues.apache.org/jira/browse/MESOS-10156
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10155:
--

 Summary: Implement the `recover` method of the `volume/csi` 
isolator
 Key: MESOS-10155
 URL: https://issues.apache.org/jira/browse/MESOS-10155
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10154) Implement the `cleanup` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10154:
--

 Summary: Implement the `cleanup` method of the `volume/csi` 
isolator
 Key: MESOS-10154
 URL: https://issues.apache.org/jira/browse/MESOS-10154
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10153:
--

 Summary: Implement the `prepare` method of the `volume/csi` 
isolator
 Key: MESOS-10153
 URL: https://issues.apache.org/jira/browse/MESOS-10153
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10152) Implement the `create` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10152:
--

 Summary: Implement the `create` method of the `volume/csi` isolator
 Key: MESOS-10152
 URL: https://issues.apache.org/jira/browse/MESOS-10152
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`

2020-07-07 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152554#comment-17152554
 ] 

Qian Zhang commented on MESOS-10151:


See 
[here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.iobmmefa9bop]
 for the detailed design.

> Introduce a new agent flag `--csi_plugin_config_dir`
> 
>
> Key: MESOS-10151
> URL: https://issues.apache.org/jira/browse/MESOS-10151
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Priority: Major
>






[jira] [Created] (MESOS-10151) Implement the `create` method of the `volume/csi` isolator

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10151:
--

 Summary: Implement the `create` method of the `volume/csi` isolator
 Key: MESOS-10151
 URL: https://issues.apache.org/jira/browse/MESOS-10151
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang








[jira] [Created] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10150:
--

 Summary: Refactor CSI volume manager to support pre-provisioned 
CSI volumes
 Key: MESOS-10150
 URL: https://issues.apache.org/jira/browse/MESOS-10150
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


The existing 
[VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138]
 is like a wrapper for various CSI gRPC calls; we could consider leveraging it 
to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` 
isolator. But there is a problem: the lifecycle of the volumes managed by 
VolumeManager starts from the 
`[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]`
 CSI call, whereas what we plan to support in the MVP is pre-provisioned 
volumes, so we need to refactor VolumeManager to make it support 
pre-provisioned volumes.





[jira] [Created] (MESOS-10149) Refactor CSI service manager to support unmanaged CSI plugins

2020-07-07 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10149:
--

 Summary: Refactor CSI service manager to support unmanaged CSI 
plugins
 Key: MESOS-10149
 URL: https://issues.apache.org/jira/browse/MESOS-10149
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


Refactor the [CSI service 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L81]
 to support unmanaged plugins (i.e. plugins deployed outside of Mesos) and make 
its `getServiceEndpoint` method also able to return unmanaged plugins' 
endpoints.





[jira] [Created] (MESOS-10148) Update the `CSIPluginInfo` protobuf message for supporting 3rd party CSI plugins

2020-07-06 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10148:
--

 Summary: Update the `CSIPluginInfo` protobuf message for 
supporting 3rd party CSI plugins
 Key: MESOS-10148
 URL: https://issues.apache.org/jira/browse/MESOS-10148
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


See 
[here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#bookmark=id.x6m8mytigrg7]
 for the detailed design.





[jira] [Created] (MESOS-10147) Introduce a new volume type `CSI` into the `Volume` protobuf message

2020-07-06 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10147:
--

 Summary: Introduce a new volume type `CSI` into the `Volume` 
protobuf message
 Key: MESOS-10147
 URL: https://issues.apache.org/jira/browse/MESOS-10147
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


See 
[here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.l7wa1w8789pg]
 for the detailed design.





[jira] [Commented] (MESOS-10142) CSI External Volumes MVP Design Doc

2020-07-06 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152443#comment-17152443
 ] 

Qian Zhang commented on MESOS-10142:


Design doc: 
https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit?usp=sharing

> CSI External Volumes MVP Design Doc
> ---
>
> Key: MESOS-10142
> URL: https://issues.apache.org/jira/browse/MESOS-10142
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Qian Zhang
>Priority: Major
>  Labels: csi, external-volumes, storage
>
> This ticket tracks the design doc for our initial implementation of external 
> volume support in Mesos using the CSI standard.





[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

2020-06-09 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923
 ] 

Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:37 AM:
--

When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
tries to page out the executable code of some processes and the OOM killer is 
not triggered at all; that means we may be hitting [this 
issue|https://askubuntu.com/questions/432809/why-is-kswapd0-running-on-a-computer-with-no-swap/1134491#1134491]:
{quote}It is a well known problem that when Linux runs out of memory it can 
enter swap loops instead of doing what it should be doing, killing processes to 
free up ram. There are an OOM (Out of Memory) killer that does this but only if 
Swap and RAM are full.

However this should not really be a problem. If there are a bunch of offending 
processes, for example Firefox and Chrome, each with tabs that are both using 
and grabbing memory, then these processes will cause swap read back. Linux then 
enters a loop where the same memory are being moved back and forth between 
memory and hard drive. This in turn causes priority inversion where swapping a 
few processes back and forth makes the system unresponsive.
{quote}


was (Author: qianzhang):
When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
tries to page out the executable code of some processes and the OOM killer is 
not triggered at all. 

> Mesos agent host may become unresponsive when it is under low memory pressure
> -
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host 
> (e.g., launches a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on 
> an agent host which has 32GB of memory), the whole agent host will become 
> unresponsive (no commands can be executed anymore, but it is still pingable). A few 
> minutes later the Mesos master will mark this agent as unreachable and update all 
> of its tasks' states to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> 

[jira] [Commented] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

2020-06-09 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129929#comment-17129929
 ] 

Qian Zhang commented on MESOS-10139:


I asked a 
[question|https://unix.stackexchange.com/questions/591566/why-does-linux-become-unresponsive-when-a-large-number-of-memory-is-used-oom-ca]
 on StackExchange about this issue and found that it is actually a problem that 
has been discussed in the Linux community for a long time. The solution is to 
run a daemon that monitors memory pressure and kills (or triggers the OOM 
killer to kill) a memory-hogging process when the system is in a low-memory 
condition.

[~greggomann] also suggests that we could fix this issue by setting 
`/sys/fs/cgroup/memory/mesos/memory.limit_in_bytes` to the allocatable memory 
of the agent (rather than leaving it at the default value) and also ensuring 
that `memory.use_hierarchy` is enabled. The [current 
logic|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/containerizer.cpp#L145:L158]
 to determine the allocatable memory for an agent node may also need to be 
changed: currently, in most cases we just leave 1GB for system services and all 
other memory can be offered to frameworks, but for agent nodes with relatively 
large memory that may not be enough. For example, on an agent node with 32GB of 
memory, the node may become unresponsive once 29GB has been used by tasks. So 
instead of an absolute value (1GB), we may adopt a relative ratio, e.g. leave 
10% of memory for system services and offer the other 90% to frameworks. But we 
need to figure out a reasonable and safe ratio.

> Mesos agent host may become unresponsive when it is under low memory pressure
> -
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host 
> (e.g., launches a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on 
> an agent host which has 32GB of memory), the whole agent host will become 
> unresponsive (no commands can be executed anymore, but it is still pingable). A few 
> minutes later the Mesos master will mark this agent as unreachable and update all 
> of its tasks' states to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  





[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

2020-06-09 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923
 ] 

Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:29 AM:
--

When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note swap is NOT enabled in the agent host, so it seems `kswapd0` tries 
to page out the executable code of some processes and OOM killer is not 
triggered at all. 


was (Author: qianzhang):
When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
tries to page out the executable code of some processes and the OOM killer is 
not triggered at all. 

> Mesos agent host may become unresponsive when it is under low memory pressure
> -
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host 
> (e.g., launches a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on 
> an agent host which has 32GB of memory), the whole agent host will become 
> unresponsive (no commands can be executed anymore, but it is still pingable). A few 
> minutes later the Mesos master will mark this agent as unreachable and update all 
> of its tasks' states to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  





[jira] [Commented] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

2020-06-09 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923
 ] 

Qian Zhang commented on MESOS-10139:


When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
tries to page out the executable code of some processes and the OOM killer is 
not triggered at all.


> Mesos agent host may become unresponsive when it is under low memory pressure
> -
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host 
> (e.g., launches a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on 
> an agent host which has 32GB of memory), the whole agent host will become 
> unresponsive (no commands can be executed anymore, but it is still pingable). A few 
> minutes later the Mesos master will mark this agent as unreachable and update all 
> of its tasks' states to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  





[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

2020-06-09 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129923#comment-17129923
 ] 

Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:28 AM:
--

When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
tries to page out the executable code of some processes and the OOM killer is 
not triggered at all. 


was (Author: qianzhang):
When this issue happens, via the `top` command I see that `wa` is high, which 
should be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
tries to page out the executable code of some processes and the OOM killer is 
not triggered at all.


> Mesos agent host may become unresponsive when it is under low memory pressure
> -
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host 
> (e.g., launches a task to run `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on 
> an agent host which has 32GB of memory), the whole agent host will become 
> unresponsive (no commands can be executed anymore, but it is still pingable). A few 
> minutes later the Mesos master will mark this agent as unreachable and update all 
> of its tasks' states to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212- (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

2020-06-09 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10139:
--

 Summary: Mesos agent host may become unresponsive when it is under 
low memory pressure
 Key: MESOS-10139
 URL: https://issues.apache.org/jira/browse/MESOS-10139
 Project: Mesos
  Issue Type: Bug
Reporter: Qian Zhang


When a user launches a task that uses a large amount of memory on an agent host 
(e.g., a task running `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an 
agent host which has 32GB of memory), the whole agent host will become 
unresponsive (no commands can be executed anymore, but it is still pingable). A few 
minutes later the Mesos master will mark this agent as unreachable and update all 
of its tasks' states to `TASK_UNREACHABLE`.
{code:java}
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: 
I0526 02:13:31.103382 15491 master.cpp:260] Scheduling transition of agent 
89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE because of health check 
timeout
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: 
I0526 02:13:31.103612 15491 master.cpp:8592] Marking agent 
89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health 
check timed out
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: 
I0526 02:13:31.108093 15495 master.cpp:8635] Marked agent 
89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health 
check timed out
…
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: 
I0526 02:13:31.108419 15495 master.cpp:11149] Updating the state of task 
app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 of framework 
89d2d679-fa08-49be-94c3-880ebb595212- (latest state: TASK_UNREACHABLE, 
status update state: TASK_UNREACHABLE)
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: 
I0526 02:13:31.108865 15495 master.cpp:11149] Updating the state of task 
app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 of framework 
89d2d679-fa08-49be-94c3-880ebb595212- (latest state: TASK_UNREACHABLE, 
status update state: TASK_UNREACHABLE)
...{code}
 





[jira] [Commented] (MESOS-7884) Support containerd on Mesos.

2020-06-07 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127810#comment-17127810
 ] 

Qian Zhang commented on MESOS-7884:
---

Understood. However, to support containerd we would need to implement another 
containerizer to integrate with it, and that is not our long-term plan. We would 
rather keep improving UCR with new features than maintain multiple 
containerizers.

> Support containerd on Mesos.
> 
>
> Key: MESOS-7884
> URL: https://issues.apache.org/jira/browse/MESOS-7884
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: containerd, containerizer
>
> containerd v1.0 is very close (v1.0.0-alpha.4 now) to its formal release. We 
> should consider supporting containerd on Mesos, either by refactoring the Docker 
> containerizer or introducing a new containerd containerizer. Designs and 
> suggestions are definitely welcome.
> https://github.com/containerd/containerd





[jira] [Comment Edited] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation

2020-05-29 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118250#comment-17118250
 ] 

Qian Zhang edited comment on MESOS-10126 at 5/29/20, 12:09 PM:
---

Master branch:

commit 2845330fbd78a80fb7e71c6101724655fa254392
 Author: Qian Zhang 
 Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.

Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers. That means we may never get a chance to unmount this volume
 on this agent; furthermore, if it is an EBS volume, it cannot be used by
 tasks launched on any other agents since an EBS volume can only be attached
 to one node at a time. The only workaround would be to manually unmount the volume.

So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase
 the container's `Info` struct before unmounting volumes.

Review: [https://reviews.apache.org/r/72516]

 

commit b7c3da5a28fb46b4517d52872aec504fff098967
 Author: Qian Zhang 
 Date: Sun May 17 23:30:38 2020 +0800

Added a test `ROOT_CommandTaskNoRootfsWithUnmountVolumeFailure`.

Review: [https://reviews.apache.org/r/72523]
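The reference-count leak fixed by the first commit can be sketched with a toy model (plain Python, not the actual C++ isolator code; the class, `infos`, and the mount set are simplified stand-ins). The key point is that `cleanup()` erases the container's `Info` before attempting the unmount, so a failed unmount can no longer leak a reference:

```python
# Toy model (not the real C++ isolator): each container holds an Info listing
# the volumes it uses; a volume's reference count is how many Infos mention it,
# and the volume is unmounted only when that count drops to zero.

class DockerVolumeIsolatorModel:
    def __init__(self):
        self.infos = {}          # container_id -> set of volume names
        self.mounted = set()     # volumes currently mounted on this agent

    def prepare(self, container_id, volumes):
        self.infos[container_id] = set(volumes)
        self.mounted.update(volumes)

    def _ref_count(self, volume):
        return sum(volume in vols for vols in self.infos.values())

    def cleanup(self, container_id, unmount_fails=False):
        # The fix: erase the container's Info *before* attempting to unmount,
        # so the reference count reflects reality even if the unmount fails.
        volumes = self.infos.pop(container_id)
        for volume in volumes:
            if self._ref_count(volume) == 0 and not unmount_fails:
                self.mounted.discard(volume)

model = DockerVolumeIsolatorModel()
model.prepare("c1", ["ebs-vol"])
model.cleanup("c1", unmount_fails=True)   # unmount fails, but Info is erased
print(model._ref_count("ebs-vol"))        # 0 -> a later cleanup can retry
```

With the old order (erase only after a successful unmount), the failed cleanup would leave the `Info` behind, the count would stay above zero forever, and no later container could ever trigger the unmount.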


was (Author: qianzhang):
commit 2845330fbd78a80fb7e71c6101724655fa254392
Author: Qian Zhang 
Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.
 
 Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers. That means we may never get a chance to unmount this volume
 on this agent; furthermore, if it is an EBS volume, it cannot be used by
 tasks launched on any other agents since an EBS volume can only be attached
 to one node at a time. The only workaround would be to manually unmount the volume.
 
 So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase
 the container's `Info` struct before unmounting volumes.
 
 Review: [https://reviews.apache.org/r/72516]

 

commit b7c3da5a28fb46b4517d52872aec504fff098967
Author: Qian Zhang 
Date: Sun May 17 23:30:38 2020 +0800

Added a test `ROOT_CommandTaskNoRootfsWithUnmountVolumeFailure`.
 
 Review: [https://reviews.apache.org/r/72523]

> Docker volume isolator needs to clean up the `info` struct regardless the 
> result of unmount operation
> -
>
> Key: MESOS-10126
> URL: https://issues.apache.org/jira/browse/MESOS-10126
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Critical
> Fix For: 1.4.4, 1.5.4, 1.6.3, 1.8.2, 1.9.1, 1.7.4, 1.11.0, 1.10.1
>
>
> Currently when 
> [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610]
>  is called, we will unmount the volume first, but if the unmount operation 
> fails we will not remove the container's checkpoint directory and will NOT erase 
> the container's `info` struct from `infos`. This is problematic because the 
> remaining `info` in `infos` causes the reference count of the volume to be 
> larger than 0 even though the volume is not being used by any containers. And 
> the next time another container using this volume is destroyed, we will NOT 
> unmount the volume since its reference count will be larger than 1 (it will be 
> 2 because of the leaked `info`; see 
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651]
>  for details), so we will never have a chance to unmount this volume.
> We have this issue since Mesos 1.0.0 release when Docker volume isolator was 
> introduced.





[jira] [Commented] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation

2020-05-29 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119535#comment-17119535
 ] 

Qian Zhang commented on MESOS-10126:


1.10.x:

commit 97251a90d3336bd628c82becca00f545d95b01aa
Author: Qian Zhang 
Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.
 
 Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers. That means we may never get a chance to unmount this volume
 on this agent; furthermore, if it is an EBS volume, it cannot be used by
 tasks launched on any other agents since an EBS volume can only be attached
 to one node at a time. The only workaround would be to manually unmount the volume.
 
 So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase
 the container's `Info` struct before unmounting volumes.
 
 Review: [https://reviews.apache.org/r/72516]

 

1.9.x:

commit dcce73d57b4d8866fedb3f287d978a135616afb3
Author: Qian Zhang 
Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.
 
 Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers. That means we may never get a chance to unmount this volume
 on this agent; furthermore, if it is an EBS volume, it cannot be used by
 tasks launched on any other agents since an EBS volume can only be attached
 to one node at a time. The only workaround would be to manually unmount the volume.
 
 So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase
 the container's `Info` struct before unmounting volumes.
 
 Review: [https://reviews.apache.org/r/72516]

 

1.8.x:

commit cdd3e2924596eecf605eeb73e9c57f23f6643936
Author: Qian Zhang 
Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.
 
 Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers. That means we may never get a chance to unmount this volume
 on this agent; furthermore, if it is an EBS volume, it cannot be used by
 tasks launched on any other agents since an EBS volume can only be attached
 to one node at a time. The only workaround would be to manually unmount the volume.
 
 So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase
 the container's `Info` struct before unmounting volumes.
 
 Review: [https://reviews.apache.org/r/72516]

 

1.7.x:

commit 819b9d8345e701321067f3b14ad2bb78b60d285c
Author: Qian Zhang 
Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.
 
 Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers. That means we may never get a chance to unmount this volume
 on this agent; furthermore, if it is an EBS volume, it cannot be used by
 tasks launched on any other agents since an EBS volume can only be attached
 to one node at a time. The only workaround would be to manually unmount the volume.
 
 So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to erase
 the container's `Info` struct before unmounting volumes.
 
 Review: [https://reviews.apache.org/r/72516]

 

1.6.x:

commit b0a57116c6794f5d0036ed9c3668f27f29155bd7
Author: Qian Zhang 
Date: Fri May 15 10:23:51 2020 +0800

Erased `Info` struct before unmounting volumes in Docker volume isolator.
 
 Currently when `DockerVolumeIsolatorProcess::cleanup()` is called, we will
 unmount the volume first, and if the unmount operation fails we will NOT
 erase the container's `Info` struct from `infos`. This is problematic
 because the remaining `Info` in `infos` causes the reference count of
 the volume to be greater than 0 even though the volume is not being used by
 any containers.

[jira] [Commented] (MESOS-10130) Docker Manifest list support

2020-05-19 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111231#comment-17111231
 ] 

Qian Zhang commented on MESOS-10130:


{quote}Btw, should manifest lists be supported by Mesos? IMO it makes sense 
because it runs on multiple architectures.
{quote}
Yes, I agree.

> Docker Manifest list support
> 
>
> Key: MESOS-10130
> URL: https://issues.apache.org/jira/browse/MESOS-10130
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Stéphane Cottin
>Priority: Major
>  Labels: containerization
>
> Sonatype Nexus 3.22+, and probably other docker registry solutions, now 
> serves manifest lists.
> [https://issues.sonatype.org/browse/NEXUS-18546]
> Apache Mesos does not yet support this part of the Image Manifest V2S2 spec.
> https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list
> This is not a critical issue as Sonatype Nexus is not a dependency of Apache 
> Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security 
> issues.
> [https://support.sonatype.com/hc/en-us/articles/360046233714]
> Apache Mesos should support the whole Image Manifest V2S2 specification.





[jira] [Commented] (MESOS-10130) Docker Manifest list support

2020-05-19 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111207#comment-17111207
 ] 

Qian Zhang commented on MESOS-10130:


[~kaalh] I think for multi-arch images in 
[https://registry-1.docker.io|https://registry-1.docker.io/], the registry 
supports both manifest lists and manifests. For example, I can get the manifest 
list and the manifest for the image `alpine:latest`:
{code:java}
$ DH_TOKEN=$(curl -fsSL 
"https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" 
 | jq -er '.token')


# Get manifest list
$ curl -s -S -L -i --raw -H "Authorization: Bearer ${DH_TOKEN}" -H "Accept: 
application/vnd.docker.distribution.manifest.list.v2+json" -y 60 
https://registry-1.docker.io:443/v2/library/alpine/manifests/latest 
HTTP/1.1 200 Connection established

HTTP/1.1 200 OK
Content-Length: 1638
Content-Type: application/vnd.docker.distribution.manifest.list.v2+json
Docker-Content-Digest: 
sha256:9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54
Docker-Distribution-Api-Version: registry/2.0
Etag: "sha256:9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54"
Date: Tue, 19 May 2020 12:58:33 GMT
Strict-Transport-Security: max-age=31536000

{"manifests":[{"digest":"sha256:39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"amd64","os":"linux"},"size":528},{"digest":"sha256:0ff8a9dffabb5ed8dcba4ee898f62683305b75b4086f433ee722db99138f4f53","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"arm","os":"linux","variant":"v6"},"size":528},{"digest":"sha256:19c4e520fa84832d6deab48cd911067e6d8b0a9fa73fc054c7b9031f1d89e4cf","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"arm","os":"linux","variant":"v7"},"size":528},{"digest":"sha256:ad295e950e71627e9d0d14cdc533f4031d42edae31ab57a841c5b9588eacc280","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"arm64","os":"linux","variant":"v8"},"size":528},{"digest":"sha256:b28e271d721b3f6377cb5bae6cd4506d2736e77ef6f70ed9b0c4716da8bdf17c","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"386","os":"linux"},"size":528},{"digest":"sha256:e095eb9ac24e21bf2621f4d243274197ef12b91c67cde023092301b2db1e073c","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"ppc64le","os":"linux"},"size":528},{"digest":"sha256:41ba0806c6113064dd4cff12212eea3088f40ae23f182763ccc07f430b3a52f8","mediaType":"application\/vnd.docker.distribution.manifest.v2+json","platform":{"architecture":"s390x","os":"linux"},"size":528}],"mediaType":"application\/vnd.docker.distribution.manifest.list.v2+json","schemaVersion":2}


# Get manifest
$ curl -s -S -L -i --raw -H "Authorization: Bearer ${DH_TOKEN}" -H "Accept: 
application/vnd.docker.distribution.manifest.v2+json" -y 60 
https://registry-1.docker.io:443/v2/library/alpine/manifests/latest
HTTP/1.1 200 Connection established

HTTP/1.1 200 OK
Content-Length: 528
Content-Type: application/vnd.docker.distribution.manifest.v2+json
Docker-Content-Digest: 
sha256:39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01
Docker-Distribution-Api-Version: registry/2.0
Etag: "sha256:39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01"
Date: Tue, 19 May 2020 12:56:23 GMT
Strict-Transport-Security: max-age=31536000

{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
  "mediaType": "application/vnd.docker.container.image.v1+json",
  "size": 1507,
  "digest": 
"sha256:f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a"
   },
   "layers": [
  {
 "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
 "size": 2813316,
 "digest": 
"sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08"
  }
   ]
}{code}
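A client consuming such a manifest list picks the entry whose `platform` matches its own and then fetches that per-platform manifest by digest. A minimal sketch of that selection step (plain Python; a hypothetical helper, not Mesos code, with the digests trimmed for brevity):

```python
import json

# Hypothetical helper: given a Docker manifest list (a JSON body like the one
# shown above), return the digest of the image manifest for a target platform.

def select_manifest(manifest_list_json, architecture, os="linux", variant=None):
    doc = json.loads(manifest_list_json)
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        if (platform.get("architecture") == architecture
                and platform.get("os") == os
                and platform.get("variant") == variant):
            return entry["digest"]
    return None  # no entry for this platform

# Trimmed-down version of the alpine:latest manifest list above
# (digests shortened for readability).
manifest_list = json.dumps({
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:39eda93d...",
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:ad295e95...",
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "platform": {"architecture": "arm64", "os": "linux", "variant": "v8"}},
    ],
})

print(select_manifest(manifest_list, "amd64"))
print(select_manifest(manifest_list, "arm64", variant="v8"))
```

The selected digest would then be requested from `/v2/<name>/manifests/<digest>` with the `application/vnd.docker.distribution.manifest.v2+json` Accept header, as in the second curl above.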

> Docker Manifest list support
> 
>
> Key: MESOS-10130
> URL: https://issues.apache.org/jira/browse/MESOS-10130
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Stéphane Cottin
>Priority: Major
>  Labels: containerization
>
> Sonatype Nexus 3.22+, and probably other docker registry solutions, now 
> serves manifest lists.
> [https://issues.sonatype.org/browse/NEXUS-18546]
> Apache Mesos does not yet support this part of the Image Manifest V2S2 spec.
> https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list
> This is not a critical issue as Sonatype Nexus is not a dependency of Apache 
> Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security 
> issues.
> 

[jira] [Comment Edited] (MESOS-10130) Docker Manifest list support

2020-05-19 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111010#comment-17111010
 ] 

Qian Zhang edited comment on MESOS-10130 at 5/19/20, 9:27 AM:
--

[~kaalh] Is the use of manifest list in Nexus 3.22+ optional or required? I 
guess it should be optional (for backward compatibility) so Mesos can still 
work with it via manifest, right?


was (Author: qianzhang):
[~kaalh] Is the use of manifest list in Nexus 3.22+ optional or required? I 
guess it should be optional so Mesos can still work with it via manifest, right?

> Docker Manifest list support
> 
>
> Key: MESOS-10130
> URL: https://issues.apache.org/jira/browse/MESOS-10130
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Stéphane Cottin
>Priority: Major
>  Labels: containerization
>
> Sonatype Nexus 3.22+, and probably other docker registry solutions, now 
> serves manifest lists.
> [https://issues.sonatype.org/browse/NEXUS-18546]
> Apache Mesos does not yet support this part of the Image Manifest V2S2 spec.
> https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list
> This is not a critical issue as Sonatype Nexus is not a dependency of Apache 
> Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security 
> issues.
> [https://support.sonatype.com/hc/en-us/articles/360046233714]
> Apache Mesos should support the whole Image Manifest V2S2 specification.





[jira] [Commented] (MESOS-10130) Docker Manifest list support

2020-05-19 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111010#comment-17111010
 ] 

Qian Zhang commented on MESOS-10130:


[~kaalh] Is the use of manifest list in Nexus 3.22+ optional or required? I 
guess it should be optional so Mesos can still work with it via manifest, right?

> Docker Manifest list support
> 
>
> Key: MESOS-10130
> URL: https://issues.apache.org/jira/browse/MESOS-10130
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Stéphane Cottin
>Priority: Major
>  Labels: containerization
>
> Sonatype Nexus 3.22+, and probably other docker registry solutions, now 
> serves manifest lists.
> [https://issues.sonatype.org/browse/NEXUS-18546]
> Apache Mesos does not yet support this part of the Image Manifest V2S2 spec.
> https://docs.docker.com/registry/spec/manifest-v2-2/#manifest-list
> This is not a critical issue as Sonatype Nexus is not a dependency of Apache 
> Mesos, but as we cannot use Nexus > 3.21.2, this leads to side security 
> issues.
> [https://support.sonatype.com/hc/en-us/articles/360046233714]
> Apache Mesos should support the whole Image Manifest V2S2 specification.





[jira] [Commented] (MESOS-7884) Support containerd on Mesos.

2020-05-19 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110980#comment-17110980
 ] 

Qian Zhang commented on MESOS-7884:
---

[~xiaowei-cuc] I do not think we have an immediate plan for this. Could you 
please let us know your specific use cases? For example, why do you need 
containerd support? What features are missing from the Docker support in our 
current Docker containerizer?

> Support containerd on Mesos.
> 
>
> Key: MESOS-7884
> URL: https://issues.apache.org/jira/browse/MESOS-7884
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Major
>  Labels: containerd, containerizer
>
> containerd v1.0 is very close (v1.0.0-alpha.4 now) to its formal release. We 
> should consider supporting containerd on Mesos, either by refactoring the Docker 
> containerizer or introducing a new containerd containerizer. Designs and 
> suggestions are definitely welcome.
> https://github.com/containerd/containerd





[jira] [Commented] (MESOS-10127) The sequences used in Docker volume isolator are never erased

2020-05-18 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110321#comment-17110321
 ] 

Qian Zhang commented on MESOS-10127:


It seems there is no proper place in the Docker volume isolator's code to erase 
the sequence.

If we erase the sequence right after the unmount operation is invoked (e.g., 
right after [this 
line|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L658]),
 and another container tries to use the same volume at the same time (so we 
need to mount the volume), then unmount and mount operations could happen 
simultaneously for the same volume, which is exactly what the sequence is meant 
to avoid.

If we erase the sequence after the unmount operation is complete (e.g., in 
[DockerVolumeIsolatorProcess::_cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L670]),
 and another container starts using the same volume after the unmount completes 
but before the sequence is erased, then we could erase the sequence while the 
mount operation is still ongoing, which may cause the mount operation to be 
discarded.
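The dilemma above can be illustrated with a toy model (plain Python, not the libprocess `Sequence` actually used by Mesos; the class and method names are hypothetical): per-volume FIFO queues serialize mount/unmount operations, and erasing a volume's queue is only safe when no operation is pending, otherwise queued operations are silently discarded.

```python
from collections import defaultdict, deque

# Toy model of per-volume serialization: each volume gets a FIFO of pending
# operations; operations for one volume never run concurrently.

class VolumeSequencer:
    def __init__(self):
        self.sequences = defaultdict(deque)   # volume -> pending operations

    def add(self, volume, op):
        """Queue a mount/unmount callable for this volume."""
        self.sequences[volume].append(op)

    def run_one(self, volume):
        """Run the oldest pending operation for `volume`."""
        self.sequences[volume].popleft()()

    def try_erase(self, volume):
        # Erasing is only safe if no operation is pending; erasing earlier
        # would discard queued operations (the second hazard described above).
        if volume in self.sequences and not self.sequences[volume]:
            del self.sequences[volume]
            return True
        return False

seq = VolumeSequencer()
seq.add("vol1", lambda: print("unmount vol1"))
print(seq.try_erase("vol1"))   # False: an unmount is still queued
seq.run_one("vol1")
print(seq.try_erase("vol1"))   # True: queue drained, safe to erase
```

The hard part in the real isolator is that "queue drained" and "erase" must be checked atomically with respect to new containers queueing mounts, which is why neither of the two candidate erase points in the comment above is safe on its own.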

> The sequences used in Docker volume isolator are never erased
> -
>
> Key: MESOS-10127
> URL: https://issues.apache.org/jira/browse/MESOS-10127
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Priority: Major
>
> In Docker volume isolator, we use 
> [sequence|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp#L119:L122]
>  to make sure the mount and unmount operations for a single volume are issued 
> serially, but the sequence is never erased which could be a memory leak.
> We have this issue since Mesos 1.0.0 release when Docker volume isolator was 
> introduced.





[jira] [Comment Edited] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation

2020-05-17 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108245#comment-17108245
 ] 

Qian Zhang edited comment on MESOS-10126 at 5/18/20, 2:49 AM:
--

RR:

[https://reviews.apache.org/r/72516/]

[https://reviews.apache.org/r/72523/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/72516/]

> Docker volume isolator needs to clean up the `info` struct regardless the 
> result of unmount operation
> -
>
> Key: MESOS-10126
> URL: https://issues.apache.org/jira/browse/MESOS-10126
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Critical
>
> Currently when 
> [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610]
>  is called, we will unmount the volume first, but if the unmount operation 
> fails we will not remove the container's checkpoint directory and will NOT erase 
> the container's `info` struct from `infos`. This is problematic because the 
> remaining `info` in `infos` causes the reference count of the volume to be 
> larger than 0 even though the volume is not being used by any containers. And 
> the next time another container using this volume is destroyed, we will NOT 
> unmount the volume since its reference count will be larger than 1 (it will be 
> 2 because of the leaked `info`; see 
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651]
>  for details), so we will never have a chance to unmount this volume.
> We have this issue since Mesos 1.0.0 release when Docker volume isolator was 
> introduced.





[jira] [Assigned] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation

2020-05-15 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10126:
--

  Sprint: Studio 1: RI-23 68
Story Points: 3
Assignee: Qian Zhang

RR:

[https://reviews.apache.org/r/72516/]

> Docker volume isolator needs to clean up the `info` struct regardless the 
> result of unmount operation
> -
>
> Key: MESOS-10126
> URL: https://issues.apache.org/jira/browse/MESOS-10126
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Critical
>
> Currently when 
> [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610]
>  is called, we will unmount the volume first, but if the unmount operation 
> fails we will not remove the container's checkpoint directory and will NOT erase 
> the container's `info` struct from `infos`. This is problematic because the 
> remaining `info` in `infos` causes the reference count of the volume to be 
> larger than 0 even though the volume is not being used by any containers. And 
> the next time another container using this volume is destroyed, we will NOT 
> unmount the volume since its reference count will be larger than 1 (it will be 
> 2 because of the leaked `info`; see 
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651]
>  for details), so we will never have a chance to unmount this volume.
> We have this issue since Mesos 1.0.0 release when Docker volume isolator was 
> introduced.





[jira] [Created] (MESOS-10127) The sequences used in Docker volume isolator are never erased

2020-05-12 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10127:
--

 Summary: The sequences used in Docker volume isolator are never 
erased
 Key: MESOS-10127
 URL: https://issues.apache.org/jira/browse/MESOS-10127
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Qian Zhang


In Docker volume isolator, we use 
[sequence|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp#L119:L122]
 to make sure the mount and unmount operations for a single volume are issued 
serially, but the sequence is never erased which could be a memory leak.

We have this issue since Mesos 1.0.0 release when Docker volume isolator was 
introduced.
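The leak pattern can be sketched as follows. This is an illustrative reduction, not the real isolator code: `Sequence` stands in for the per-volume operation queue, and `releaseVolume` is a hypothetical name for the missing erase step.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <string>

// A per-volume queue that serializes mount/unmount operations.
struct Sequence {
  std::queue<std::function<void()>> ops;
};

struct Isolator {
  // volume name -> its sequence; entries are created on demand but,
  // in the buggy version, never erased, so the map grows with every
  // distinct volume ever used on the agent.
  std::map<std::string, Sequence> sequences;

  void add(const std::string& volume, std::function<void()> op) {
    sequences[volume].ops.push(std::move(op));  // creates entry on demand
  }

  // The fix sketched here: once a volume has no more users, erase its
  // sequence so the map does not grow without bound.
  void releaseVolume(const std::string& volume) {
    sequences.erase(volume);
  }
};
```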



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10126) Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation

2020-05-12 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10126:
--

 Summary: Docker volume isolator needs to clean up the `info` 
struct regardless the result of unmount operation
 Key: MESOS-10126
 URL: https://issues.apache.org/jira/browse/MESOS-10126
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Qian Zhang


Currently when 
[DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610]
 is called, we unmount the volume first, but if the unmount operation fails we 
neither remove the container's checkpoint directory nor erase the container's 
`info` struct from `infos`. This is problematic, because the stale `info` left 
in `infos` keeps the volume's reference count above 0 even though the volume is 
no longer used by any container. The next time another container using this 
volume is destroyed, we will NOT unmount the volume, since its reference count 
will be 2 rather than 1 (see 
[here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651]
 for details), so we never get a chance to unmount this volume.

We have this issue since Mesos 1.0.0 release when Docker volume isolator was 
introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`

2020-05-05 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099654#comment-17099654
 ] 

Qian Zhang commented on MESOS-10054:


[~mzhu] mentioned that they are using the Docker containerizer to launch custom 
executors in Docker containers, so we still need the above patch; I have 
reopened and committed it.

> Update Docker containerizer to set Docker container’s resource limits and 
> `oom_score_adj`
> -
>
> Key: MESOS-10054
> URL: https://issues.apache.org/jira/browse/MESOS-10054
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> This is to set resource limits for executor which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request

2020-05-05 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099630#comment-17099630
 ] 

Qian Zhang commented on MESOS-10049:


All the code changes about the newly introduced task status reason 
`REASON_CONTAINER_MEMORY_REQUEST_EXCEEDED` (i.e., the above two patches) have 
been reverted; see the patch below for details.

commit 6bb60a4869394f663a09370016127ae8688cbe06
Author: Qian Zhang 
Date: Mon Apr 27 22:34:51 2020 +0800

Reverted the changes about `REASON_CONTAINER_MEMORY_REQUEST_EXCEEDED`.
 
 The method `MemorySubsystemProcess::oomWaited()` will only be invoked when the
 container is OOM killed because it uses more memory than its hard memory limit
 (i.e., the task status reason `REASON_CONTAINER_LIMITATION_MEMORY`), it will
 NOT be invoked when a burstable container is OOM killed because the agent host
 is running out of memory, i.e., we will NOT receive OOM killing notification
 via cgroups notification API for this case. So it is not possible for Mesos to
 provide a task status reason `REASON_CONTAINER_MEMORY_REQUEST_EXCEEDED` for
 this case.
 
 Review: https://reviews.apache.org/r/72442

> Add a new reason in `TaskStatus::Reason` for the case that a task is 
> OOM-killed due to exceeding its memory request
> ---
>
> Key: MESOS-10049
> URL: https://issues.apache.org/jira/browse/MESOS-10049
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
> Fix For: 1.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request

2020-05-05 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099626#comment-17099626
 ] 

Qian Zhang commented on MESOS-10049:


commit be90edd31a1833c5ed706b39f3a5547ae8153dd2
Author: Greg Mann g...@mesosphere.io
Date: Mon Apr 6 15:16:45 2020 -0700


Sent appropriate task status reason when task over memory request.

Review: https://reviews.apache.org/r/72305/

> Add a new reason in `TaskStatus::Reason` for the case that a task is 
> OOM-killed due to exceeding its memory request
> ---
>
> Key: MESOS-10049
> URL: https://issues.apache.org/jira/browse/MESOS-10049
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
> Fix For: 1.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`

2020-04-28 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094262#comment-17094262
 ] 

Qian Zhang commented on MESOS-10053:


commit 68ce1476aebe10db7107c0f3dc813af78ec20cef
Author: Qian Zhang 
Date: Mon Apr 27 14:14:15 2020 +0800

Set OOM score adj when Docker container's memory limit is infinite.
 
 Review: https://reviews.apache.org/r/72435

> Update Docker executor to set Docker container’s resource limits and 
> `oom_score_adj`
> 
>
> Key: MESOS-10053
> URL: https://issues.apache.org/jira/browse/MESOS-10053
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> This is to set resource limits for command task which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`

2020-04-22 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087845#comment-17087845
 ] 

Qian Zhang edited comment on MESOS-10054 at 4/22/20, 8:32 AM:
--

RR: [https://reviews.apache.org/r/72391/]


was (Author: qianzhang):
RR: 

[https://reviews.apache.org/r/72401/]

[https://reviews.apache.org/r/72391/]

> Update Docker containerizer to set Docker container’s resource limits and 
> `oom_score_adj`
> -
>
> Key: MESOS-10054
> URL: https://issues.apache.org/jira/browse/MESOS-10054
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for executor which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-8877) Docker container's resources will be wrongly enlarged in cgroups after agent recovery

2020-04-22 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-8877:
-

Story Points: 3  (was: 5)
Assignee: Qian Zhang

RR: [https://reviews.apache.org/r/72401/]

> Docker container's resources will be wrongly enlarged in cgroups after agent 
> recovery
> -
>
> Key: MESOS-8877
> URL: https://issues.apache.org/jira/browse/MESOS-8877
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.6.1, 1.6.0, 1.5.1, 1.5.0, 1.4.2, 1.4.1, 1.4.0
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: containerization
>
> Reproduce steps:
> 1. Run `mesos-execute --master=10.0.49.2:5050 
> --task=[file:///home/qzhang/workspace/config/task_docker.json] 
> --checkpoint=true` to launch a Docker container.
> {code:java}
> # cat task_docker.json 
> {
>   "name": "test",
>   "task_id": {"value" : "test"},
>   "agent_id": {"value" : ""},
>   "resources": [
> {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
> {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
>   ],
>   "command": {
> "value": "sleep 5"
>   },
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> 2. When the Docker container is running, we can see its resources in cgroups 
> are correctly set, so far so good.
> {code:java}
> # cat 
> /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/cpu.cfs_quota_us
>  
> 1
> # cat 
> /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/memory.limit_in_bytes
>  
> 33554432
> {code}
> 3. Restart the Mesos agent, and then we will see that the resources of the 
> Docker container are wrongly enlarged.
> {code}
> I0503 02:06:17.268340 29512 docker.cpp:1855] Updated 'cpu.shares' to 204 at 
> /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106
>  for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> I0503 02:06:17.271390 29512 docker.cpp:1882] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 
> 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> I0503 02:06:17.273082 29512 docker.cpp:1924] Updated 
> 'memory.soft_limit_in_bytes' to 64MB for container 
> 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> I0503 02:06:17.275908 29512 docker.cpp:1950] Updated 'memory.limit_in_bytes' 
> to 64MB at 
> /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106
>  for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> # cat 
> /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/cpu.cfs_quota_us
> 2
> # cat 
> /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/memory.limit_in_bytes
> 67108864
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10117) Update the `usage()` method of containerizer to set resource limits in the `ResourceStatistics` protobuf message

2020-04-22 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088682#comment-17088682
 ] 

Qian Zhang edited comment on MESOS-10117 at 4/22/20, 8:15 AM:
--

RR:

[https://reviews.apache.org/r/72398/]

[https://reviews.apache.org/r/72399/]

[https://reviews.apache.org/r/72402/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/72398/]

[https://reviews.apache.org/r/72399/]

[https://reviews.apache.org/r/72400/]

[https://reviews.apache.org/r/72402/]

> Update the `usage()` method of containerizer to set resource limits in the 
> `ResourceStatistics` protobuf message
> 
>
> Key: MESOS-10117
> URL: https://issues.apache.org/jira/browse/MESOS-10117
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> In the `ResourceStatistics` protobuf message, there are a couple of issues:
>  # There are already `cpu_limit` and `mem_limit_bytes` fields, but they 
> actually hold the CPU & memory requests when resource limits are specified 
> for a task.
>  # There is already a `mem_soft_limit_bytes` field, but this field does not 
> seem to be set anywhere.
> So we need to update this protobuf message and also the related containerizer 
> code which sets the fields of this protobuf message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`

2020-04-21 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087845#comment-17087845
 ] 

Qian Zhang edited comment on MESOS-10054 at 4/21/20, 1:22 PM:
--

RR: 

[https://reviews.apache.org/r/72401/]

[https://reviews.apache.org/r/72391/]


was (Author: qianzhang):
RR: [https://reviews.apache.org/r/72391/]

> Update Docker containerizer to set Docker container’s resource limits and 
> `oom_score_adj`
> -
>
> Key: MESOS-10054
> URL: https://issues.apache.org/jira/browse/MESOS-10054
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for executor which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`

2020-04-20 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10054:
--

Assignee: Qian Zhang

> Update Docker containerizer to set Docker container’s resource limits and 
> `oom_score_adj`
> -
>
> Key: MESOS-10054
> URL: https://issues.apache.org/jira/browse/MESOS-10054
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for executor which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10117) Update the `usage()` method of containerizer to set resource limits in `ResourceStatistics`

2020-04-15 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10117:
--

 Summary: Update the `usage()` method of containerizer to set 
resource limits in `ResourceStatistics`
 Key: MESOS-10117
 URL: https://issues.apache.org/jira/browse/MESOS-10117
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Qian Zhang
Assignee: Qian Zhang


In the `ResourceStatistics` protobuf message, there are a couple of issues:
 # There are already `cpu_limit` and `mem_limit_bytes` fields, but they 
actually hold the CPU & memory requests when resource limits are specified for 
a task.
 # There is already a `mem_soft_limit_bytes` field, but this field does not 
seem to be set anywhere.

So we need to update this protobuf message and also the related containerizer 
code which sets the fields of this protobuf message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10115) Add document for task resource limits

2020-04-14 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10115:
--

Assignee: Greg Mann  (was: Qian Zhang)

> Add document for task resource limits
> -
>
> Key: MESOS-10115
> URL: https://issues.apache.org/jira/browse/MESOS-10115
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10115) Add document for task resource limits

2020-04-14 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10115:
--

 Summary: Add document for task resource limits
 Key: MESOS-10115
 URL: https://issues.apache.org/jira/browse/MESOS-10115
 Project: Mesos
  Issue Type: Task
  Components: documentation
Reporter: Qian Zhang
Assignee: Qian Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10048) Update the memory subsystem in the cgroup isolator to set container’s memory resource limits and `oom_score_adj`

2020-03-24 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065426#comment-17065426
 ] 

Qian Zhang commented on MESOS-10048:


https://reviews.apache.org/r/72263/

> Update the memory subsystem in the cgroup isolator to set container’s memory 
> resource limits and `oom_score_adj`
> 
>
> Key: MESOS-10048
> URL: https://issues.apache.org/jira/browse/MESOS-10048
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> Update the memory subsystem in the cgroup isolator to set container’s memory 
> resource limits and `oom_score_adj`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10046) Launch executor container with resource limits

2020-03-20 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986931#comment-16986931
 ] 

Qian Zhang edited comment on MESOS-10046 at 3/20/20, 9:13 AM:
--

RR:

[https://reviews.apache.org/r/71856/]

[https://reviews.apache.org/r/71858/]


was (Author: qianzhang):
RR: [https://reviews.apache.org/r/71856/]

> Launch executor container with resource limits
> --
>
> Key: MESOS-10046
> URL: https://issues.apache.org/jira/browse/MESOS-10046
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> We need to add resource limits into `ContainerConfig` first, and then set the 
> resource limits in it according to the executor/task resource limits when 
> launching the executor container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10064) Accommodate the "Infinity" value in JSON

2020-03-20 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063209#comment-17063209
 ] 

Qian Zhang edited comment on MESOS-10064 at 3/20/20, 9:11 AM:
--

commit 0b47b43d290494fc1c6a6f6241ddfbceeb686997
 Author: Qian Zhang 
 Date: Sun Feb 23 09:53:32 2020 +0800

Added patch for RapidJSON.

This commit updates the writer of RapidJSON to write infinite
 floating point numbers as "Infinity" and "-Infinity" (i.e.,
 with double quotes) rather than Infinity and -Infinity. This
 is to ensure the strings converted from JSON objects conform
 to the rule defined by Protobuf:
 [https://developers.google.com/protocol-buffers/docs/proto3#json]

Review: [https://reviews.apache.org/r/72161]

 

commit ec82a516918ebd663816cb110f73bdee6e5268be
 Author: Qian Zhang 
 Date: Sun Feb 23 10:09:48 2020 +0800

Accommodated the "Infinity" value in the JSON <-> Protobuf conversion.

Review: [https://reviews.apache.org/r/72162]
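The serialization rule this commit implements can be shown with a small sketch. This is not the actual RapidJSON patch — `writeDouble` is an illustrative helper — but it captures the behavior: protobuf's proto3 JSON mapping spells non-finite doubles as the quoted strings `"Infinity"`, `"-Infinity"`, and `"NaN"`, whereas a naive writer emits the bare (invalid-JSON) tokens `Infinity` / `-Infinity`.

```cpp
#include <cmath>
#include <limits>
#include <sstream>
#include <string>

// Serialize a double as a JSON value, quoting non-finite numbers the
// way the protobuf JSON mapping requires.
std::string writeDouble(double value) {
  if (std::isnan(value)) {
    return "\"NaN\"";
  }
  if (std::isinf(value)) {
    return value > 0 ? "\"Infinity\"" : "\"-Infinity\"";
  }
  std::ostringstream out;
  out << value;  // finite values are written as plain JSON numbers
  return out.str();
}
```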


was (Author: qianzhang):
commit 0b47b43d290494fc1c6a6f6241ddfbceeb686997
Author: Qian Zhang 
Date: Sun Feb 23 09:53:32 2020 +0800

Added patch for RapidJSON.
 
 This commit updates the writer of RapidJSON to write infinite
 floating point numbers as "Infinity" and "-Infinity" (i.e.,
 with double quotes) rather than Infinity and -Infinity. This
 is to ensure the strings converted from JSON objects conform
 to the rule defined by Protobuf:
 https://developers.google.com/protocol-buffers/docs/proto3#json
 
 Review: [https://reviews.apache.org/r/72161]

commit ec82a516918ebd663816cb110f73bdee6e5268be
Author: Qian Zhang 
Date: Sun Feb 23 10:09:48 2020 +0800

Accommodated the "Infinity" value in the JSON <-> Protobuf conversion.
 
 Review: [https://reviews.apache.org/r/72162]

> Accommodate the "Infinity" value in JSON
> 
>
> Key: MESOS-10064
> URL: https://issues.apache.org/jira/browse/MESOS-10064
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> See 
> [here|https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?ts=5de78977#heading=h.ejuvxat6x3eb]
>  for what needs to be done for this ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10064) Accommodate the "Infinity" value in JSON

2020-03-20 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044015#comment-17044015
 ] 

Qian Zhang edited comment on MESOS-10064 at 3/20/20, 9:09 AM:
--

RR:

[https://reviews.apache.org/r/72161/]

[https://reviews.apache.org/r/72161/]


was (Author: qianzhang):
RR: https://reviews.apache.org/r/72161/

> Accommodate the "Infinity" value in JSON
> 
>
> Key: MESOS-10064
> URL: https://issues.apache.org/jira/browse/MESOS-10064
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> See 
> [here|https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?ts=5de78977#heading=h.ejuvxat6x3eb]
>  for what needs to be done for this ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`

2020-03-07 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017766#comment-17017766
 ] 

Qian Zhang edited comment on MESOS-10053 at 3/7/20, 9:04 AM:
-

RR:

[https://reviews.apache.org/r/72022/]

[https://reviews.apache.org/r/72027/]

[https://reviews.apache.org/r/72211/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/72022/]

[https://reviews.apache.org/r/72027/]

> Update Docker executor to set Docker container’s resource limits and 
> `oom_score_adj`
> 
>
> Key: MESOS-10053
> URL: https://issues.apache.org/jira/browse/MESOS-10053
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for command task which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits

2020-03-07 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989493#comment-16989493
 ] 

Qian Zhang edited comment on MESOS-10047 at 3/7/20, 9:00 AM:
-

RR:

[https://reviews.apache.org/r/71886/]

[https://reviews.apache.org/r/71953/]

[https://reviews.apache.org/r/71955/]

[https://reviews.apache.org/r/71956/]

[https://reviews.apache.org/r/72210/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/71886/]

[https://reviews.apache.org/r/71953/]

[https://reviews.apache.org/r/71955/]

[https://reviews.apache.org/r/71956/]

> Update the CPU subsystem in the cgroup isolator to set container's CPU 
> resource limits
> --
>
> Key: MESOS-10047
> URL: https://issues.apache.org/jira/browse/MESOS-10047
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10064) Accommodate the "Infinity" value in JSON

2020-02-20 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10064:
--

Assignee: Qian Zhang

> Accommodate the "Infinity" value in JSON
> 
>
> Key: MESOS-10064
> URL: https://issues.apache.org/jira/browse/MESOS-10064
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> See 
> [here|https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?ts=5de78977#heading=h.ejuvxat6x3eb]
>  for what needs to be done for this ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10051) Update the `LaunchContainer` agent API to support container resource limits

2020-01-23 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10051:
--

  Sprint: Studio 1: RI-23 64
Story Points: 2
Assignee: Qian Zhang

RR: https://reviews.apache.org/r/72040/

> Update the `LaunchContainer` agent API to support container resource limits
> ---
>
> Key: MESOS-10051
> URL: https://issues.apache.org/jira/browse/MESOS-10051
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10063) Update default executor to call `LAUNCH_CONTAINER` to launch nested containers

2020-01-23 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10063:
--

  Sprint: Studio 1: RI-23 64
Story Points: 2
Assignee: Qian Zhang

RR: [https://reviews.apache.org/r/72041/]

> Update default executor to call `LAUNCH_CONTAINER` to launch nested containers
> --
>
> Key: MESOS-10063
> URL: https://issues.apache.org/jira/browse/MESOS-10063
> Project: Mesos
>  Issue Type: Task
>  Components: executor
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> The default executor will be updated to use the LAUNCH_CONTAINER call instead 
> of the LAUNCH_NESTED_CONTAINER call when launching nested containers. This 
> will allow the default executor to set task limits when launching its task 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`

2020-01-20 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017766#comment-17017766
 ] 

Qian Zhang edited comment on MESOS-10053 at 1/20/20 8:59 AM:
-

RR:

[https://reviews.apache.org/r/72022/]

[https://reviews.apache.org/r/72027/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/72022/]

> Update Docker executor to set Docker container’s resource limits and 
> `oom_score_adj`
> 
>
> Key: MESOS-10053
> URL: https://issues.apache.org/jira/browse/MESOS-10053
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for command task which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10053) Update Docker executor to set Docker container’s resource limits and `oom_score_adj`

2020-01-16 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10053:
--

  Sprint: Studio 1: RI-23 64
Story Points: 3
Assignee: Qian Zhang

RR:

[https://reviews.apache.org/r/72022/]

> Update Docker executor to set Docker container’s resource limits and 
> `oom_score_adj`
> 
>
> Key: MESOS-10053
> URL: https://issues.apache.org/jira/browse/MESOS-10053
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for command task which will run as a Docker 
> container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10087) Updated master & agent's HTTP endpoints for showing resource limits

2020-01-13 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10087:
--

 Summary: Updated master & agent's HTTP endpoints for showing 
resource limits
 Key: MESOS-10087
 URL: https://issues.apache.org/jira/browse/MESOS-10087
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


We need to update Mesos master's `/state`, `/frameworks`, `/tasks` endpoints 
and agent's `/state` endpoint to show task's resource limits in their outputs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10050) Update the `update()` method of containerizer to handle container resource limits

2020-01-11 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008336#comment-17008336
 ] 

Qian Zhang edited comment on MESOS-10050 at 1/11/20 2:59 PM:
-

RR:

[https://reviews.apache.org/r/71950/]

[https://reviews.apache.org/r/71951/]

[https://reviews.apache.org/r/71952/]

[https://reviews.apache.org/r/71983/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/71950/]

[https://reviews.apache.org/r/71951/]

[https://reviews.apache.org/r/71952/]

> Update the `update()` method of containerizer to handle container resource 
> limits
> -
>
> Key: MESOS-10050
> URL: https://issues.apache.org/jira/browse/MESOS-10050
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits

2020-01-06 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989493#comment-16989493
 ] 

Qian Zhang edited comment on MESOS-10047 at 1/6/20 8:50 AM:


RR:

[https://reviews.apache.org/r/71886/]

[https://reviews.apache.org/r/71953/]

[https://reviews.apache.org/r/71955/]

[https://reviews.apache.org/r/71956/]


was (Author: qianzhang):
RR:

[https://reviews.apache.org/r/71886/]

[https://reviews.apache.org/r/71953/]

> Update the CPU subsystem in the cgroup isolator to set container's CPU 
> resource limits
> --
>
> Key: MESOS-10047
> URL: https://issues.apache.org/jira/browse/MESOS-10047
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits

2020-01-05 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989493#comment-16989493
 ] 

Qian Zhang edited comment on MESOS-10047 at 1/6/20 6:22 AM:


RR:

[https://reviews.apache.org/r/71886/]

[https://reviews.apache.org/r/71953/]


was (Author: qianzhang):
RR: [https://reviews.apache.org/r/71886/]

> Update the CPU subsystem in the cgroup isolator to set container's CPU 
> resource limits
> --
>
> Key: MESOS-10047
> URL: https://issues.apache.org/jira/browse/MESOS-10047
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10050) Update the `update()` method of containerizer to handle container resource limits

2020-01-05 Thread Qian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-10050:
--

  Sprint: Studio 1: RI-22 62
Story Points: 5
Assignee: Qian Zhang

> Update the `update()` method of containerizer to handle container resource 
> limits
> -
>
> Key: MESOS-10050
> URL: https://issues.apache.org/jira/browse/MESOS-10050
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10050) Update the `update()` method of containerizer to handle container resource limits

2020-01-05 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008336#comment-17008336
 ] 

Qian Zhang commented on MESOS-10050:


RR:

[https://reviews.apache.org/r/71950/]

[https://reviews.apache.org/r/71951/]

[https://reviews.apache.org/r/71952/]

> Update the `update()` method of containerizer to handle container resource 
> limits
> -
>
> Key: MESOS-10050
> URL: https://issues.apache.org/jira/browse/MESOS-10050
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10048) Update the memory subsystem in the cgroup isolator to set container’s memory resource limits and `oom_score_adj`

2020-01-01 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006525#comment-17006525
 ] 

Qian Zhang commented on MESOS-10048:


RR:

[https://reviews.apache.org/r/71943/]

[https://reviews.apache.org/r/71944/]

> Update the memory subsystem in the cgroup isolator to set container’s memory 
> resource limits and `oom_score_adj`
> 
>
> Key: MESOS-10048
> URL: https://issues.apache.org/jira/browse/MESOS-10048
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> [https://reviews.apache.org/r/71944/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

