[jira] [Created] (MESOS-9041) Break agent dependencies out of libmesos.
James Peach created MESOS-9041: -- Summary: Break agent dependencies out of libmesos. Key: MESOS-9041 URL: https://issues.apache.org/jira/browse/MESOS-9041 Project: Mesos Issue Type: Task Components: agent, build Reporter: James Peach {{libmesos.so}} includes all the dependencies for both the master and the agent. This means that it has far more symbols than necessary (causing inflated build times), and drags in dependencies (e.g. libnl.so, libblkid.so) that are only needed on the agent. We should attempt to separate the agent code out of {{libmesos.so}}, which would improve build cleanliness and hopefully performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.
[ https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528651#comment-16528651 ] James Peach commented on MESOS-9040: /cc [~benjaminhindman] > Break scheduler driver dependency on mesos-local. > - > > Key: MESOS-9040 > URL: https://issues.apache.org/jira/browse/MESOS-9040 > Project: Mesos > Issue Type: Task > Components: build, scheduler driver >Reporter: James Peach >Priority: Minor > > The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies > on the {{mesos-local}} code. This seems fairly hacky, but it also causes > binary dependencies on {{src/local/local.cpp}} to be dragged into > {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which > could be isolated in the {{mesos-local}} command. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9040) Break scheduler driver dependency on mesos-local.
James Peach created MESOS-9040: -- Summary: Break scheduler driver dependency on mesos-local. Key: MESOS-9040 URL: https://issues.apache.org/jira/browse/MESOS-9040 Project: Mesos Issue Type: Task Components: build, scheduler driver Reporter: James Peach The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies on the {{mesos-local}} code. This seems fairly hacky, but it also causes binary dependencies on {{src/local/local.cpp}} to be dragged into {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which could be isolated in the {{mesos-local}} command. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9030) mock_slave.cpp fails to build with GCC 8.
[ https://issues.apache.org/jira/browse/MESOS-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524610#comment-16524610 ] James Peach commented on MESOS-9030: Verified that using googletest master doesn't fix this. > mock_slave.cpp fails to build with GCC 8. > - > > Key: MESOS-9030 > URL: https://issues.apache.org/jira/browse/MESOS-9030 > Project: Mesos > Issue Type: Task > Components: build, test > Reporter: James Peach > Priority: Major -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9030) mock_slave.cpp fails to build with GCC 8.
[ https://issues.apache.org/jira/browse/MESOS-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524348#comment-16524348 ] James Peach commented on MESOS-9030: {noformat} $ gcc --version gcc (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1) {noformat} > mock_slave.cpp fails to build with GCC 8. > - > > Key: MESOS-9030 > URL: https://issues.apache.org/jira/browse/MESOS-9030 > Project: Mesos > Issue Type: Task > Components: build, test > Reporter: James Peach > Priority: Major -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9030) mock_slave.cpp fails to build with GCC 8.
James Peach created MESOS-9030: -- Summary: mock_slave.cpp fails to build with GCC 8. Key: MESOS-9030 URL: https://issues.apache.org/jira/browse/MESOS-9030 Project: Mesos Issue Type: Task Components: build, test Reporter: James Peach {noformat} In file included from ../../include/mesos/authentication/secret_generator.hpp:22, from ../../src/tests/mock_slave.cpp:19: ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of ‘process::Future::Future(const U&) [with U = testing::Matcher&>&>; T = Nothing]’: /usr/include/c++/8/type_traits:932:12: required from ‘struct std::is_constructible&, testing::Matcher&>&>&&>’ /usr/include/c++/8/type_traits:138:12: required from ‘struct std::__and_&, testing::Matcher&>&>&&> >’ /usr/include/c++/8/tuple:485:68: required from ‘static constexpr bool std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements = {testing::Matcher&>&>}; bool = true; _Elements = {const process::Future&}]’ /usr/include/c++/8/tuple:641:59: required by substitution of ‘template&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_MoveConstructibleTuple<_UElements ...>() && std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) && (1 >= 1)), bool>::type > constexpr std::tuple&>::tuple(_UElements&& ...) [with _UElements = {testing::Matcher&>&>}; typename std::enable_if<((std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_MoveConstructibleTuple<_UElements ...>() && std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... 
(_UElements) == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) && (1 >= 1)), bool>::type = 1]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:894:37: required from ‘testing::internal::TypedExpectation::TypedExpectation(testing::internal::FunctionMockerBase*, const char*, int, const string&, const ArgumentMatcherTuple&) [with F = void(const process::Future&); testing::internal::string = std::__cxx11::basic_string; testing::internal::TypedExpectation::ArgumentMatcherTuple = std::tuple&> >]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9: required from ‘testing::internal::TypedExpectation& testing::internal::FunctionMockerBase::AddNewExpectation(const char*, int, const string&, const ArgumentMatcherTuple&) [with F = void(const process::Future&); testing::internal::string = std::__cxx11::basic_string; testing::internal::FunctionMockerBase::ArgumentMatcherTuple = std::tuple&> >]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1273:43: required from ‘testing::internal::TypedExpectation& testing::internal::MockSpec::InternalExpectedAt(const char*, int, const char*, const char*) [with F = void(const process::Future&)]’ ../../src/tests/mock_slave.cpp:139:3: required from here ../../3rdparty/libprocess/include/process/future.hpp:1092:3: error: no matching function for call to ‘process::Future::set(const testing::Matcher&>&>&)’ set(u); ^~~ ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: candidate: ‘bool process::Future::set(const T&) [with T = Nothing]’ bool Future::set(const T& t) ^ ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: no known conversion for argument 1 from ‘const testing::Matcher&>&>’ to ‘const Nothing&’ ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: candidate: ‘bool process::Future::set(T&&) [with 
T = Nothing]’ bool Future::set(T&& t) ^ ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: no known conversion for argument 1 from ‘const testing::Matcher&>&>’ to ‘Nothing&&’ ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of ‘process::Future::Future(const U&) [with U = const testing::MatcherInterface&>&>*; T = Nothing]’: /usr/include/c++/8/type_traits:932:12: required from ‘struct std::is_constructible&, const testing::MatcherInterface&>&>*&>’ /usr/include/c++/8/type_traits:138:12: required from ‘struct std::__and_&, const testing::MatcherInterface&>&>*&> >’ /usr/include/c++/8/tuple:485:68: required from ‘static constexpr bool std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements = {const testing::MatcherInterface&>&>*&}; bool = true; _Elements = {const process::Future&}]’ /usr/include/c++/8/tuple:641:59: required by substitution of ‘template&>::_NotSameTuple<_UElements ...>()), const
[jira] [Commented] (MESOS-9021) Specify allowed devices for tasks
[ https://issues.apache.org/jira/browse/MESOS-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520626#comment-16520626 ] James Peach commented on MESOS-9021: Added link to design doc. This is basically straightforward, but we need to think through the security implications and the mechanism by which operators can apply access control. > Specify allowed devices for tasks > - > > Key: MESOS-9021 > URL: https://issues.apache.org/jira/browse/MESOS-9021 > Project: Mesos > Issue Type: Task > Components: containerization > Reporter: James Peach > Priority: Minor > > Container devices can be specified globally, but not for specific tasks. We > should extend the API to allow schedulers to specify allowed devices for > particular tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9021) Specify allowed devices for tasks
James Peach created MESOS-9021: -- Summary: Specify allowed devices for tasks Key: MESOS-9021 URL: https://issues.apache.org/jira/browse/MESOS-9021 Project: Mesos Issue Type: Task Components: containerization Reporter: James Peach Container devices can be specified globally, but not for specific tasks. We should extend the API to allow schedulers to specify allowed devices for particular tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9002) Mem access error in os::Fork::Tree
[ https://issues.apache.org/jira/browse/MESOS-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9002: -- Assignee: James Peach Priority: Minor (was: Major) Fix Version/s: 1.7.0 | [r/67614|https://reviews.apache.org/r/67614] | Removed memcpy from os::Fork::instantiate. | > Mem access error in os::Fork::Tree > -- > > Key: MESOS-9002 > URL: https://issues.apache.org/jira/browse/MESOS-9002 > Project: Mesos > Issue Type: Task >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Fix For: 1.7.0 > > > Building Mesos with gcc 8.1 (Fedora 28) > {noformat} > ../../3rdparty/stout/include/stout/os/posix/fork.hpp: In member function > ‘pid_t os::Fork::instantiate(const os::Fork::Tree&) const’: > ../../3rdparty/stout/include/stout/os/posix/fork.hpp:354:61: error: ‘void* > memcpy(void*, const void*, size_t)’ writing to an object of type ‘using > element_type = std::remove_extent::type’ {aka ‘struct > os::Fork::Tree::Memory’} with no trivial copy-assignment > [-Werror=class-memaccess] > memcpy(tree.memory.get(), , sizeof(Tree::Memory)); > ^ > ../../3rdparty/stout/include/stout/os/posix/fork.hpp:235:12: note: ‘using > element_type = std::remove_extent::type’ {aka ‘struct > os::Fork::Tree::Memory’} declared here > struct Memory { > ^~ > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9002) Mem access error in os::Fork::Tree
James Peach created MESOS-9002: -- Summary: Mem access error in os::Fork::Tree Key: MESOS-9002 URL: https://issues.apache.org/jira/browse/MESOS-9002 Project: Mesos Issue Type: Task Reporter: James Peach Building Mesos with gcc 8.1 (Fedora 28) {noformat} ../../3rdparty/stout/include/stout/os/posix/fork.hpp: In member function ‘pid_t os::Fork::instantiate(const os::Fork::Tree&) const’: ../../3rdparty/stout/include/stout/os/posix/fork.hpp:354:61: error: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘using element_type = std::remove_extent::type’ {aka ‘struct os::Fork::Tree::Memory’} with no trivial copy-assignment [-Werror=class-memaccess] memcpy(tree.memory.get(), , sizeof(Tree::Memory)); ^ ../../3rdparty/stout/include/stout/os/posix/fork.hpp:235:12: note: ‘using element_type = std::remove_extent::type’ {aka ‘struct os::Fork::Tree::Memory’} declared here struct Memory { ^~ {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513175#comment-16513175 ] James Peach commented on MESOS-5158: For CSI volumes, we can assume that the CSI plugin is enforcing quota and ignore it in the isolator. This means that if we call {{getPersistentVolumePath()}}, we have to verify that it is not a CSI volume beforehand. > Provide XFS quota support for persistent volumes. > - > > Key: MESOS-5158 > URL: https://issues.apache.org/jira/browse/MESOS-5158 > Project: Mesos > Issue Type: Improvement > Components: containerization > Reporter: Yan Xu > Assignee: James Peach > Priority: Major > > Given that the lifecycle of persistent volumes is managed outside of the > isolator, we may need to further abstract out the quota management > functionality to do it outside the XFS isolator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-5158) Provide XFS quota support for persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-5158: -- Assignee: James Peach > Provide XFS quota support for persistent volumes. > - > > Key: MESOS-5158 > URL: https://issues.apache.org/jira/browse/MESOS-5158 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Yan Xu >Assignee: James Peach >Priority: Major > > Given that the lifecycle of persistent volumes is managed outside of the > isolator, we may need to further abstract out the quota management > functionality to do it outside the XFS isolator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-5158) Provide XFS quota support for persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513015#comment-16513015 ] James Peach edited comment on MESOS-5158 at 6/14/18 9:27 PM: - Persistent volumes are managed in {{Slave::syncCheckpointedResources()}}, which will create new volumes and also delete old ones. The isolators are not notified about these changes. To support persistent volumes in the XFS isolators, we need to do a few things: # On recovery, we need to scan existing persistent volumes in order to recover the project IDs # On resources update, we need to notice any new persistent volumes and allocate a project ID for them # Periodically, we need to re-scan the persistent volumes to reclaim project IDs for volumes that have been deleted. # If we are doing active enforcement, we need to add the persistent volumes into the set of quotas that we are polling for usage. We need to consider which tasks would be killed if the volume is filled. There's no explicit way to support the {{GROW_VOLUME}} or {{SHRINK_VOLUME}} operations since we would need to know how to update the quota when that happens. The agent doesn't explicitly grow the volume, it just updates its checkpointed resources. However, updating the quota when it is attached to a task would work, since the size of shared volumes cannot be altered. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513015#comment-16513015 ] James Peach commented on MESOS-5158: Persistent volumes are managed in {{Slave::syncCheckpointedResources()}}, which will create new volumes and also delete old ones. The isolators are not notified about these changes. To support persistent volumes in the XFS isolators, we need to do a few things: # On recovery, we need to scan existing persistent volumes in order to recover the project IDs # On resources update, we need to notice any new persistent volumes and allocate a project ID for them # Periodically, we need to re-scan the persistent volumes to reclaim project IDs for volumes that have been deleted. # If we are doing active enforcement, we need to add the persistent volumes into the set of quotas that we are polling for usage. We need to consider which tasks would be killed if the volume is filled. There's no explicit way to support the {{GROW_VOLUME}} or {{SHRINK_VOLUME}} operations since we would need to know how to update the quota when that happens. The agent doesn't explicitly grow the volume, it just updates its checkpointed resources. However, updating the quota when it is attached to a task would work, since the size of shared volumes cannot be altered. > Provide XFS quota support for persistent volumes. > - > > Key: MESOS-5158 > URL: https://issues.apache.org/jira/browse/MESOS-5158 > Project: Mesos > Issue Type: Improvement > Components: containerization > Reporter: Yan Xu > Priority: Major > > Given that the lifecycle of persistent volumes is managed outside of the > isolator, we may need to further abstract out the quota management > functionality to do it outside the XFS isolator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-6823: -- Resolution: Fixed Assignee: Jie Yu Fix Version/s: 1.7.0 {noformat} commit 32d4305b87e79ed02cc686e0c29b027e31c6b3a4 Author: Jie Yu Date: Thu May 24 10:05:17 2018 -0700 Adjusted the tests that use nobody. Used `$SUDO_USER` instead because `nobody` sometimes cannot access direcotries under `$HOME` of the current user running the tests. Review: https://reviews.apache.org/r/67291 {noformat} > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > is flaky > -- > > Key: MESOS-6823 > URL: https://issues.apache.org/jira/browse/MESOS-6823 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 12/14 both with/without SSL >Reporter: Anand Mazumdar >Assignee: Jie Yu >Priority: Major > Labels: flaky, flaky-test, newbie > Fix For: 1.7.0 > > > This showed up on our internal CI > {code} > [23:13:01] : [Step 11/11] [ RUN ] > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > [23:13:01] : [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] > Creating default 'local' authorizer > [23:13:01] : [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] > Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) > started on 172.16.10.213:45407 > [23:13:01] : [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > 
--http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" > --zk_session_timeout="10secs" > [23:13:01] : [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] > Master only allowing authenticated frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] > Master only allowing authenticated agents to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] > Loading credentials for authentication from > '/mnt/teamcity/temp/buildTmp/ev3icd/credentials' > [23:13:01] : [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using > default 'crammd5' authenticator > [23:13:01] : [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [23:13:01] : [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [23:13:01] : [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [23:13:01] : [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] > Authorization enabled > [23:13:01] 
: [Step 11/11] I1219 23:13:01.654551 25733 > whitelist_watcher.cpp:77] No whitelist given > [23:13:01] : [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] > Initialized hierarchical allocator process > [23:13:01] : [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] > Elected as the leading master! > [23:13:01] : [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] > Recovering from registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] > Recovering registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] > Successfully fetched the registry (0B) in
[jira] [Commented] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477709#comment-16477709 ] James Peach commented on MESOS-6823: Suggestion ... rather than execute as {{nobody}}, use {{os::getenv("SUDO_USER")}}. > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > is flaky > -- > > Key: MESOS-6823 > URL: https://issues.apache.org/jira/browse/MESOS-6823 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 12/14 both with/without SSL >Reporter: Anand Mazumdar >Priority: Major > Labels: flaky, flaky-test, newbie > > This showed up on our internal CI > {code} > [23:13:01] : [Step 11/11] [ RUN ] > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > [23:13:01] : [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] > Creating default 'local' authorizer > [23:13:01] : [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] > Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) > started on 172.16.10.213:45407 > [23:13:01] : [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" 
--registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" > --zk_session_timeout="10secs" > [23:13:01] : [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] > Master only allowing authenticated frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] > Master only allowing authenticated agents to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] > Loading credentials for authentication from > '/mnt/teamcity/temp/buildTmp/ev3icd/credentials' > [23:13:01] : [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using > default 'crammd5' authenticator > [23:13:01] : [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [23:13:01] : [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [23:13:01] : [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [23:13:01] : [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] > Authorization enabled > [23:13:01] : [Step 11/11] I1219 23:13:01.654551 25733 > whitelist_watcher.cpp:77] No whitelist given > [23:13:01] : [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] > Initialized hierarchical allocator process > [23:13:01] : [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] > Elected as the leading master! 
> [23:13:01] : [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] > Recovering from registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] > Recovering registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] > Successfully fetched the registry (0B) in 210944ns > [23:13:01] : [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] > Applied 1 operations in 5006ns; attempting to update the registry > [23:13:01] : [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] > Successfully updated the registry in 194048ns > [23:13:01] : [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] > Successfully recovered
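The suggestion in the comment above can be sketched as a tiny helper that prefers the user recorded by sudo(8) in `SUDO_USER` over the hard-coded `nobody` account. This is a hypothetical illustration (the env value is passed in as a parameter to keep it testable); in Mesos proper, stout's `os::getenv` would be used, which returns an `Option<std::string>`.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: run the test as the user who invoked sudo(8)
// (recorded in SUDO_USER), falling back to "nobody" when that is unset.
std::string testUser(const char* sudoUser)
{
  return (sudoUser != nullptr && *sudoUser != '\0')
    ? std::string(sudoUser)
    : std::string("nobody");
}
```

At the call site this would look like `testUser(std::getenv("SUDO_USER"))`.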
[jira] [Commented] (MESOS-8897) ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky
[ https://issues.apache.org/jira/browse/MESOS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476097#comment-16476097 ] James Peach commented on MESOS-8897: | [r67116|https://reviews.apache.org/r/67116/] | Change XFS Kill Test to use ASSERT_GE. | > ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky > - > > Key: MESOS-8897 > URL: https://issues.apache.org/jira/browse/MESOS-8897 > Project: Mesos > Issue Type: Bug > Components: flaky, test >Reporter: Yan Xu >Assignee: James Peach >Priority: Major > > {noformat:title=} > [ RUN ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill > meta-data=/dev/loop0 isize=256 agcount=2, agsize=5120 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=0 > data = bsize=4096 blocks=10240, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal log bsize=4096 blocks=1200, version=2 > = sectsz=512 sunit=0 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > I0508 17:55:12.353438 13453 exec.cpp:162] Version: 1.7.0 > I0508 17:55:12.370332 13451 exec.cpp:236] Executor registered on agent > 49668ffa-2a69-4867-b31a-4972b4ac13d2-S0 > I0508 17:55:12.376093 13447 executor.cpp:178] Received SUBSCRIBED event > I0508 17:55:12.376771 13447 executor.cpp:182] Subscribed executor on > mesos.vagrant > I0508 17:55:12.377038 13447 executor.cpp:178] Received LAUNCH event > I0508 17:55:12.381901 13447 executor.cpp:665] Starting task > edb798b4-1b16-4de4-828c-0db132df70ab > I0508 17:55:12.387936 13447 executor.cpp:485] Running > '/tmp/mesos-build/mesos/build/src/mesos-containerizer launch > ' > I0508 17:55:12.392854 13447 executor.cpp:678] Forked command at 13456 > 2+0 records in > 2+0 records out > 2097152 bytes (2.1 MB) copied, 0.00404074 s, 519 MB/s > ../../src/tests/containerizer/xfs_quota_tests.cpp:618: Failure > Expected: (limit.disk().get()) > (Megabytes(1)), actual: 1MB vs 1MB > [ FAILED ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill (1182 ms) > {noformat} > 
[~jpe...@apache.org] mentioned that > {code} > 409 // If the soft limit is exceeded the container should be killed. > 410 if (quotaInfo->used > quotaInfo->softLimit) { > 411 Resource resource; > 412 resource.set_name("disk"); > 413 resource.set_type(Value::SCALAR); > 414 resource.mutable_scalar()->set_value( > 415 quotaInfo->used.bytes() / Bytes::MEGABYTES); > 416 > 417 info->limitation.set( > 418 protobuf::slave::createContainerLimitation( > 419 Resources(resource), > 420 "Disk usage (" + stringify(quotaInfo->used) + > 421 ") exceeds quota (" + > 422 stringify(quotaInfo->softLimit) + ")", > 423 TaskStatus::REASON_CONTAINER_LIMITATION_DISK)); > 424 } > 425 } > {code} > Converting to MB is rounding down, so we report less space than was actually > used. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
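The truncation described above is plain integer division. A standalone sketch of the failure mode and a round-up alternative, assuming (as in stout) that a megabyte is 2^20 bytes:

```cpp
#include <cassert>
#include <cstdint>

// Assumed constant: 1 MB = 1 << 20 bytes, matching stout's Bytes::MEGABYTES.
constexpr uint64_t MEGABYTES = 1024ULL * 1024;

// Integer division truncates: 1.99 MB of usage reports as 1 MB, which is
// why the test sees "1MB vs 1MB" when usage has in fact exceeded the limit.
uint64_t toMegabytesTruncated(uint64_t bytes)
{
  return bytes / MEGABYTES;
}

// Rounding up avoids under-reporting usage relative to the limit.
uint64_t toMegabytesRoundedUp(uint64_t bytes)
{
  return (bytes + MEGABYTES - 1) / MEGABYTES;
}
```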
[jira] [Assigned] (MESOS-8897) ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky
[ https://issues.apache.org/jira/browse/MESOS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8897: -- Assignee: James Peach > ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky > - > > Key: MESOS-8897 > URL: https://issues.apache.org/jira/browse/MESOS-8897 > Project: Mesos > Issue Type: Bug > Components: flaky, test >Reporter: Yan Xu >Assignee: James Peach >Priority: Major > > {noformat:title=} > [ RUN ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill > meta-data=/dev/loop0 isize=256 agcount=2, agsize=5120 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=0 > data = bsize=4096 blocks=10240, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal log bsize=4096 blocks=1200, version=2 > = sectsz=512 sunit=0 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > I0508 17:55:12.353438 13453 exec.cpp:162] Version: 1.7.0 > I0508 17:55:12.370332 13451 exec.cpp:236] Executor registered on agent > 49668ffa-2a69-4867-b31a-4972b4ac13d2-S0 > I0508 17:55:12.376093 13447 executor.cpp:178] Received SUBSCRIBED event > I0508 17:55:12.376771 13447 executor.cpp:182] Subscribed executor on > mesos.vagrant > I0508 17:55:12.377038 13447 executor.cpp:178] Received LAUNCH event > I0508 17:55:12.381901 13447 executor.cpp:665] Starting task > edb798b4-1b16-4de4-828c-0db132df70ab > I0508 17:55:12.387936 13447 executor.cpp:485] Running > '/tmp/mesos-build/mesos/build/src/mesos-containerizer launch > ' > I0508 17:55:12.392854 13447 executor.cpp:678] Forked command at 13456 > 2+0 records in > 2+0 records out > 2097152 bytes (2.1 MB) copied, 0.00404074 s, 519 MB/s > ../../src/tests/containerizer/xfs_quota_tests.cpp:618: Failure > Expected: (limit.disk().get()) > (Megabytes(1)), actual: 1MB vs 1MB > [ FAILED ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill (1182 ms) > {noformat} > [~jpe...@apache.org] mentioned that > {code} > 409 // If the soft limit is exceeded the container 
should be killed. > 410 if (quotaInfo->used > quotaInfo->softLimit) { > 411 Resource resource; > 412 resource.set_name("disk"); > 413 resource.set_type(Value::SCALAR); > 414 resource.mutable_scalar()->set_value( > 415 quotaInfo->used.bytes() / Bytes::MEGABYTES); > 416 > 417 info->limitation.set( > 418 protobuf::slave::createContainerLimitation( > 419 Resources(resource), > 420 "Disk usage (" + stringify(quotaInfo->used) + > 421 ") exceeds quota (" + > 422 stringify(quotaInfo->softLimit) + ")", > 423 TaskStatus::REASON_CONTAINER_LIMITATION_DISK)); > 424 } > 425 } > {code} > Converting to MB is rounding down, so we report less space than was actually > used. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8913) Resource provider leaks file descriptors into executors.
James Peach created MESOS-8913: -- Summary: Resource provider leaks file descriptors into executors. Key: MESOS-8913 URL: https://issues.apache.org/jira/browse/MESOS-8913 Project: Mesos Issue Type: Task Components: agent, security Reporter: James Peach I have an executor that closes unknown file descriptors when it starts up: {noformat} 2018/05/14 20:54:43.210293 util_linux.go:65: closing extraneous fd 126 (/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/08.log) 2018/05/14 20:54:43.210345 util_linux.go:47: unable to call fcntl() to get fd options for fd 3: errno bad file descriptor 2018/05/14 20:54:43.210385 util_linux.go:65: closing extraneous fd 321 (/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/LOG) 2018/05/14 20:54:43.210438 util_linux.go:65: closing extraneous fd 322 (/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/LOCK) 2018/05/14 20:54:43.210501 util_linux.go:65: closing extraneous fd 324 (/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/MANIFEST-06) {noformat} It is closing leveldb descriptors leaked by the resource provider. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
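The standard guard against this class of leak is to mark long-lived descriptors close-on-exec as soon as they are opened, so a forked executor cannot inherit them. A minimal sketch using raw fcntl(2) (stout provides an `os::cloexec` helper for the same purpose):

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Mark a descriptor close-on-exec so it is not inherited across
// fork()/exec() into executors. Returns true on success.
bool setCloexec(int fd)
{
  int flags = ::fcntl(fd, F_GETFD);
  if (flags == -1) {
    return false;
  }
  return ::fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}
```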
[jira] [Created] (MESOS-8907) curl fetcher fails with HTTP/2
James Peach created MESOS-8907: -- Summary: curl fetcher fails with HTTP/2 Key: MESOS-8907 URL: https://issues.apache.org/jira/browse/MESOS-8907 Project: Mesos Issue Type: Task Components: fetcher Reporter: James Peach {noformat} [ RUN ] ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 ... I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image 'quay.io/coreos/alpine-sh' from 'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to '/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn' E0510 20:52:00.756072 25003 slave.cpp:6176] Container '5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor 'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework 7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode HTTP responses: Decoding failed HTTP/2 200 server: nginx/1.13.12 date: Fri, 11 May 2018 03:52:00 GMT content-type: application/vnd.docker.distribution.manifest.v1+prettyjws content-length: 4486 docker-content-digest: sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663 x-frame-options: DENY strict-transport-security: max-age=63072000; preload ... {noformat} Note that curl is saying the HTTP version is "HTTP/2". This happens on modern curl that automatically negotiates HTTP/2, but the docker fetcher isn't prepared to parse that. {noformat} $ curl -i --raw -L -s -S -o - 'http://quay.io/coreos/alpine-sh?latest#https' HTTP/1.1 301 Moved Permanently Content-Type: text/html Date: Fri, 11 May 2018 04:07:44 GMT Location: https://quay.io/coreos/alpine-sh?latest Server: nginx/1.13.12 Content-Length: 186 Connection: keep-alive HTTP/2 301 server: nginx/1.13.12 date: Fri, 11 May 2018 04:07:45 GMT content-type: text/html; charset=utf-8 content-length: 287 location: https://quay.io/coreos/alpine-sh/?latest x-frame-options: DENY strict-transport-security: max-age=63072000; preload {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
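A parser that only recognizes `HTTP/1.x` status lines chokes on `HTTP/2 200`, where the version token has no minor component; a tolerant parser just skips whatever version token follows `HTTP/` up to the first space. A hypothetical sketch (not the actual libprocess decoder):

```cpp
#include <cassert>
#include <string>

// Version-tolerant status-line parse: accepts "HTTP/1.1 301 ...",
// "HTTP/2 200", etc. Returns the status code, or -1 on malformed input.
int parseStatusCode(const std::string& line)
{
  const std::string prefix = "HTTP/";
  if (line.compare(0, prefix.size(), prefix) != 0) {
    return -1;
  }

  // Skip the version token ("1.1", "2", "2.0", ...) up to the first space.
  size_t space = line.find(' ');
  if (space == std::string::npos) {
    return -1;
  }

  try {
    return std::stoi(line.substr(space + 1));
  } catch (...) {
    return -1;
  }
}
```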
[jira] [Commented] (MESOS-8792) Automatically create whitelisted devices.
[ https://issues.apache.org/jira/browse/MESOS-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470558#comment-16470558 ] James Peach commented on MESOS-8792: As per design doc, the way forward on this is a new {{linux/devices}} isolator. The initial implementation will share the {{\-\-allowed_devices}} configuration flag so that it will automatically work in concert with the {{cgroups/devices}} isolator. However the mechanism is general enough that we can later build on it to enable per-container devices. > Automatically create whitelisted devices. > - > > Key: MESOS-8792 > URL: https://issues.apache.org/jira/browse/MESOS-8792 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When the operator configures the {{\-\-allowed_devices}} agent flag, the > devices cgroup is configured but the task still needs to actually create the > device node. This is awkward because the task might not have enough > capabilities to {{mknod}} and even if we wanted to grant the capabilities, > the application may need to be modified to make the right system calls. > We should enhance the isolator and containerizer to automatically create > device nodes that have been whitelisted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466073#comment-16466073 ] James Peach commented on MESOS-6575: {noformat} commit 081c3114fefa18c6acd1e884e6d6583232e30d5c Author: Harold DostDate: Mon May 7 08:39:29 2018 -0700 Documented the `--xfs-kill-containers` flag. Added a description of the `--xfs-kill-containers` flag to the `disk/xfs` isolator page and listed it in the upgrade documentation. Review: https://reviews.apache.org/r/66975/ {noformat} > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > Fix For: 1.6.0 > > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently causes > a {{EDQUOT}} error on writes that causes the quota to be exceeded. Now the > isolator can track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface > the disk quota limit exceed event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8865) Suspicious enum value comparisons in scheduler Java bindings
[ https://issues.apache.org/jira/browse/MESOS-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8865: -- Assignee: Benjamin Bannier > Suspicious enum value comparisons in scheduler Java bindings > > > Key: MESOS-8865 > URL: https://issues.apache.org/jira/browse/MESOS-8865 > Project: Mesos > Issue Type: Bug > Components: java api >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier >Priority: Major > > Clang reports suspicious comparisons of enum values in the scheduler Java > bindings, > {noformat} > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:563:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::SUBSCRIBE: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:576:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::TEARDOWN: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:581:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::ACCEPT: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:601:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > 
mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::ACCEPT_INVERSE_OFFERS: > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:602:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::DECLINE_INVERSE_OFFERS: > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:603:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::SHUTDOWN: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:609:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::DECLINE: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:621:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::REVIVE: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:626:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case 
Call::KILL: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:631:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type' (aka 'const > mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch] > case Call::ACKNOWLEDGE: { > ^ > /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:642:10: > warning: comparison of two values with different enumeration types in switch > statement ('::mesos::scheduler::Call_Type' and 'const > mesos::v1::scheduler::Call::Type'
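The warnings above arise because both `Call_Type` enums are unscoped, so a switch over the v0 type with v1 case labels compiles (everything converts to `int`) yet can match the wrong branches when the numeric values diverge between versions. A hypothetical reduction (the enum values here are made up, not the real protobuf ones); the fix is to convert first so the switch and its labels share one enum type:

```cpp
#include <cassert>
#include <string>

// Two enums with identical label names but (potentially) different values,
// mimicking the v0/v1 scheduler Call types.
namespace v0 { enum Type { SUBSCRIBE = 1, TEARDOWN = 2 }; }
namespace v1 { enum Type { SUBSCRIBE = 1, TEARDOWN = 3 }; }

// Correct form: the switch and its case labels both use v0::Type. Mixing in
// v1 labels would compile but trip -Wenum-compare-switch, and here
// v1::TEARDOWN (3) would not even match v0::TEARDOWN (2).
std::string describe(v0::Type t)
{
  switch (t) {
    case v0::SUBSCRIBE: return "subscribe";
    case v0::TEARDOWN: return "teardown";
    default: return "unknown";
  }
}
```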
[jira] [Commented] (MESOS-8792) Automatically create whitelisted devices.
[ https://issues.apache.org/jira/browse/MESOS-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461446#comment-16461446 ] James Peach commented on MESOS-8792: I have some preliminary patches for this and have experimented a bit. The major conceptual problem here is that if we are creating the device nodes at the time when we construct the chroot, the process is already running in cgroups (specifically the devices cgroup). This means that the devices cgroup must allow the {{mknod}} permission; you can't just specify read+write devices. > Automatically create whitelisted devices. > - > > Key: MESOS-8792 > URL: https://issues.apache.org/jira/browse/MESOS-8792 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When the operator configures the {{\-\-allowed_devices}} agent flag, the > devices cgroup is configured but the task still needs to actually create the > device node. This is awkward because the task might not have enough > capabilities to {{mknod}} and even if we wanted to grant the capabilities, > the application may need to be modified to make the right system calls. > We should enhance the isolator and containerizer to automatically create > device nodes that have been whitelisted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
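The node creation the isolator would perform on the task's behalf boils down to mknod(2). A sketch (assumed shape, not the Mesos implementation): creating a real character device such as `/dev/fuse` (char 10:229 on Linux) needs `CAP_MKNOD` plus the `m` permission in the devices cgroup, which is exactly the constraint described above; the file type is a parameter here so the logic can be exercised unprivileged with `S_IFREG`.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

// Create a filesystem node at `path`. For device nodes, pass S_IFCHR or
// S_IFBLK plus the device's major/minor numbers; for S_IFREG the device
// numbers are ignored. Returns true on success.
bool createNode(const char* path, mode_t type, unsigned int maj, unsigned int min)
{
  return ::mknod(path, type | 0600, makedev(maj, min)) == 0;
}
```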
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459045#comment-16459045 ] James Peach commented on MESOS-6575: | [/r/66173|https://reviews.apache.org/r/66173/] | Added test for `disk/xfs` container limitation. | | [r/66001|https://reviews.apache.org/r/66001/]| Added soft limit and kill to `disk/xfs`. | > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > Fix For: 1.6.0 > > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently causes > a {{EDQUOT}} error on writes that causes the quota to be exceeded. Now the > isolator can track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface > the disk quota limit exceed event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8792) Automatically create whitelisted devices.
James Peach created MESOS-8792: -- Summary: Automatically create whitelisted devices. Key: MESOS-8792 URL: https://issues.apache.org/jira/browse/MESOS-8792 Project: Mesos Issue Type: Improvement Components: cgroups, containerization Reporter: James Peach Assignee: James Peach When the operator configures the {{\-\-allowed_devices}} agent flag, the devices cgroup is configured but the task still needs to actually create the device node. This is awkward because the task might not have enough capabilities to {{mknod}} and even if we wanted to grant the capabilities, the application may need to be modified to make the right system calls. We should enhance the isolator and containerizer to automatically create device nodes that have been whitelisted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8531) Some task status updates sent by the default executor don't contain a REASON.
[ https://issues.apache.org/jira/browse/MESOS-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430944#comment-16430944 ] James Peach commented on MESOS-8531: This refers to the status updates that are sent when the default executor tears down a task group in response to a single task failing. In slack, we discussed defining a separate reason field that would be used to make it more explicit that a particular task was killed because the group failed (in some sense). > Some task status updates sent by the default executor don't contain a REASON. > - > > Key: MESOS-8531 > URL: https://issues.apache.org/jira/browse/MESOS-8531 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.2.3, 1.3.1, 1.4.1, 1.5.0 >Reporter: Gastón Kleiman >Priority: Major > Labels: default-executor, mesosphere > > The default executor doesn't set a reason when sending {{TASK_KILLING}}, > {{TASK_KILLED}}, > and {{TASK_FAILED}} task status update. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8763) Enable -Wshadow in the build.
James Peach created MESOS-8763: -- Summary: Enable -Wshadow in the build. Key: MESOS-8763 URL: https://issues.apache.org/jira/browse/MESOS-8763 Project: Mesos Issue Type: Improvement Components: build Reporter: James Peach Shadowed variables are a source of confusion and bugs. We should enable {{-Wshadow}} and eliminate these permanently. We would need to solve the shadowing issues that we get from our 3rd party dependencies. {noformat} In file included from ../../src/common/protobuf_utils.cpp:28: In file included from ../../include/mesos/slave/isolator.hpp:27: In file included from ../../3rdparty/libprocess/include/process/dispatch.hpp:20: ../../3rdparty/libprocess/include/process/process.hpp:242:54: error: declaration shadows a field of 'process::ProcessBase' [-Werror,-Wshadow] void delegate(const std::string& name, const UPID& pid) ^ ../../3rdparty/libprocess/include/process/process.hpp:488:8: note: previous declaration is here UPID pid; ^ In file included from ../../src/common/protobuf_utils.cpp:53: In file included from ../../src/master/master.hpp:51: ../../3rdparty/libprocess/include/process/protobuf.hpp:460:12: error: declaration shadows a local variable [-Werror,-Wshadow] { Req* req = nullptr; google::protobuf::Message* m = req; (void)m; } ^ ../../3rdparty/libprocess/include/process/protobuf.hpp:457:18: note: previous declaration is here const Req& req) const ^ In file included from ../../src/common/protobuf_utils.cpp:53: In file included from ../../src/master/master.hpp:54: In file included from ../../3rdparty/libprocess/include/process/metrics/counter.hpp:19: In file included from ../../3rdparty/libprocess/include/process/metrics/metric.hpp:22: In file included from ../../3rdparty/libprocess/include/process/statistics.hpp:21: ../../3rdparty/libprocess/include/process/timeseries.hpp:106:24: error: declaration shadows a field of 'TimeSeries' [-Werror,-Wshadow] std::vector values; ^ ../../3rdparty/libprocess/include/process/timeseries.hpp:242:21: note: 
previous declaration is here std::map
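The class of bug `-Wshadow` exists to catch can be shown in a few lines: a parameter shadows a member of the same name, and only an explicit `this->` keeps the assignment from being a silent self-copy of the parameter.

```cpp
#include <cassert>

// Example of the shadowing -Wshadow flags: the constructor parameter
// `value` shadows the member `value`. Without the explicit `this->`,
// `value = value` would assign the parameter to itself and leave the
// member uninitialized.
struct Counter
{
  int value;

  explicit Counter(int value)
  {
    this->value = value;  // -Wshadow warns on the shadowed name here.
  }
};
```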
[jira] [Commented] (MESOS-8716) Freezer controller is not returned to thaw if task termination fails
[ https://issues.apache.org/jira/browse/MESOS-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408669#comment-16408669 ] James Peach commented on MESOS-8716: Here's a stack trace that is symptomatic of this problem: {noformat} 2018-03-21T04:31:49.272492+00:00 mslave1218 kernel: [3969040.584460] Call Trace: 2018-03-21T04:31:49.272494+00:00 mslave1218 kernel: [3969040.587253] [] schedule+0x39/0x90 2018-03-21T04:31:49.283684+00:00 mslave1218 kernel: [3969040.592551] [] __refrigerator+0x4d/0x140 2018-03-21T04:31:49.283689+00:00 mslave1218 kernel: [3969040.598458] [] get_signal+0x36d/0x390 2018-03-21T04:31:49.294814+00:00 mslave1218 kernel: [3969040.604103] [] do_signal+0x20/0x130 2018-03-21T04:31:49.294820+00:00 mslave1218 kernel: [3969040.609576] [] ? freezing_slow_path+0x4d/0x80 2018-03-21T04:31:49.306702+00:00 mslave1218 kernel: [3969040.615939] [] ? SyS_wait4+0xa9/0xf0 2018-03-21T04:31:49.306706+00:00 mslave1218 kernel: [3969040.621495] [] ? is_current_pgrp_orphaned+0xe0/0xe0 2018-03-21T04:31:49.319554+00:00 mslave1218 kernel: [3969040.628358] [] do_notify_resume+0x58/0x70 2018-03-21T04:31:49.319559+00:00 mslave1218 kernel: [3969040.634351] [] int_signal+0x12/0x17 {noformat} > Freezer controller is not returned to thaw if task termination fails > > > Key: MESOS-8716 > URL: https://issues.apache.org/jira/browse/MESOS-8716 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.3.2 >Reporter: Sargun Dhillon >Priority: Major > > This issue is related to https://issues.apache.org/jira/browse/MESOS-8004. A > container may fail to terminate for a variety of reasons. One common reason > in our system is when containers rely on external storage, they run fsync > before exiting (fsync on SIGTERM). This makes it so that the termination can > timeout. > > Even though Mesos has sent the requisite kill signals, the task will never > terminate because the cgroup stays frozen. 
> > The intended behaviour should be that on failure to terminate, if the pids > isolator is running, pids.max should be set to 0, to prevent further > processes from being created, the cgroup should be walked and sigkilled, and > then thawed. Once the processes finish thawing, the kill signal will be > delivered, and processed, resulting in the container finally finishing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
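The proposed escalation is a sequence of writes to cgroup control files. A sketch assuming cgroup v1 hierarchies mounted under `/sys/fs/cgroup` (the paths and `cgroup` name are illustrative): clamp `pids.max` to 0 so nothing new can spawn, SIGKILL the members, then thaw the freezer so the pending signals are actually delivered.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Write a single value to a cgroup control file; returns true on success.
bool writeControl(const std::string& path, const std::string& value)
{
  std::ofstream out(path);
  out << value;
  return static_cast<bool>(out);
}

// Illustrative sequence (requires root; `cgroup` is the container's
// cgroup name):
//   writeControl("/sys/fs/cgroup/pids/" + cgroup + "/pids.max", "0");
//   ... kill(pid, SIGKILL) for every pid listed in cgroup.procs ...
//   writeControl("/sys/fs/cgroup/freezer/" + cgroup + "/freezer.state",
//                "THAWED");
```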
[jira] [Comment Edited] (MESOS-6555) Namespace 'mnt' is not supported
[ https://issues.apache.org/jira/browse/MESOS-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388251#comment-16388251 ] James Peach edited comment on MESOS-6555 at 3/20/18 4:53 PM: - | [r/66175|https://reviews.apache.org/r/66175] | Added isolator checks for namespaces support. | was (Author: jamespeach): | [r/65932|https://reviews.apache.org/r/65932] | Added a generic mechanism to check for isolator requirements. | > Namespace 'mnt' is not supported > > > Key: MESOS-6555 > URL: https://issues.apache.org/jira/browse/MESOS-6555 > Project: Mesos > Issue Type: Bug > Components: cgroups, containerization >Affects Versions: 1.0.0, 1.2.3, 1.3.1, 1.4.1, 1.5.0 > Environment: suse11 sp3, kernal: 3.0.101-0.47.71-default #1 SMP Thu > Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux >Reporter: AndyPang >Assignee: James Peach >Priority: Minor > Fix For: 1.6.0 > > > the same code run in debain os,kernal version is '4.1.0-0' is ok; while in > sus 11 sp3 it has error. > {code:title=mesos-execute|borderStyle=solid} > ./mesos-execute --command="sleep 100" --master=:xxx --name=sleep > --docker_image=ubuntu > I1105 11:26:21.090703 194814 scheduler.cpp:172] Version: 1.0.0 > I1105 11:26:21.092821 194837 scheduler.cpp:461] New master detected at > master@:xxx > Subscribed with ID 'fdb8546d-ca11-4a51-a297-8401e53b7692-' > Submitted task 'sleep' to agent 'fdb8546d-ca11-4a51-a297-8401e53b7692-S0' > Received status update TASK_FAILED for task 'sleep' > message: 'Failed to launch container: Collect failed: Failed to setup > hostname and network files: Failed to enter the mount namespace of pid > 194976: Namespace 'mnt' is not supported > ; Executor terminated' > source: SOURCE_AGENT > reason: REASON_CONTAINER_LAUNCH_FAILED > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors
[ https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8609: -- Assignee: James Peach (was: Zhitao Li) > Create a metric to indicate how long agent takes to recover executors > - > > Key: MESOS-8609 > URL: https://issues.apache.org/jira/browse/MESOS-8609 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Zhitao Li >Assignee: James Peach >Priority: Minor > Labels: Metrics, agent > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors
[ https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8609: -- Assignee: Zhitao Li (was: James Peach) > Create a metric to indicate how long agent takes to recover executors > - > > Key: MESOS-8609 > URL: https://issues.apache.org/jira/browse/MESOS-8609 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Minor > Labels: Metrics, agent > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393177#comment-16393177 ] James Peach commented on MESOS-6575: {quote} I guess I don't understand the opposition to having the soft limit as in the current implementation the soft limit is being set, but it happens to be set to the exact amount as the hard limit. The advantage of the soft limit is that we don't have to keep track of how long has something been over the soft limit, we perform the system call which provides us a time when the grace period is over and once that occurs we can kill the application. {quote} My reasoning is that it doesn't matter how long the task has exceeded the allocated limit for. The `disk/du` isolator doesn't wait for you to be over the quota for any length of time - the task is terminated as soon as the violation is detected. It's certainly possible to set a different soft limit, but I can't see how it helps. The isolator still needs to poll on an interval and verify the used space. > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause > an {{EDQUOT}} error on writes that cause the quota to be exceeded. The > isolator can then track disk usage via {{xfs_quota}}, very much like > {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface > the quota-exceeded event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391804#comment-16391804 ] James Peach commented on MESOS-6575: > James Peach Would you be able to act as the shepherd for getting this patch > in? Yes I can shepherd. However, I don't think that setting the soft limit is the right approach. I can't see a scenario where it is actually needed. If the isolator needs to poll (and it almost certainly does), then all it needs to do is to compare the actual disk usage against the allocated disk resource. > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently causes > a {{EDQUOT}} error on writes that causes the quota to be exceeded. Now the > isolator can track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface > the disk quota limit exceed event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389066#comment-16389066 ] James Peach edited comment on MESOS-6918 at 3/7/18 6:01 AM: {quote} [~jamespeach], do you think it's feasible to target some of this work for 1.6? {quote} Yes I think it's doable. was (Author: jamespeach): > [~jamespeach], do you think it's feasible to target some of this work for 1.6? Yes I think it's doable. > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389066#comment-16389066 ] James Peach commented on MESOS-6918: > [~jamespeach], do you think it's feasible to target some of this work for 1.6? Yes I think it's doable. > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195412#comment-16195412 ] James Peach edited comment on MESOS-6918 at 3/7/18 5:55 AM: Summary from our discussion: - retain the existing {{Timer}} value that holds the duration of the last sample - capture total duration (monotonic sum) for {{Timers}} in their time series - capture total sample count for {{Timers}} in their time series - replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or something) was (Author: jamespeach): Summary from our discussion: - retain the existing {{Timer}} value that holds the duration of the last sample - capture total duration (monotonic sum) for {{Timer}}s in their time series - capture total sample count for {{Timer}}s in their time series - replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or something) > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
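Since the Mesos stats system knows metric types and semantics, an endpoint could render them directly in the Prometheus text exposition format. A minimal sketch of that rendering, assuming a simple (name, type, value) representation of a metric (not the eventual Mesos implementation, and the metric names below are just examples):

```python
def render_prometheus(metrics):
    """Render (name, kind, value) triples in the Prometheus text
    exposition format. Counters are monotonic, so the conventional
    '_total' suffix is appended; gauges are emitted as-is."""
    lines = []
    for name, kind, value in metrics:
        # Prometheus metric names use '_' where Mesos uses '/'.
        pname = name.replace("/", "_")
        if kind == "counter" and not pname.endswith("_total"):
            pname += "_total"
        lines.append("# TYPE %s %s" % (pname, kind))
        lines.append("%s %s" % (pname, float(value)))
    return "\n".join(lines) + "\n"
```

This is where knowing the semantics matters: only the stats system can say whether a value is a monotonic counter or an instantaneous gauge, which external exporter processes have to guess.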
[jira] [Assigned] (MESOS-6128) Make "re-register" vs. "reregister" consistent in the master
[ https://issues.apache.org/jira/browse/MESOS-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-6128: -- Assignee: James Peach > Make "re-register" vs. "reregister" consistent in the master > > > Key: MESOS-6128 > URL: https://issues.apache.org/jira/browse/MESOS-6128 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Neil Conway >Assignee: James Peach >Priority: Trivial > Labels: mesosphere, newbie > > Per discussion in https://reviews.apache.org/r/50705/, we sometimes use > "re-register" in comments and elsewhere we use "reregister". We should pick > one form and use it consistently. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382948#comment-16382948 ] James Peach commented on MESOS-6575: {quote} When the resource is updated in the xfs handler they are not tracked, but instead are added up. {quote} This is because the XFS isolator doesn't support path volumes so there's no need to track any paths. It might be interesting to refactor a unified way to tracking disk resource, as a prerequisite to any other XFS changes, but AFAICT that's not actually required here. {quote} The idea behind the "diff_bytes" would be that you'd take the hard limit of any given task and subtract that amount of bytes to create a soft_limit below the hard limit. {quote} Thinking about this some more, I'm not sure that we need to do anything with soft limits at all. Let's assume that we implement this for task sandboxes by applying a hard limit that is "disk_resource + some_constant_slop". We still need to have the isolator periodically check the usage in order to raise the limitation, so it doesn't really matter whether we have a soft limit. All we really need to do is check the current usage against the resource limit. > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause > an {{EDQUOT}} error on writes that cause the quota to be exceeded. The > isolator can then track disk usage via {{xfs_quota}}, very much like > {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface > the quota-exceeded event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
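The argument in this thread — no soft limit or grace period is needed, the isolator just polls and compares observed usage against the allocated resource — reduces to a simple check on each {{container_disk_watch_interval}} tick. A hypothetical sketch, with {{usage_fn}} standing in for an {{xfs_quota}} project-quota query:

```python
def check_disk_limitations(containers, usage_fn):
    """Poll-based limit check, as argued above: a container is over
    quota the moment its observed usage meets or exceeds its allocated
    disk resource, with no tracking of how long it has been over.

    `containers` maps container id -> allocated disk bytes; `usage_fn`
    returns current usage in bytes for a container id. Returns the ids
    for which a ContainerLimitation should be raised."""
    limitations = []
    for container_id, limit_bytes in containers.items():
        if usage_fn(container_id) >= limit_bytes:
            limitations.append(container_id)
    return limitations
```

This mirrors the {{disk/du}} behaviour described in the comment: the task is terminated as soon as the violation is detected, so keeping a separate soft limit adds nothing.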
[jira] [Assigned] (MESOS-8610) NsTest.SupportedNamespaces fails on CentOS7
[ https://issues.apache.org/jira/browse/MESOS-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8610: -- Assignee: James Peach Component/s: test | [r/65804|https://reviews.apache.org/r/65804] | Fixed a typo in the NsTest.SupportedNamespaces test. | > NsTest.SupportedNamespaces fails on CentOS7 > --- > > Key: MESOS-8610 > URL: https://issues.apache.org/jira/browse/MESOS-8610 > Project: Mesos > Issue Type: Bug > Components: test > Environment: CentOS 7 >Reporter: Jan Schlicht >Assignee: James Peach >Priority: Major > Labels: flaky-test > > Failed on a {{GLOG_v=1 src/mesos-tests --verbose}} run with > {noformat} > [ RUN ] NsTest.SupportedNamespaces > ../../src/tests/containerizer/ns_tests.cpp:119: Failure > Value of: (ns::supported(n)).get() > Actual: false > Expected: true > Which is: true > CLONE_NEWUSER > ../../src/tests/containerizer/ns_tests.cpp:124: Failure > Value of: (ns::supported(allNamespaces)).get() > Actual: false > Expected: true > Which is: true > CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER > [ FAILED ] NsTest.SupportedNamespaces (0 ms) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8559) Add a default disk resource flag option.
[ https://issues.apache.org/jira/browse/MESOS-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8559: -- Assignee: (was: James Peach) > Add a default disk resource flag option. > > > Key: MESOS-8559 > URL: https://issues.apache.org/jira/browse/MESOS-8559 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Minor > > Since in MESOS-8558 we are documenting the current semantics that an absent > disk resource means that the task has no disk usage restrictions, consider > adding a new agent flag that would let operators specify a default disk usage > amount for tasks that are launched without any disk resource. Alternatively, > we could validate (on the master) that tasks always have a minimum resource > profile. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8559) Add a default disk resource flag option.
[ https://issues.apache.org/jira/browse/MESOS-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8559: -- Assignee: James Peach > Add a default disk resource flag option. > > > Key: MESOS-8559 > URL: https://issues.apache.org/jira/browse/MESOS-8559 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > Since in MESOS-8558 we are documenting the current semantics that an absent > disk resource means that the task has no disk usage restrictions, consider > adding a new agent flag that would let operators specify a default disk usage > amount for tasks that are launched without any disk resource. Alternatively, > we could validate (on the master) that tasks always have a minimum resource > profile. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8585) Agent Crashes When Ask to Start Task with Unknown User
[ https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365805#comment-16365805 ] James Peach commented on MESOS-8585: Yeh, crashing in this case seems pretty unfortunate. Probably `createExecutorDirectory` should return an error and we should refactor the callers to be able to propagate that correctly. > Agent Crashes When Ask to Start Task with Unknown User > -- > > Key: MESOS-8585 > URL: https://issues.apache.org/jira/browse/MESOS-8585 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.5.0 >Reporter: Karsten >Priority: Major > Attachments: dcos-mesos-slave.service.1.gz, > dcos-mesos-slave.service.2.gz > > > The Marathon team has an integration test that tries to start a task with an > unknown user. The test expects a \{{TASK_FAILED}}. However, we see > \{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent > crashes and restarts. > > {code} > 783 2018-02-14 14:55:45: I0214 14:55:45.319974 6213 slave.cpp:2542] > Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for > framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001 > 784 2018-02-14 14:55:45: I0214 14:55:45.320605 6213 paths.cpp:727] > Creating sandbox > '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05 > 784 > a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88' > for user 'bad' > 785 2018-02-14 14:55:45: F0214 14:55:45.321131 6213 paths.cpp:735] > CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' > Failed to create executor directory '/var/lib/mesos/slave/ > 785 > slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6 > 785 66d4acc88' > 786 2018-02-14 14:55:45: *** Check failure stack trace: 
*** > 787 2018-02-14 14:55:45: @ 0x7f72033444ad > google::LogMessage::Fail() > 788 2018-02-14 14:55:45: @ 0x7f72033462dd > google::LogMessage::SendToLog() > 789 2018-02-14 14:55:45: @ 0x7f720334409c > google::LogMessage::Flush() > 790 2018-02-14 14:55:45: @ 0x7f7203346bd9 > google::LogMessageFatal::~LogMessageFatal() > 791 2018-02-14 14:55:45: @ 0x56544ca378f9 > _CheckFatal::~_CheckFatal() > 792 2018-02-14 14:55:45: @ 0x7f720270f30d > mesos::internal::slave::paths::createExecutorDirectory() > 793 2018-02-14 14:55:45: @ 0x7f720273812c > mesos::internal::slave::Framework::addExecutor() > 794 2018-02-14 14:55:45: @ 0x7f7202753e35 > mesos::internal::slave::Slave::__run() > 795 2018-02-14 14:55:45: @ 0x7f7202764292 > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4 > 795 > listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1 > 795 > 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_ > 796 2018-02-14 14:55:45: @ 0x7f72032a2b11 > process::ProcessBase::consume() > 797 2018-02-14 14:55:45: @ 0x7f72032b183c > process::ProcessManager::resume() > 798 2018-02-14 14:55:45: @ 0x7f72032b6da6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 799 2018-02-14 14:55:45: @ 0x7f72005ced73 (unknown) > 800 2018-02-14 14:55:45: @ 0x7f72000cf52c (unknown) > 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd (unknown) > 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, > code=killed, status=6/ABRT > 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed > state. > 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result > 'signal'. 
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time > over, scheduling restart. > 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel > agent. > 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel > agent... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
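The refactoring suggested in the comment — have {{createExecutorDirectory}} return an error instead of CHECK-failing — might look like this in outline. This is a hypothetical Python rendering of the control flow, not the actual C++ fix; {{chown_fn}} is an injected stand-in for the chown call that fails for an unknown user:

```python
import os

def create_executor_directory(path, user, chown_fn=None):
    """Return (path, None) on success or (None, error) on failure,
    instead of aborting the whole agent on an unknown user."""
    try:
        os.makedirs(path, exist_ok=True)
        if user is not None and chown_fn is not None:
            chown_fn(path, user)
    except (OSError, KeyError) as e:
        # Propagate the failure so the caller can fail just this task
        # (e.g. with TASK_FAILED) rather than crashing the agent.
        return None, "Failed to create executor directory '%s': %s" % (path, e)
    return path, None
```

The callers would then turn the error into a task status update, which matches the {{TASK_FAILED}} the Marathon test expects.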
[jira] [Assigned] (MESOS-8558) Document semantics of absent disk resources
[ https://issues.apache.org/jira/browse/MESOS-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8558: -- Assignee: James Peach > Document semantics of absent disk resources > --- > > Key: MESOS-8558 > URL: https://issues.apache.org/jira/browse/MESOS-8558 > Project: Mesos > Issue Type: Documentation > Components: containerization, documentation >Reporter: James Peach >Assignee: James Peach >Priority: Major > > In the Containerizer Working Group, we decided that we should simply document > the current semantics of how disk resources are enforced when schedulers > don't specify any disk resource for their tasks. We agreed that we should > simply document the current semantics where this results in a task with no > disk usage restrictions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8559) Add a default disk resource flag option.
James Peach created MESOS-8559: -- Summary: Add a default disk resource flag option. Key: MESOS-8559 URL: https://issues.apache.org/jira/browse/MESOS-8559 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach Since in MESOS-8558 we are documenting the current semantics that an absent disk resource means that the task has no disk usage restrictions, consider adding a new agent flag that would let operators specify a default disk usage amount for tasks that are launched without any disk resource. Alternatively, we could validate (on the master) that tasks always have a minimum resource profile. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8558) Document semantics of absent disk resources
James Peach created MESOS-8558: -- Summary: Document semantics of absent disk resources Key: MESOS-8558 URL: https://issues.apache.org/jira/browse/MESOS-8558 Project: Mesos Issue Type: Documentation Components: containerization, documentation Reporter: James Peach In the Containerizer Working Group, we decided that we should simply document the current semantics of how disk resources are enforced when schedulers don't specify any disk resource for their tasks. We agreed that we should simply document the current semantics where this results in a task with no disk usage restrictions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8313) Provide a host namespace container supervisor.
[ https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358625#comment-16358625 ] James Peach commented on MESOS-8313: Note, this supervisor needs to reap all its children, as per MESOS-5893. > Provide a host namespace container supervisor. > -- > > Key: MESOS-8313 > URL: https://issues.apache.org/jira/browse/MESOS-8313 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > Attachments: IMG_2629.JPG > > > After more investigation on user namespaces, the current implementation of > creating the container namespaces needs some adjustment before we can > implement user namespaces in a usable fashion. > The problems we need to address are: > 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace > to mount {{procfs}}. Currently, this prevents containers joining the host PID > namespace. The workaround is to always create a new container PID namespace > (as a child of the user namespace) with the {{namespaces/pid}} isolator. > 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network > namespace to mount {{sysfs}}. There's no general workaround for this since we > can't generally require containers to not join the host network namespace. > 3. The containerizer can't enter a user namespace after entering the > {{chroot}}. This restriction makes the existing order of containerizer > operations impossible to retain in the case where we want the executor to be > in a new user namespace that has no children (i.e. to protect the container > from a privileged task). 
> After some discussion with [~jieyu], we believe that we can solve most or all > of these issues by creating a new container supervisor that runs fully > outside the container and is responsible for constructing the rootfs mount > namespace, launching the containerizer to enter the rest of the container, > and waiting on the entered process. > Since this new supervisor process is not running in the user namespace, it > will be able to construct the container rootfs in a new mount namespace > without user namespace restrictions. We can then clone a child to fully > create and enter the container namespaces along with the prefabricated rootfs > mount namespace. > The only drawback to this approach is that the container's mount namespace > will be owned by the root user namespace rather than the container user > namespace. We are OK with this for now. > The plan here is to retain the existing {{mesos-containerizer launch}} > subcommand and add a new {{mesos-containerizer supervise}} subcommand, which > will be its parent process. This new subcommand will be used for the default > executor and custom executor code paths. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-5893) mesos-executor should adopt and reap orphan child processes
[ https://issues.apache.org/jira/browse/MESOS-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358622#comment-16358622 ] James Peach commented on MESOS-5893: The host namespace supervisor tracked in MESOS-8313 will make itself a reaper and reap all container processes. > mesos-executor should adopt and reap orphan child processes > --- > > Key: MESOS-5893 > URL: https://issues.apache.org/jira/browse/MESOS-5893 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.1.0 > Environment: mesos compiled from git master ( 1.1.0 ) > {{../configure --enable-ssl --enable-libevent --prefix=/usr --enable-optimize > --enable-silent-rules --enable-xfs-disk-isolator}} > isolators : > {{namespaces/pid,cgroups/cpu,cgroups/mem,filesystem/linux,docker/runtime,network/cni,docker/volume}} >Reporter: Stéphane Cottin >Priority: Major > Labels: containerizer > > mesos containerizer does not properly handle children death. > discovered using marathon-lb, each topology update fork another haproxy, the > old haproxy process should properly die after its last client connection is > terminated, but turn into a zombie. 
> {noformat} > 7716 ?Ssl0:00 | \_ mesos-executor > --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox > --user=root --working_directory=/marathon-lb > --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491 > 7813 ?Ss 0:00 | | \_ sh -c /marathon-lb/run sse > --marathon https://marathon:8443 --auth-credentials user:pass --group > 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050 > 7823 ?S 0:00 | | | \_ /bin/bash /marathon-lb/run sse > --marathon https://marathon:8443 --auth-credentials user:pass --group > external --ssl-certs /certs --max-serv-port-ip-per-task 20050 > 7827 ?S 0:00 | | | \_ /usr/bin/runsv > /marathon-lb/service/haproxy > 7829 ?S 0:00 | | | | \_ /bin/bash ./run > 8879 ?S 0:00 | | | | \_ sleep 0.5 > 7828 ?Sl 0:00 | | | \_ python3 > /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config > /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload > /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 > --auth-credentials user:pass --group external --max-serv-port-ip-per-task > 20050 > 7906 ?Zs 0:00 | | \_ [haproxy] > 8628 ?Zs 0:00 | | \_ [haproxy] > 8722 ?Ss 0:00 | | \_ haproxy -p /tmp/haproxy.pid -f > /marathon-lb/haproxy.cfg -D -sf 144 52 > {noformat} > update: mesos-executor should be registered as a subreaper ( > http://man7.org/linux/man-pages/man2/prctl.2.html ) and propagate signals. > code sample: https://github.com/krallin/tini/blob/master/src/tini.c -- This message was sent by Atlassian JIRA (v7.6.3#76005)
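The subreaper registration suggested in the update ({{prctl(PR_SET_CHILD_SUBREAPER)}}, as tini does) can be sketched from Python via ctypes. This is an illustrative, Linux-only sketch, not Mesos code, so the helper reports success rather than asserting it:

```python
import ctypes
import ctypes.util
import sys

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>

def become_subreaper():
    """Register the calling process as a child subreaper, so orphaned
    descendants are re-parented to it (instead of to pid 1) and can be
    reaped with waitpid(). Returns True on success, False where the
    call is unavailable (e.g. non-Linux) or fails."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or None,
                           use_errno=True)
        return libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == 0
    except (OSError, AttributeError):
        return False
```

After registering, the executor's main loop would also need a {{waitpid(-1, ...)}} loop to actually reap the adopted orphans; without it the re-parented haproxy processes above would still show up as zombies.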
[jira] [Commented] (MESOS-8547) Mount devpts with compatible defaults.
[ https://issues.apache.org/jira/browse/MESOS-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354330#comment-16354330 ] James Peach commented on MESOS-8547: Note to self - we should also set something like {{max=1024}} since otherwise the default max for devpts is 2^20, which seems unreasonably high for an untrusted container. > Mount devpts with compatible defaults. > -- > > Key: MESOS-8547 > URL: https://issues.apache.org/jira/browse/MESOS-8547 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > The Mesos containerizer mounts {{devpts}} with the following options: > {noformat} > newinstance,ptmxmode=0666 > {noformat} > Some versions of glibc (e.g. > [2.17|https://github.com/bminor/glibc/blob/glibc-2.17/sysdeps/unix/grantpt.c#L158] > from CentOS 7) are hard-coded to expect that terminal devices are owned by > the {{tty}} group, so this causes containers that allocate TTYs to expect to > have to chown the TTY (see grantpt code in glibc). > Docker uses the following {{devpts}} default: > {noformat} > Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", > "mode=0620", "gid=5"}, > {noformat} > I can think of a number of options > # hard-code the "gid=5" option > # look up the "tty" group from the host > # propagate the devpts mount options from the host > # look up the "tty" group from the container > # make it the operator's problem (i.e. add configuration) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
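Combining two of the options listed in the description — looking up the {{tty}} group on the host (option 2), with Docker's hard-coded {{gid=5}} (option 1) as a fallback — plus the {{max}} cap from this comment, might produce mount options like the following. A hypothetical sketch only; the option set is not what Mesos necessarily ships:

```python
try:
    import grp  # Unix-only group database access
except ImportError:
    grp = None

def devpts_mount_options(max_ptys=1024):
    """Build devpts mount options compatible with glibc's grantpt():
    pty slaves come up mode 0620 and owned by the 'tty' group, with an
    upper bound on instances for untrusted containers."""
    tty_gid = 5  # Docker's hard-coded default (option 1)
    if grp is not None:
        try:
            tty_gid = grp.getgrnam("tty").gr_gid  # option 2: host lookup
        except KeyError:
            pass  # no 'tty' group on this host; keep the fallback
    return "newinstance,ptmxmode=0666,mode=0620,gid=%d,max=%d" % (
        tty_gid, max_ptys)
```

The {{mode=0620,gid=<tty>}} pair is what lets glibc's {{grantpt()}} succeed without chown, and {{max}} bounds the 2^20 kernel default mentioned above.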
[jira] [Created] (MESOS-8549) Notification program for manual intervention.
James Peach created MESOS-8549: -- Summary: Notification program for manual intervention. Key: MESOS-8549 URL: https://issues.apache.org/jira/browse/MESOS-8549 Project: Mesos Issue Type: Bug Components: agent Reporter: James Peach If the Mesos agent needs manual intervention (e.g. because the resources or attributes changed), it will refuse to start. However, it's not obvious to operational systems what is happening: mostly they will just observe that the agent is down without being able to tell why. One way to address this is for the agent to execute a program when this happens. Operators could then specify a program that updates the agent state in any relevant systems, which would make it easier to take the appropriate actions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8547) Mount devpts with compatible defaults.
[ https://issues.apache.org/jira/browse/MESOS-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353190#comment-16353190 ] James Peach commented on MESOS-8547: [This LWN article|https://lwn.net/Articles/688809/] explains the background pretty well. > Mount devpts with compatible defaults. > -- > > Key: MESOS-8547 > URL: https://issues.apache.org/jira/browse/MESOS-8547 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > The Mesos containerizer mounts {{devpts}} with the following options: > {noformat} > newinstance,ptmxmode=0666 > {noformat} > Some versions of glibc (e.g. > [2.17|https://github.com/bminor/glibc/blob/glibc-2.17/sysdeps/unix/grantpt.c#L158] > from CentOS 7) are hard-coded to expect that terminal devices are owned by > the {{tty}} group, so this causes containers that allocate TTYs to expect to > have to chown the TTY (see grantpt code in glibc). > Docker uses the following {{devpts}} default: > {noformat} > Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", > "mode=0620", "gid=5"}, > {noformat} > I can think of a number of options > # hard-code the "gid=5" option > # look up the "tty" group from the host > # propagate the devpts mount options from the host > # look up the "tty" group from the container > # make it the operator's problem (i.e. add configuration) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8547) Mount devpts with compatible defaults.
James Peach created MESOS-8547: -- Summary: Mount devpts with compatible defaults. Key: MESOS-8547 URL: https://issues.apache.org/jira/browse/MESOS-8547 Project: Mesos Issue Type: Bug Components: containerization Reporter: James Peach Assignee: James Peach The Mesos containerizer mounts {{devpts}} with the following options: {noformat} newinstance,ptmxmode=0666 {noformat} Some versions of glibc (e.g. [2.17|https://github.com/bminor/glibc/blob/glibc-2.17/sysdeps/unix/grantpt.c#L158] from CentOS 7) are hard-coded to expect that terminal devices are owned by the {{tty}} group, so this causes containers that allocate TTYs to expect to have to chown the TTY (see grantpt code in glibc). Docker uses the following {{devpts}} default: {noformat} Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5"}, {noformat} I can think of a number of options # hard-code the "gid=5" option # look up the "tty" group from the host # propagate the devpts mount options from the host # look up the "tty" group from the container # make it the operator's problem (i.e. add configuration) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8313) Provide a host namespace container supervisor.
[ https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8313: --- Description: After more investigation on user namespaces, the current implementation of creating the container namespaces needs some adjustment before we can implement user namespaces in a usable fashion. The problems we need to address are: 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace to mount {{procfs}}. Currently, this prevents containers joining the host PID namespace. The workaround is to always create a new container PID namespace (as a child of the user namespace) with the {{namespaces/pid}} isolator. 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network namespace to mount {{sysfs}}. There's no general workaround for this since we can't generally require containers to not join the host network namespace. 3. The containerizer can't enter a user namespace after entering the {{chroot}}. This restriction makes the existing order of containerizer operations impossible to retain in the case where we want the executor to be in a new user namespace that has no children (i.e. to protect the container from a privileged task). After some discussion with [~jieyu], we believe that we can solve most or all of these issues by creating a new containerizer supervisor that runs fully outside the container and is responsible for constructing the rootfs mount namespace, launching the containerizer to enter the rest of the container, and waiting on the entered process. Since this new supervisor process is not running in the user namespace, it will be able to construct the container rootfs in a new mount namespace without user namespace restrictions. We can then clone a child to fully create and enter container namespaces along with the prefabricated rootfs mount namespace. 
The only drawback to this approach is that the container's mount namespace will be owned by the root user namespace rather than the container user namespace. We are OK with this for now. The plan here is to retain the existing {{mesos-containerizer launch}} subcommand and add a new {{mesos-containerizer supervise}} subcommand, which will be its parent process. This new subcommand will be used for the default executor and custom executor code paths. was: After more investigation on user namespaces, the current implementation of creating the container namespaces needs some adjustment before we can implement user namespaces in a useable fashion. The problems we need to address are: 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace to mount {{procfs}}. Currently, this prevents containers joining the host PID namespace. The workaround is to always create a new container PID namespace (as a child of the user namespace) with the {{namespaces/pid}} isolator. 2. The containerized needs to hold {{CAP_SYS_ADMIN}} over the network namespace to mount {{sysfs}}. There's no general workaround for this since we can't generally require containers to not join the host network namespace. 3. The containerizer can't enter a user namespace after entering the {{chroot}}. This restriction makes the existing order of containerizer operations impossible to remain in the case where we want the executor to be in a new user namespace that has no children (i.e. to protect the container from a privileged task). After some discussion with [~jieyu], we believe that we can some most or all of these issues by creating a new containerized supervisor that runs fully outside the container and is responsible for constructing the roots mount namespace, launching the containerized to enter the rest of the container, and waiting on the entered process. 
Since this new supervisor process is not running in the user namespace, it will be able to construct the container rootfs in a new mount namespace without user namespace restrictions. We can then clone a child to fully create and enter container namespaces along with the prefabricated rootfs mount namespace. The only drawback to this approach is that the container's mount namespace will be owned by the root user namespace rather than the container user namespace. We are OK with this for now. The plan here is to retain the existing {{mesos-containerizer launch}} subcommand and add a new {{mesos-containerizer supervise}} subcommand, which will be its parent process. This new subcommand will be used for the default executor and custom executor code paths. > Provide a host namespace container supervisor. > -- > > Key: MESOS-8313 > URL: https://issues.apache.org/jira/browse/MESOS-8313 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee:
[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking
[ https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348971#comment-16348971 ] James Peach commented on MESOS-7605: {quote} [~qianzhang] That is exactly not the point of this change. CNI already supports setting the container hostname as for all containers that have an image. The point of this isolator is to guarantee that the host's UTS namespace is protected from containers (case 1) above. I kept it explicitly out of scope for this isolator to actually set the hostname, since last time I did that, we ended up moving that feature to the CNI isolator. {quote} I believed that the CNI isolator did set up the hostname correctly when joining the host network; however, [~qianzhang] is right that the CNI isolator doesn't clone the UTS namespace unless you join a named network. So I agree with [~qianzhang] that we should make the CNI isolator clone the UTS namespace (and set the hostname) when it joins the host network and has a container image. However, we will still need the UTS isolator for the case where there is no container image or the CNI isolator isn't used. IIRC [~avinash.mesos]'s original concern about this was that the specified hostname would not be consistent with DNS. There are two things we can do about this: (1) just accept it, which is fine; (2) resolve the host's hostname and use that IP address to populate the container {{resolv.conf}}. AFAICT, Docker just does (1). 
> UCR doesn't isolate uts namespace w/ host networking > > > Key: MESOS-7605 > URL: https://issues.apache.org/jira/browse/MESOS-7605 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James DeFelice >Assignee: James Peach >Priority: Major > Labels: mesosphere > > Docker's {{run}} command supports a {{--hostname}} parameter which impacts > container isolation, even in {{host}} network mode: (via > https://docs.docker.com/engine/reference/run/) > {quote} > Even in host network mode a container has its own UTS namespace by default. > As such --hostname is allowed in host network mode and will only change the > hostname inside the container. Similar to --hostname, the --add-host, --dns, > --dns-search, and --dns-option options can be used in host network mode. > {quote} > I see no evidence that UCR offers a similar isolation capability. > Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was > initially added to support the Docker containerizer's use of the > {{--hostname}} Docker {{run}} flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking
[ https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348786#comment-16348786 ] James Peach commented on MESOS-7605: [~qianzhang] That is exactly not the point of this change. CNI already supports setting the container hostname as for all containers that have an image. The point of this isolator is to guarantee that the host's UTS namespace is protected from containers (case 1) above. I kept it explicitly out of scope for this isolator to actually set the hostname, since last time I did that, we ended up moving that feature to the CNI isolator. > UCR doesn't isolate uts namespace w/ host networking > > > Key: MESOS-7605 > URL: https://issues.apache.org/jira/browse/MESOS-7605 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James DeFelice >Assignee: James Peach >Priority: Major > Labels: mesosphere > > Docker's {{run}} command supports a {{--hostname}} parameter which impacts > container isolation, even in {{host}} network mode: (via > https://docs.docker.com/engine/reference/run/) > {quote} > Even in host network mode a container has its own UTS namespace by default. > As such --hostname is allowed in host network mode and will only change the > hostname inside the container. Similar to --hostname, the --add-host, --dns, > --dns-search, and --dns-option options can be used in host network mode. > {quote} > I see no evidence that UCR offers a similar isolation capability. > Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was > initially added to support the Docker containerizer's use of the > {{--hostname}} Docker {{run}} flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8518) Make lost agent notifications optional for frameworks.
James Peach created MESOS-8518: -- Summary: Make lost agent notifications optional for frameworks. Key: MESOS-8518 URL: https://issues.apache.org/jira/browse/MESOS-8518 Project: Mesos Issue Type: Bug Components: master Reporter: James Peach When an agent is lost, not all frameworks really care, but there can be an undesirable performance effect from suddenly sending a ton of messages all at once. Consider some mechanism for a framework to express that it doesn't care about agent states. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking
[ https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343766#comment-16343766 ] James Peach commented on MESOS-7605: [~jdef], [~qianzhang], [~avinash.mesos] Can any of you help review? > UCR doesn't isolate uts namespace w/ host networking > > > Key: MESOS-7605 > URL: https://issues.apache.org/jira/browse/MESOS-7605 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James DeFelice >Assignee: James Peach >Priority: Major > Labels: mesosphere > > Docker's {{run}} command supports a {{--hostname}} parameter which impacts > container isolation, even in {{host}} network mode: (via > https://docs.docker.com/engine/reference/run/) > {quote} > Even in host network mode a container has its own UTS namespace by default. > As such --hostname is allowed in host network mode and will only change the > hostname inside the container. Similar to --hostname, the --add-host, --dns, > --dns-search, and --dns-option options can be used in host network mode. > {quote} > I see no evidence that UCR offers a similar isolation capability. > Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was > initially added to support the Docker containerizer's use of the > {{--hostname}} Docker {{run}} flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8479) Document agent SIGUSR1 behavior.
[ https://issues.apache.org/jira/browse/MESOS-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8479: --- Summary: Document agent SIGUSR1 behavior. (was: Document agne SIGUSR1 behavior.) > Document agent SIGUSR1 behavior. > > > Key: MESOS-8479 > URL: https://issues.apache.org/jira/browse/MESOS-8479 > Project: Mesos > Issue Type: Bug > Components: agent, documentation >Reporter: James Peach >Priority: Major > > The agent enters shutdown when it receives {{SIGUSR1}}. We should document > what this means, the corresponding behavior and how operators are intended to > use this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8479) Document agne SIGUSR1 behavior.
James Peach created MESOS-8479: -- Summary: Document agne SIGUSR1 behavior. Key: MESOS-8479 URL: https://issues.apache.org/jira/browse/MESOS-8479 Project: Mesos Issue Type: Bug Components: agent, documentation Reporter: James Peach The agent enters shutdown when it receives {{SIGUSR1}}. We should document what this means, the corresponding behavior and how operators are intended to use this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7016) Make default AWAIT_* duration configurable
[ https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329715#comment-16329715 ] James Peach edited comment on MESOS-7016 at 1/22/18 4:08 PM: - | [r/65201|https://reviews.apache.org/r/65201] | Added a global DEFAULT_TEST_TIMEOUT variable. | | [r/65202|https://reviews.apache.org/r/65202] | Adopted the libprocess `DEFAULT_TEST_TIMEOUT`. | was (Author: jamespeach): | [r/65201|https://reviews.apache.org/r/65201] | Added a global DEFAULT_TEST_TIMEOUT variable. | | [*r/65202|https://reviews.apache.org/*r/65202] | Adopted the libprocess `DEFAULT_TEST_TIMEOUT`. | > Make default AWAIT_* duration configurable > -- > > Key: MESOS-7016 > URL: https://issues.apache.org/jira/browse/MESOS-7016 > Project: Mesos > Issue Type: Improvement > Components: libprocess, test >Reporter: Benjamin Bannier >Assignee: James Peach >Priority: Major > Fix For: 1.6.0 > > > libprocess defines a number of helpers {{AWAIT_*}} to wait for a > {{process::Future}} reaching terminal states. These helpers are used in tests. > Currently the default duration to wait before triggering an assertion failure > is 15s. This value was chosen as a compromise between failing fast on likely > fast developer machines, but also allowing enough time for tests to pass in > high-contention environments (e.g., overbooked CI machines). > If a machine is more overloaded than expected, {{Futures}} might take longer > to reach the desired state, and tests could fail. Ultimately we should > consider running tests with paused clock to eliminate this source of test > flakiness, see MESOS-4101, but as an intermediate measure we should make the > default timeout duration configurable. > A simple approach might be to expose a build variable allowing users to set > at configure/cmake time a desired timeout duration for the setup they are > building for. 
This would allow us to define longer timeouts in the CI build > scripts, while keeping default timeouts as short as possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329711#comment-16329711 ] James Peach edited comment on MESOS-6575 at 1/17/18 11:53 PM: -- Yeh, I think that using the soft limit is a pretty good idea. We can set the soft limit to the resources and the hard limit to resource + a fudge factor. We can kill applications based on either directly observing soft limit breaches, or the quota warnings (need to check whether XFS will reset them if the task goes back under the soft limit). We should think about how to make this behaviour configurable per-task, since I still want to support the non-destructive case for storage tasks that can manage their space. was (Author: jamespeach): Yeh, I think that using the soft limit is a pretty good idea. We can set the soft limit to the resources and the hard limit to resource + a fudge factor. We can kill applications based on either directly observing soft limit breaches, or the quota warnings (need to check whether XFS will reset them if the task goes back under the soft limit). > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently causes > a {{EDQUOT}} error on writes that causes the quota to be exceeded. Now the > isolator can track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface > the disk quota limit exceed event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329711#comment-16329711 ] James Peach commented on MESOS-6575: Yeh, I think that using the soft limit is a pretty good idea. We can set the soft limit to the resources and the hard limit to resource + a fudge factor. We can kill applications based on either directly observing soft limit breaches, or the quota warnings (need to check whether XFS will reset them if the task goes back under the soft limit). > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently causes > a {{EDQUOT}} error on writes that causes the quota to be exceeded. Now the > isolator can track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface > the disk quota limit exceed event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-6575: -- Assignee: James Peach > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > > Unlike {{disk/du}} isolator which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation, > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that, a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently causes > a {{EDQUOT}} error on writes that causes the quota to be exceeded. Now the > isolator can track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface > the disk quota limit exceed event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7016) Make default AWAIT_* duration configurable
[ https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329588#comment-16329588 ] James Peach commented on MESOS-7016: I have most of a patch that adds a global variable for the default timeout to {{libprocess}} and a Mesos test suite flag to configure it. > Make default AWAIT_* duration configurable > -- > > Key: MESOS-7016 > URL: https://issues.apache.org/jira/browse/MESOS-7016 > Project: Mesos > Issue Type: Improvement > Components: libprocess, test >Reporter: Benjamin Bannier >Assignee: James Peach >Priority: Major > > libprocess defines a number of helpers {{AWAIT_*}} to wait for a > {{process::Future}} reaching terminal states. These helpers are used in tests. > Currently the default duration to wait before triggering an assertion failure > is 15s. This value was chosen as a compromise between failing fast on likely > fast developer machines, but also allowing enough time for tests to pass in > high-contention environments (e.g., overbooked CI machines). > If a machine is more overloaded than expected, {{Futures}} might take longer > to reach the desired state, and tests could fail. Ultimately we should > consider running tests with paused clock to eliminate this source of test > flakiness, see MESOS-4101, but as an intermediate measure we should make the > default timeout duration configurable. > A simple approach might be to expose a build variable allowing users to set > at configure/cmake time a desired timeout duration for the setup they are > building for. This would allow us to define longer timeouts in the CI build > scripts, while keeping default timeouts as short as possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7016) Make default AWAIT_* duration configurable
[ https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-7016: -- Assignee: James Peach > Make default AWAIT_* duration configurable > -- > > Key: MESOS-7016 > URL: https://issues.apache.org/jira/browse/MESOS-7016 > Project: Mesos > Issue Type: Improvement > Components: libprocess, test >Reporter: Benjamin Bannier >Assignee: James Peach >Priority: Major > > libprocess defines a number of helpers {{AWAIT_*}} to wait for a > {{process::Future}} reaching terminal states. These helpers are used in tests. > Currently the default duration to wait before triggering an assertion failure > is 15s. This value was chosen as a compromise between failing fast on likely > fast developer machines, but also allowing enough time for tests to pass in > high-contention environments (e.g., overbooked CI machines). > If a machine is more overloaded than expected, {{Futures}} might take longer > to reach the desired state, and tests could fail. Ultimately we should > consider running tests with paused clock to eliminate this source of test > flakiness, see MESOS-4101, but as an intermediate measure we should make the > default timeout duration configurable. > A simple approach might be to expose a build variable allowing users to set > at configure/cmake time a desired timeout duration for the setup they are > building for. This would allow us to define longer timeouts in the CI build > scripts, while keeping default timeouts as short as possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8440) `network/ports` isolator kills legitimate tasks on recovery.
James Peach created MESOS-8440: -- Summary: `network/ports` isolator kills legitimate tasks on recovery. Key: MESOS-8440 URL: https://issues.apache.org/jira/browse/MESOS-8440 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 1.5.0 Reporter: James Peach Assignee: James Peach At recovery time, the containerizer sends all the resources *except* the ports. This means that the ports check will race against the subsequent resources update. The root cause of this is that only the executor resources are provided at recovery time, whereas at update time the isolator gets the whole container resources as calculated by {{Executor::allocatedResources()}}. {noformat} I0112 08:22:23.930830 28937 linux_launcher.cpp:300] Recovered container 80a2d9dc-0492-4af5-a131-05f1cd66d672 I0112 08:22:23.931637 28933 ports.cpp:398] recovering container executor_info { executor_id { value: "fff42f68-4aed-4ca6-a62f-71b7166bbd7a" } resources { name: "cpus" type: SCALAR scalar { value: 0.1 } allocation_info { role: "*" } } resources { name: "mem" type: SCALAR scalar { value: 32 } allocation_info { role: "*" } } command { value: "/home/jpeach/src/mesos/build/src/mesos-executor" shell: false arguments: "mesos-executor" arguments: "--launcher_dir=/home/jpeach/src/mesos/build/src" } framework_id { value: "4ad59c30-7b1e-4991-bda2-e7f9275d3693-" } name: "Command Executor (Task: fff42f68-4aed-4ca6-a62f-71b7166bbd7a) (Command: sh -c \'nc -k -l 31446\')" source: "fff42f68-4aed-4ca6-a62f-71b7166bbd7a" } container_id { value: "80a2d9dc-0492-4af5-a131-05f1cd66d672" } pid: 28955 directory: "/tmp/NetworkPortsIsolatorTest_ROOT_NC_RecoverGoodTask_eTlVKl/slaves/4ad59c30-7b1e-4991-bda2-e7f9275d3693-S0/frameworks/4ad59c30-7b1e-4991-bda2-e7f9275d3693-/executors/fff42f68-4aed-4ca6-a62f-71b7166bbd7a/runs/80a2d9dc-0492-4af5-a131-05f1cd66d672" I0112 08:22:23.932137 28933 ports.cpp:530] Updated ports to [] for container 80a2d9dc-0492-4af5-a131-05f1cd66d672 I0112 08:22:23.932982 28937 
provisioner.cpp:493] Provisioner recovery complete I0112 08:22:23.933924 28928 slave.cpp:6581] Sending reconnect request to executor 'fff42f68-4aed-4ca6-a62f-71b7166bbd7a' of framework 4ad59c30-7b1e-4991-bda2-e7f9275d3693- at executor(1)@17.228.224.108:42187 I0112 08:22:23.934587 28957 exec.cpp:282] Received reconnect request from agent 4ad59c30-7b1e-4991-bda2-e7f9275d3693-S0 I0112 08:22:23.935724 28931 slave.cpp:4426] Received re-registration message from executor 'fff42f68-4aed-4ca6-a62f-71b7166bbd7a' of framework 4ad59c30-7b1e-4991-bda2-e7f9275d3693- I0112 08:22:23.936646 28967 exec.cpp:259] Executor re-registered on agent 4ad59c30-7b1e-4991-bda2-e7f9275d3693-S0 I0112 08:22:23.936820 28929 ports.cpp:530] Updated ports to [31446-31446] for container 80a2d9dc-0492-4af5-a131-05f1cd66d672 {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8413) Zookeeper configuration passwords are shown in clear text
[ https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319625#comment-16319625 ] James Peach commented on MESOS-8413: There's a similar issue with URLs for the {{CommandInfo.URI}} message. IIRC when I looked into that, the problem was that there was no code to crack the credentials out of the URL, so it wasn't even clear that the URL credentials didn't just happen to work by accident. These passwords end up in log files. > Zookeeper configuration passwords are shown in clear text > - > > Key: MESOS-8413 > URL: https://issues.apache.org/jira/browse/MESOS-8413 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.4.1 >Reporter: Alexander Rojas >Assignee: Alexander Rojas > Labels: mesosphere, security > > No matter how one configures mesos, either by passing the ZooKeeper flags in > the command line or using a file, as follows: > {noformat} > ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master > --log_dir=/tmp/$USER/mesos/master/log > --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1 > {noformat} > {noformat} > echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > > /tmp/${USER}/mesos/zk_config.txt > ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master > --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt > {noformat} > both the logs and the results of the {{/flags}} endpoint will resolve to the > contents of the flags, i.e.: > {noformat} > I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="false" --authenticate_frameworks="false" > --authenticate_http_frameworks="false" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticators="crammd5" > --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" > 
--help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --quorum="1" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="20secs" > --registry_strict="false" --require_agent_domain="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/home/user/mesos/build/../src/webui" > --work_dir="/tmp/user/mesos/master" > --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs" > {noformat} > {noformat} > HTTP/1.1 200 OK > Content-Encoding: gzip > Content-Length: 591 > Content-Type: application/json > Date: Mon, 08 Jan 2018 15:12:53 GMT > { > "flags": { > "agent_ping_timeout": "15secs", > "agent_reregister_timeout": "10mins", > "allocation_interval": "1secs", > "allocator": "HierarchicalDRF", > "authenticate_agents": "false", > "authenticate_frameworks": "false", > "authenticate_http_frameworks": "false", > "authenticate_http_readonly": "false", > "authenticate_http_readwrite": "false", > "authenticators": "crammd5", > "authorizers": "local", > "filter_gpu_resources": "true", > "framework_sorter": "drf", > "help": "false", > "hostname_lookup": "true", > "http_authenticators": "basic", > "initialize_driver_logging": "true", > "log_auto_initialize": "true", > "log_dir": "/tmp/user/mesos/master/log", > "logbufsecs": "0", > "logging_level": "INFO", > "max_agent_ping_timeouts": "5", > "max_completed_frameworks": "50", > "max_completed_tasks_per_framework": "1000", > "max_unreachable_tasks_per_framework": 
"1000", > "port": "5050", > "quiet": "false", > "quorum": "1", > "recovery_agent_removal_limit": "100%", > "registry": "replicated_log", > "registry_fetch_timeout": "1mins", > "registry_gc_interval": "15mins", > "registry_max_agent_age": "2weeks", > "registry_max_agent_count": "102400", > "registry_store_timeout": "20secs", > "registry_strict": "false", > "require_agent_domain": "false", > "root_submissions": "true", > "user_sorter": "drf", >
[jira] [Commented] (MESOS-8348) Enable function sections in the build.
[ https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318864#comment-16318864 ] James Peach commented on MESOS-8348: No apparent performance difference with a quick and arbitrary benchmark. *Without GC unused sections:* {noformat} [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 Starting reregistration for all agents Reregistered 2000 agents with a total of 10 running tasks and 10 completed tasks in 28.812622779secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (60329 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 Starting reregistration for all agents Reregistered 2000 agents with a total of 20 running tasks and 0 completed tasks in 39.378296252secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (98509 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 Starting reregistration for all agents Reregistered 2 agents with a total of 10 running tasks and 0 completed tasks in 45.240454686secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (80371 ms) [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (239209 ms total) {noformat} *With GC unused sections:* {noformat} [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 Starting reregistration for all agents Reregistered 2000 agents with a total of 10 running tasks and 10 completed tasks in 28.751620417secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (59282 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 Starting 
reregistration for all agents Reregistered 2000 agents with a total of 20 running tasks and 0 completed tasks in 40.010202034secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (96938 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 Starting reregistration for all agents Reregistered 2 agents with a total of 10 running tasks and 0 completed tasks in 44.541095336secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (79331 ms) [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (235551 ms total) {noformat} > Enable function sections in the build. > -- > > Key: MESOS-8348 > URL: https://issues.apache.org/jira/browse/MESOS-8348 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: James Peach >Assignee: James Peach > > Enable {{-ffunction-sections}} to improve the ability of the toolchain to > remove unused code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.
James Peach created MESOS-8410: -- Summary: Reconfiguration policy fails to handle mount disk resources. Key: MESOS-8410 URL: https://issues.apache.org/jira/browse/MESOS-8410 Project: Mesos Issue Type: Bug Reporter: James Peach We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos agents that had mount disk resources configured, and it looks like the agent confused the size of the mount disk with the size of the work directory resource: {noformat} E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to perform recovery: Configuration change not permitted under 'additive' policy: Value of scalar resource 'disk' decreased from 183 to 868000 {noformat} The {{--resources}} flag is {noformat} --resources="[ { "name": "disk", "type": "SCALAR", "scalar": { "value": 868000 } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/a" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/b" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/c" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/d" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/e" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/f" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { "type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/g" } } } } , { "name": "disk", "type": "SCALAR", "scalar": { "value": 183 }, "disk": { "source": { 
"type": "MOUNT", "mount": { "root" : "/srv/mesos/volumes/h" } } } } ] {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8404) Improve image puller error messages.
[ https://issues.apache.org/jira/browse/MESOS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8404: --- Description: Saw this error message from the local docker puller: {noformat} Failed to launch container: Failed to read manifest: Failed to open file: No such file or directory. {noformat} Two problems with this # The error message from {{os::read}} is too verbose # The error message from the puller doesn't tell it what it failed to read was: Saw this error message from the local docker puller: {noformat} Failed to launch container: Failed to read manifest: Failed to open file: No such file or directory. {noformat} Two problems with this # The error message from {os::read}} is too verbose # The error message from the puller doesn't tell it what it failed to read > Improve image puller error messages. > > > Key: MESOS-8404 > URL: https://issues.apache.org/jira/browse/MESOS-8404 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > Saw this error message from the local docker puller: > {noformat} > Failed to launch container: Failed to read manifest: Failed to open file: No > such file or directory. > {noformat} > Two problems with this > # The error message from {{os::read}} is too verbose > # The error message from the puller doesn't tell it what it failed to read -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8405) Update master task loss handling.
James Peach created MESOS-8405: -- Summary: Update master task loss handling. Key: MESOS-8405 URL: https://issues.apache.org/jira/browse/MESOS-8405 Project: Mesos Issue Type: Bug Reporter: James Peach From [~agentvindo.dev] in [r/64940|https://reviews.apache.org/r/64940/]: {quote} Ideally, we want terminal but unacknowledged tasks to still be marked unreachable in some way, either via task state being TASK_UNREACHABLE or task being present in unreachableTasks. This allows, for example, the WebUI to not show sandbox links for unreachable tasks irrespective of whether they were terminal or not before going unreachable. But doing this is tricky for various reasons: --> updateTask() doesn't allow a terminal state to be transitioned to TASK_UNREACHABLE. Right now when we call updateTask for a terminal task, it adds TASK_UNREACHABLE status to Task.statuses and also sends it to operator API stream subscribers which looks incorrect. The fact that updateTask internally deals with already terminal tasks is a bad design decision in retrospect. I think the callers shouldn't call it for terminal tasks instead. --> It's not clear to our users what a completed task means. The intention was for this to hold a cache of terminal and acknowledged tasks for storing recent history. The users of the WebUI probably equate "Completed Tasks" to terminal tasks irrespective of their acknowledgement status, which is why it is confusing for them to see terminal but unacknowledged tasks in the "Active tasks" section in the WebUI. --> When a framework reconciles the state of a task on an unreachable agent, master replies with TASK_UNREACHABLE irrespective of whether the task was in a non-terminal state or terminal but un-acknowledged state or terminal and acknowledged state when the agent went unreachable. I think the direction we want to go towards is --> Completed tasks should consist of terminal unacknowledged and terminal acknowledged tasks, likely in two different data structures. 
--> Unreachable tasks should consist of all non-complete tasks on an unreachable agent. All the tasks in this map should be in TASK_UNREACHABLE state. {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
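The direction quoted above — completed tasks split into terminal-unacknowledged and terminal-acknowledged structures, with unreachable covering only non-complete tasks — can be sketched as follows. All names here are invented for illustration and do not match the Mesos master internals:

```python
# Terminal task states as defined by Mesos.
TERMINAL = {'TASK_FINISHED', 'TASK_FAILED', 'TASK_KILLED', 'TASK_ERROR'}

class TaskBook:
    def __init__(self):
        self.active = {}                   # task_id -> state
        self.terminal_unacknowledged = {}  # terminal, status ack pending
        self.terminal_acknowledged = {}    # terminal and acknowledged
        self.unreachable = set()           # forced to TASK_UNREACHABLE

    def completed(self):
        # "Completed" spans both terminal maps, kept as two structures.
        return {**self.terminal_unacknowledged, **self.terminal_acknowledged}

    def mark_terminal(self, task_id, state):
        assert state in TERMINAL
        self.active.pop(task_id, None)
        self.terminal_unacknowledged[task_id] = state

    def acknowledge(self, task_id):
        self.terminal_acknowledged[task_id] = \
            self.terminal_unacknowledged.pop(task_id)

    def agent_unreachable(self, task_ids):
        # Only non-complete tasks transition to TASK_UNREACHABLE;
        # terminal tasks keep their state rather than being rewritten.
        for task_id in task_ids:
            if task_id in self.active:
                del self.active[task_id]
                self.unreachable.add(task_id)

book = TaskBook()
book.active = {'t1': 'TASK_RUNNING'}
book.mark_terminal('t2', 'TASK_FAILED')
book.agent_unreachable(['t1', 't2'])
print(sorted(book.unreachable))  # -> ['t1']; t2 stays terminal
```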
[jira] [Assigned] (MESOS-8404) Improve image puller error messages.
[ https://issues.apache.org/jira/browse/MESOS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8404: -- Assignee: James Peach > Improve image puller error messages. > > > Key: MESOS-8404 > URL: https://issues.apache.org/jira/browse/MESOS-8404 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > Saw this error message from the local docker puller: > {noformat} > Failed to launch container: Failed to read manifest: Failed to open file: No > such file or directory. > {noformat} > Two problems with this > # The error message from {{os::read}} is too verbose > # The error message from the puller doesn't tell it what it failed to read -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8404) Improve image puller error messages.
James Peach created MESOS-8404: -- Summary: Improve image puller error messages. Key: MESOS-8404 URL: https://issues.apache.org/jira/browse/MESOS-8404 Project: Mesos Issue Type: Bug Components: agent Reporter: James Peach Priority: Minor Saw this error message from the local docker puller: {noformat} Failed to launch container: Failed to read manifest: Failed to open file: No such file or directory. {noformat} Two problems with this # The error message from {{os::read}} is too verbose # The error message from the puller doesn't tell it what it failed to read -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-8332) Narrow the container sandbox permissions.
[ https://issues.apache.org/jira/browse/MESOS-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312210#comment-16312210 ] James Peach edited comment on MESOS-8332 at 1/4/18 11:42 PM: - The Mesos {{user@}} list was notified of this change in [this thread| https://lists.apache.org/thread.html/3a3f932e946e7b4a603e9fcd7eb218a43b5885cd1d83ffd4ca310fe9@%3Cuser.mesos.apache.org%3E]. was (Author: jamespeach): The Mesos {{user@}} list was notified of this change in [this thread| https://lists.apache.org/thread.html/3a3f932e946e7b4a603e9fcd7eb218a43b5885cd1d83ffd4ca310fe9@%3Cuser.mesos.apache.org%3E] > Narrow the container sandbox permissions. > - > > Key: MESOS-8332 > URL: https://issues.apache.org/jira/browse/MESOS-8332 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > Sandboxes are currently created with 0755 permissions, which allows anyone > with local machine access to inspect their contents. We should make them 0750 > to limit access to the owning user and group. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8332) Narrow the container sandbox permissions.
[ https://issues.apache.org/jira/browse/MESOS-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312210#comment-16312210 ] James Peach commented on MESOS-8332: The Mesos {{user@}} list was notified of this change in [this thread| https://lists.apache.org/thread.html/3a3f932e946e7b4a603e9fcd7eb218a43b5885cd1d83ffd4ca310fe9@%3Cuser.mesos.apache.org%3E] > Narrow the container sandbox permissions. > - > > Key: MESOS-8332 > URL: https://issues.apache.org/jira/browse/MESOS-8332 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > Sandboxes are currently created with 0755 permissions, which allows anyone > with local machine access to inspect their contents. We should make them 0750 > to limit access to the owning user and group. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
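The 0755-to-0750 change described above can be illustrated with a small sketch. This is not the Mesos code path (sandbox creation lives in the agent's C++ code); the helper below is a hypothetical Python analogue showing the intended permission bits:

```python
import os
import stat
import tempfile

def make_sandbox(parent, name):
    """Create a sandbox directory with 0750 permissions.

    Illustrative sketch of the proposed tightening, not the Mesos code.
    """
    path = os.path.join(parent, name)
    os.mkdir(path, 0o750)
    # mkdir's mode argument is filtered through the umask, so set the
    # final permissions explicitly to guarantee no access for "other".
    os.chmod(path, 0o750)
    return path

with tempfile.TemporaryDirectory() as tmp:
    sandbox = make_sandbox(tmp, 'sandbox')
    mode = stat.S_IMODE(os.stat(sandbox).st_mode)
    print(oct(mode))  # -> 0o750
```

With 0750, the owning user retains full access, the group can traverse and read, and all other local users are shut out.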
[jira] [Commented] (MESOS-8368) Improve HTTP parser to support HTTP/2 messages.
[ https://issues.apache.org/jira/browse/MESOS-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308938#comment-16308938 ] James Peach commented on MESOS-8368: Probably we should implement [SSL_CTX_set_next_protos_advertised_cb |https://www.openssl.org/docs/man1.1.0/ssl/SSL_set_alpn_protos.html] and only advertise {{http/1.1}}. This ought to prevent HTTP/2 negotiation, though it seems pretty aggressive of curl to try HTTP/2 without an explicit negotiation. > Improve HTTP parser to support HTTP/2 messages. > --- > > Key: MESOS-8368 > URL: https://issues.apache.org/jira/browse/MESOS-8368 > Project: Mesos > Issue Type: Improvement >Reporter: Armand Grillet > > We currently use [http-parser|https://github.com/nodejs/http-parser] to parse > HTTP messages. This parser does not work with HTTP/2 requests and responses, > which is an issue as curl enables HTTP/2 by default for HTTPS connections > since its version 7.47. > The issue has been discovered in some of our tests (e.g. > ProvisionerDockerTest) where it crashes with the message {{Failed to decode > HTTP responses: Decoding failed}}. See > [MESOS-8335|https://issues.apache.org/jira/browse/MESOS-8335] for more > details. > Possible long-term solutions: > * Upgrade the parser to be compatible with HTTP/2 messages. > [http-parser|https://github.com/nodejs/http-parser] has not been updated > regularly this past year in favor of > [nghttp2|https://github.com/nghttp2/nghttp2] which has a much broader scope. > [There is no equivalent of http-parser for HTTP/2 > yet|https://users.rust-lang.org/t/is-there-anything-similar-to-http-parser-but-for-http2/10721]. > * Test which version of curl is used at startup and report an error if the > version is >= 7.47 and the flag {{--http1.0}} is not used in curl (more > details regarding this flag are available > [here|https://curl.haxx.se/docs/manpage.html]). 
> In the meantime, we are upgrading our testing machines using a recent version > of curl to run with the flag {{--http1.0}} > ([MESOS-8335|https://issues.apache.org/jira/browse/MESOS-8335]). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
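The "advertise only {{http/1.1}}" idea has a direct analogue in Python's standard {{ssl}} module, which may make the mechanism clearer. This is a sketch of the concept, not the Mesos/libprocess change (which would go through OpenSSL's C API):

```python
import ssl

# A TLS server context that advertises only http/1.1 via ALPN. A client
# that supports HTTP/2 (e.g. curl >= 7.47) will then negotiate down to
# HTTP/1.1 instead of sending HTTP/2 frames the parser cannot decode.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.set_alpn_protocols(['http/1.1'])  # deliberately omit 'h2'
print('ALPN restricted to http/1.1')
```

With no ALPN agreement on {{h2}}, a well-behaved client falls back rather than speaking HTTP/2 unilaterally.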
[jira] [Comment Edited] (MESOS-8366) Replace the command executor with the default executor.
[ https://issues.apache.org/jira/browse/MESOS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305959#comment-16305959 ] James Peach edited comment on MESOS-8366 at 12/29/17 4:51 AM: -- Issues that I have found so far: # Tests that restart the agent are now required to specify a fixed {{slaveId}} # Tests that inspect the task sandbox need to now find the nested container sandbox # Tests are likely to require additional expectations (since both the executor and task containers might trigger them) # The IO Switchboard doesn't work in local mode, which breaks command checks. # Tests that depend on manipulating or intercepting protobuf messages from the executor (e.g. {{MasterTest.AgentRestartNoReregister}}) I fixed the `FetcherCacheTest` suite, leaving the following non-root test failures: {noformat} [==] 310 tests from 130 test cases ran. (367254 ms total) [ PASSED ] 292 tests. [ FAILED ] 18 tests, listed below: [ FAILED ] CommandExecutorCheckTest.CommandCheckTimeout [ FAILED ] ContainerLoggerTest.DefaultToSandbox [ FAILED ] FetcherCacheHttpTest.HttpCachedConcurrent [ FAILED ] FetcherTest.Unzip_ExtractFile [ FAILED ] HealthCheckTest.HealthyTask [ FAILED ] HealthCheckTest.CheckCommandTimeout [ FAILED ] MasterTest.AgentRestartNoReregister [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.Reboot, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RegisterDisconnectedSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.MultipleFrameworks, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.ShutdownUnregisteredExecutor [ FAILED ] SlaveTest.GetExecutorInfoForTaskWithContainer [ FAILED ] ContentType/AgentAPITest.GetState/1, 
where GetParam() = application/json [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionUnauthorized/1, where GetParam() = application/json [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/2, where GetParam() = (1, 0) [ FAILED ] DiskResource/PersistentVolumeTest.DestroyPersistentVolumeMultipleTasks/0, where GetParam() = (0, 0) {noformat} was (Author: jamespeach): Issues that I have found so far: # Tests that restart the agent are now required to specify a fixed {{slaveId}} # Tests that inspect the task sandbox need to now find the nested container sandbox # Tests are likely to require additional expectations (since both the executor and task containers might trigger them) # The IO Switchboard doesn't work in local mode, which breaks command checks. I fixed the `FetcherCacheTest` suite, leaving the following non-root test failures: {noformat} [==] 310 tests from 130 test cases ran. (367254 ms total) [ PASSED ] 292 tests. [ FAILED ] 18 tests, listed below: [ FAILED ] CommandExecutorCheckTest.CommandCheckTimeout [ FAILED ] ContainerLoggerTest.DefaultToSandbox [ FAILED ] FetcherCacheHttpTest.HttpCachedConcurrent [ FAILED ] FetcherTest.Unzip_ExtractFile [ FAILED ] HealthCheckTest.HealthyTask [ FAILED ] HealthCheckTest.CheckCommandTimeout [ FAILED ] MasterTest.AgentRestartNoReregister [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.Reboot, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RegisterDisconnectedSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.MultipleFrameworks, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.ShutdownUnregisteredExecutor [ FAILED ] SlaveTest.GetExecutorInfoForTaskWithContainer 
[ FAILED ] ContentType/AgentAPITest.GetState/1, where GetParam() = application/json [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionUnauthorized/1, where GetParam() = application/json [ FAILED ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/2, where GetParam() = (1, 0) [ FAILED ] DiskResource/PersistentVolumeTest.DestroyPersistentVolumeMultipleTasks/0, where GetParam() = (0, 0) {noformat} > Replace the command executor with the default executor. > --- > > Key: MESOS-8366 > URL: https://issues.apache.org/jira/browse/MESOS-8366 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Reporter: James Peach >Assignee:
[jira] [Commented] (MESOS-8366) Replace the command executor with the default executor.
[ https://issues.apache.org/jira/browse/MESOS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305959#comment-16305959 ] James Peach commented on MESOS-8366: Issues that I have found so far: # Tests that restart the agent are now required to specify a fixed {{slaveId}} # Tests that inspect the task sandbox need to now find the nested container sandbox # Tests are likely to require additional expectations (since both the executor and task containers might trigger them) # The IO Switchboard doesn't work in local mode, which breaks command checks. I fixed the `FetcherCacheTest` suite, leaving the following non-root test failures: {noformat} [==] 310 tests from 130 test cases ran. (367254 ms total) [ PASSED ] 292 tests. [ FAILED ] 18 tests, listed below: [ FAILED ] CommandExecutorCheckTest.CommandCheckTimeout [ FAILED ] ContainerLoggerTest.DefaultToSandbox [ FAILED ] FetcherCacheHttpTest.HttpCachedConcurrent [ FAILED ] FetcherTest.Unzip_ExtractFile [ FAILED ] HealthCheckTest.HealthyTask [ FAILED ] HealthCheckTest.CheckCommandTimeout [ FAILED ] MasterTest.AgentRestartNoReregister [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.Reboot, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RegisterDisconnectedSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.MultipleFrameworks, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.ShutdownUnregisteredExecutor [ FAILED ] SlaveTest.GetExecutorInfoForTaskWithContainer [ FAILED ] ContentType/AgentAPITest.GetState/1, where GetParam() = application/json [ FAILED ] ContentType/AgentAPITest.LaunchNestedContainerSessionUnauthorized/1, where GetParam() = application/json [ FAILED ] 
DiskResource/PersistentVolumeTest.AccessPersistentVolume/2, where GetParam() = (1, 0) [ FAILED ] DiskResource/PersistentVolumeTest.DestroyPersistentVolumeMultipleTasks/0, where GetParam() = (0, 0) {noformat} > Replace the command executor with the default executor. > --- > > Key: MESOS-8366 > URL: https://issues.apache.org/jira/browse/MESOS-8366 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Reporter: James Peach >Assignee: James Peach > > We should use the default executor for all the cases that currently invoke > the command executor. This is a straightforward matter of implementing > `LaunchTask` in the default executor, and then fixing all the test > assumptions that this change will break. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8366) Replace the command executor with the default executor.
James Peach created MESOS-8366: -- Summary: Replace the command executor with the default executor. Key: MESOS-8366 URL: https://issues.apache.org/jira/browse/MESOS-8366 Project: Mesos Issue Type: Bug Components: agent, executor Reporter: James Peach Assignee: James Peach We should use the default executor for all the cases that currently invoke the command executor. This is a straightforward matter of implementing `LaunchTask` in the default executor, and then fixing all the test assumptions that this change will break. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8337) Invalid state transition attempted when agent is lost.
[ https://issues.apache.org/jira/browse/MESOS-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302898#comment-16302898 ] James Peach commented on MESOS-8337: [~jieyu] This is a blocker for 1.5. I have a wacky patch that needs some cleanup and analysis before I can post it. > Invalid state transition attempted when agent is lost. > -- > > Key: MESOS-8337 > URL: https://issues.apache.org/jira/browse/MESOS-8337 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: James Peach > > The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to > {{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that > was added in the same patch. > {noformat} > I1214 23:42:16.507931 22396 master.cpp:10155] Removing task > mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; > cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: > *):[31068-31068, 31069-31069, 31072-31072] of framework > afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent > daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 > (magent0006.xxx.com) > F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == > TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
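The {{CHECK}} fires because an already-terminal task ({{TASK_FAILED}}) is transitioned again when the lost agent is removed. One shape a fix could take is a guard in the caller that skips terminal tasks entirely — sketched here with invented names, purely to illustrate the invariant the {{CHECK}} enforces:

```python
# Terminal states must never be transitioned again; doing so is what
# trips the CHECK quoted above (TASK_FAILED -> lost/unreachable).
TERMINAL_STATES = {'TASK_FINISHED', 'TASK_FAILED', 'TASK_KILLED',
                   'TASK_ERROR', 'TASK_LOST'}

def transition_for_lost_agent(state):
    """Return the new state for a task on a removed agent, or None if
    the task is already terminal and must be left alone.

    Hypothetical helper, not the actual master code.
    """
    if state in TERMINAL_STATES:
        return None
    return 'TASK_LOST'

print(transition_for_lost_agent('TASK_FAILED'))   # -> None
print(transition_for_lost_agent('TASK_RUNNING'))  # -> TASK_LOST
```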
[jira] [Commented] (MESOS-7643) The order of isolators provided in '--isolation' flag is not preserved and instead sorted alphabetically
[ https://issues.apache.org/jira/browse/MESOS-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302897#comment-16302897 ] James Peach commented on MESOS-7643: [~jieyu] RFC review here https://reviews.apache.org/r/62472/ > The order of isolators provided in '--isolation' flag is not preserved and > instead sorted alphabetically > > > Key: MESOS-7643 > URL: https://issues.apache.org/jira/browse/MESOS-7643 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.1.2, 1.2.0, 1.3.0 >Reporter: Michael Cherny >Assignee: James Peach >Priority: Critical > Labels: isolation > > According to documentation and comments in code, the order of the entries in > the --isolation flag should specify the ordering of the isolators. > Specifically, the `create` and `prepare` calls for each isolator should run > serially in the order in which they appear in the --isolation flag, while the > `cleanup` call should be serialized in reverse order (with the exception of the > filesystem isolator, which is always first). > But in fact, the isolators provided in the '--isolation' flag are sorted > alphabetically. > That happens in [this line of > code|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L377]. > This line uses a 'set' (apparently instead of a list or > vector), and 'set' is a sorted container. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
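The sorted-container bug is easy to demonstrate. The C++ code stores the names in a {{std::set}}, which iterates in sorted order; the Python sketch below mimics that with {{sorted(set(...))}} and contrasts it with an order-preserving dedup. The isolator names are just an example flag value:

```python
# What the user wrote in --isolation:
flags = 'filesystem/linux,cgroups/cpu,docker/runtime'.split(',')

# Mimics iterating a C++ std::set: duplicates removed, order sorted.
as_set = sorted(set(flags))

# Order-preserving dedup (dict keys keep insertion order in Python 3.7+),
# which matches what the documentation promises for isolator ordering.
ordered = list(dict.fromkeys(flags))

print(as_set)   # -> ['cgroups/cpu', 'docker/runtime', 'filesystem/linux']
print(ordered)  # -> ['filesystem/linux', 'cgroups/cpu', 'docker/runtime']
```

The fix is essentially the second form: keep the flag's order while still rejecting duplicates.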
[jira] [Comment Edited] (MESOS-8348) Enable function sections in the build.
[ https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297792#comment-16297792 ] James Peach edited comment on MESOS-8348 at 12/20/17 2:18 AM: -- Tested on a 4CPU/8G VM, building without cache, {{GTEST_FILTER="" time make -j2 check}}. Without any settings: {noformat} 11517.45user 1028.58system 1:51:31elapsed 187%CPU (0avgtext+0avgdata 4823956maxresident)k 8710392inputs+83178080outputs (10126major+275942791minor)pagefaults 0swaps {noformat} With CXXFLAGS={{\-ffunction-sections \-fdata-sections}} and LDFLAGS={{\-Wl,\--gc-sections}}: {noformat} 9962.13user 893.62system 1:35:17elapsed 189%CPU (0avgtext+0avgdata 3923732maxresident)k 1994920inputs+38351264outputs (3577major+239138696minor)pagefaults 0swaps {noformat} The build time is improved, and the final linked objects are significantly smaller: || Artifact || Normal || GC sections || | src/.libs/libmesos-1.5.0.so| 766M | 274M| | src/mesos-agent| 6.5M | 1.6M| | src/mesos-cni-port-mapper | 1.8M | 65K| | src/mesos-containerizer| 2.7M | 477K| | src/mesos-default-executor | 13M | 4.6M| | src/mesos-docker-executor | 9.6M | 3.6M| | src/mesos-execute | 7.5M | 2.6M| | src/mesos-executor | 7.5M | 2.6M| | src/mesos-fetcher | 6.1M | 1.9M| | src/mesos-io-switchboard | 3.7M | 874K| | src/mesos-local| 4.8M | 1.4M| | src/mesos-log | 1.8M | 348K| | src/mesos-logrotate-logger | 4.7M | 1.6M| | src/mesos-master | 6.3M | 1.6M| | src/mesos-network-helper | 4.2M | 1.2M| | src/mesos-resolve | 2.7M | 642K| | src/mesos-tcp-connect | 2.3M | 630K| | src/mesos-tests| 557M | 89M| | src/mesos-usage| 3.0M | 955K| was (Author: jamespeach): Tested on a 4CPU/8G VM, building without cache, {{GTEST_FILTER="" time make -j2 check}}. 
Without any settings: {noformat} 11517.45user 1028.58system 1:51:31elapsed 187%CPU (0avgtext+0avgdata 4823956maxresident)k 8710392inputs+83178080outputs (10126major+275942791minor)pagefaults 0swaps {noformat} With CXXFLAGS={{-ffunction-sections -fdata-sections}} and LDFLAGS={{-Wl,--gc-sections}}: {noformat} 9962.13user 893.62system 1:35:17elapsed 189%CPU (0avgtext+0avgdata 3923732maxresident)k 1994920inputs+38351264outputs (3577major+239138696minor)pagefaults 0swaps {noformat} The build time is improved, and the final linked objects are significantly smaller: || Artifact || Normal || GC sections || | src/.libs/libmesos-1.5.0.so| 766M | 274M| | src/mesos-agent| 6.5M | 1.6M| | src/mesos-cni-port-mapper | 1.8M | 65K| | src/mesos-containerizer| 2.7M | 477K| | src/mesos-default-executor | 13M | 4.6M| | src/mesos-docker-executor | 9.6M | 3.6M| | src/mesos-execute | 7.5M | 2.6M| | src/mesos-executor | 7.5M | 2.6M| | src/mesos-fetcher | 6.1M | 1.9M| | src/mesos-io-switchboard | 3.7M | 874K| | src/mesos-local| 4.8M | 1.4M| | src/mesos-log | 1.8M | 348K| | src/mesos-logrotate-logger | 4.7M | 1.6M| | src/mesos-master | 6.3M | 1.6M| | src/mesos-network-helper | 4.2M | 1.2M| | src/mesos-resolve | 2.7M | 642K| | src/mesos-tcp-connect | 2.3M | 630K| | src/mesos-tests| 557M | 89M| | src/mesos-usage| 3.0M | 955K| > Enable function sections in the build. > -- > > Key: MESOS-8348 > URL: https://issues.apache.org/jira/browse/MESOS-8348 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: James Peach >Assignee: James Peach > > Enable {{-ffunction-sections}} to improve the ability of the toolchain to > remove unused code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8348) Enable function sections in the build.
[ https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297792#comment-16297792 ] James Peach commented on MESOS-8348: Tested on a 4CPU/8G VM, building without cache, {{GTEST_FILTER="" time make -j2 check}}. Without any settings: {noformat} 11517.45user 1028.58system 1:51:31elapsed 187%CPU (0avgtext+0avgdata 4823956maxresident)k 8710392inputs+83178080outputs (10126major+275942791minor)pagefaults 0swaps {noformat} With CXXFLAGS={{-ffunction-sections -fdata-sections}} and LDFLAGS={{-Wl,--gc-sections}}: {noformat} 9962.13user 893.62system 1:35:17elapsed 189%CPU (0avgtext+0avgdata 3923732maxresident)k 1994920inputs+38351264outputs (3577major+239138696minor)pagefaults 0swaps {noformat} The build time is improved, and the final linked objects are significantly smaller: || Artifact || Normal || GC sections || | src/.libs/libmesos-1.5.0.so| 766M | 274M| | src/mesos-agent| 6.5M | 1.6M| | src/mesos-cni-port-mapper | 1.8M | 65K| | src/mesos-containerizer| 2.7M | 477K| | src/mesos-default-executor | 13M | 4.6M| | src/mesos-docker-executor | 9.6M | 3.6M| | src/mesos-execute | 7.5M | 2.6M| | src/mesos-executor | 7.5M | 2.6M| | src/mesos-fetcher | 6.1M | 1.9M| | src/mesos-io-switchboard | 3.7M | 874K| | src/mesos-local| 4.8M | 1.4M| | src/mesos-log | 1.8M | 348K| | src/mesos-logrotate-logger | 4.7M | 1.6M| | src/mesos-master | 6.3M | 1.6M| | src/mesos-network-helper | 4.2M | 1.2M| | src/mesos-resolve | 2.7M | 642K| | src/mesos-tcp-connect | 2.3M | 630K| | src/mesos-tests| 557M | 89M| | src/mesos-usage| 3.0M | 955K| > Enable function sections in the build. > -- > > Key: MESOS-8348 > URL: https://issues.apache.org/jira/browse/MESOS-8348 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: James Peach >Assignee: James Peach > > Enable {{-ffunction-sections}} to improve the ability of the toolchain to > remove unused code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8348) Enable function sections in the build.
James Peach created MESOS-8348: -- Summary: Enable function sections in the build. Key: MESOS-8348 URL: https://issues.apache.org/jira/browse/MESOS-8348 Project: Mesos Issue Type: Bug Components: build Reporter: James Peach Enable {{-ffunction-sections}} to improve the ability of the toolchain to remove unused code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8348) Enable function sections in the build.
[ https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8348: -- Assignee: James Peach > Enable function sections in the build. > -- > > Key: MESOS-8348 > URL: https://issues.apache.org/jira/browse/MESOS-8348 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: James Peach >Assignee: James Peach > > Enable {{-ffunction-sections}} to improve the ability of the toolchain to > remove unused code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8340) Add a no-enforce isolation option.
[ https://issues.apache.org/jira/browse/MESOS-8340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293466#comment-16293466 ] James Peach commented on MESOS-8340: [~jieyu] Do you think this is reasonable? > Add a no-enforce isolation option. > -- > > Key: MESOS-8340 > URL: https://issues.apache.org/jira/browse/MESOS-8340 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach > > Some resource isolators ({{disk/du}}, {{disk/xfs}} and {{network/ports}}) > have the ability to run in a no-enforce mode where they report resource usage > but do not enforce the allocated resource limit. Rather than a separate flag > for each possibility, we could add an agent flag named > {{\-\-noenforce-isolation}} that causes the agent to log any limitation > raised by the given list of isolators, but would not cause the container to > be killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8340) Add a no-enforce isolation option.
James Peach created MESOS-8340: -- Summary: Add a no-enforce isolation option. Key: MESOS-8340 URL: https://issues.apache.org/jira/browse/MESOS-8340 Project: Mesos Issue Type: Bug Components: containerization Reporter: James Peach Some resource isolators ({{disk/du}}, {{disk/xfs}} and {{network/ports}}) have the ability to run in a no-enforce mode where they report resource usage but do not enforce the allocated resource limit. Rather than a separate flag for each possibility, we could add an agent flag named {{\-\-noenforce-isolation}} that causes the agent to log any limitation raised by the given list of isolators, but would not cause the container to be killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
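As a sketch only: the {{--noenforce-isolation}} flag below is the one proposed in this ticket and is not implemented in any Mesos release, so the invocation is hypothetical. The {{--isolation}} flag and the {{disk/du}} and {{network/ports}} isolators are existing agent features.

```shell
# Hypothetical agent invocation under this proposal. The
# --noenforce-isolation flag does NOT exist; it illustrates the
# suggested interface: the listed isolators would log any limitation
# they raise instead of killing the container.
mesos-agent \
  --isolation=disk/du,network/ports \
  --noenforce-isolation=disk/du,network/ports
```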
[jira] [Updated] (MESOS-8337) Invalid state transition attempted when agent is lost.
[ https://issues.apache.org/jira/browse/MESOS-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8337: --- Summary: Invalid state transition attempted when agent is lost. (was: Invalid state transitions when agent is lost) > Invalid state transition attempted when agent is lost. > -- > > Key: MESOS-8337 > URL: https://issues.apache.org/jira/browse/MESOS-8337 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: James Peach > > The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to > {{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that > was added in the same patch. > {noformat} > I1214 23:42:16.507931 22396 master.cpp:10155] Removing task > mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; > cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: > *):[31068-31068, 31069-31069, 31072-31072] of framework > afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent > daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 > (magent0006.xxx.com) > F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == > TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8337) Invalid state transitions when agent is lost
James Peach created MESOS-8337: -- Summary: Invalid state transitions when agent is lost Key: MESOS-8337 URL: https://issues.apache.org/jira/browse/MESOS-8337 Project: Mesos Issue Type: Bug Components: master Reporter: James Peach The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to {{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that was added in the same patch. {noformat} I1214 23:42:16.507931 22396 master.cpp:10155] Removing task mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: *):[31068-31068, 31069-31069, 31072-31072] of framework afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 (magent0006.xxx.com) F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8332) Narrow the container sandbox permissions.
[ https://issues.apache.org/jira/browse/MESOS-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290181#comment-16290181 ] James Peach commented on MESOS-8332: In tests, I notice that {{chown}} on the executor sandbox path logs a warning but doesn't cause a failure, but {{chown}} on nested and standalone container paths fails the container. There might be some compatibility concern around making this behavior consistent since frameworks can currently be sloppy with their user names without failing. > Narrow the container sandbox permissions. > - > > Key: MESOS-8332 > URL: https://issues.apache.org/jira/browse/MESOS-8332 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > Sandboxes are currently created with 0755 permissions, which allows anyone > with local machine access to inspect their contents. We should make them 0750 > to limit access to the owning user and group. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8332) Narrow the container sandbox permissions.
James Peach created MESOS-8332: -- Summary: Narrow the container sandbox permissions. Key: MESOS-8332 URL: https://issues.apache.org/jira/browse/MESOS-8332 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach Assignee: James Peach Priority: Minor Sandboxes are currently created with 0755 permissions, which allows anyone with local machine access to inspect their contents. We should make them 0750 to limit access to the owning user and group. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
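The proposed change is easy to illustrate with plain directory modes; a minimal sketch assuming GNU coreutils ({{stat -c}}), with illustrative directory names rather than real Mesos sandbox paths:

```shell
# 0755 lets any local user enter and list the directory;
# 0750 restricts access to the owning user and group.
mkdir -p -m 0755 sandbox-open
mkdir -p -m 0750 sandbox-restricted
stat -c '%a %n' sandbox-open sandbox-restricted
# -> 755 sandbox-open
#    750 sandbox-restricted
```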
[jira] [Created] (MESOS-8330) Document nested container ACLs
James Peach created MESOS-8330: -- Summary: Document nested container ACLs Key: MESOS-8330 URL: https://issues.apache.org/jira/browse/MESOS-8330 Project: Mesos Issue Type: Bug Components: containerization, documentation Reporter: James Peach None of the nested container ACLs are documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8306) Restrict which agents can statically reserve resources for which roles
[ https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289761#comment-16289761 ] James Peach commented on MESOS-8306: This approach depends on all the agents in a specific class registering with the same principal, right? That seems like a bad idea. > Restrict which agents can statically reserve resources for which roles > -- > > Key: MESOS-8306 > URL: https://issues.apache.org/jira/browse/MESOS-8306 > Project: Mesos > Issue Type: Improvement >Reporter: Yan Xu >Assignee: Yan Xu > > In some use cases part of a Mesos cluster could be reserved for certain > frameworks/roles. A common approach is to use static reservation so the > resources of an agent are only offered to frameworks of the designated roles. > However without proper authorization any (compromised) agent can register > with these special roles and accept workload from these frameworks. > We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} > is allowed to register with static reservation roles {{bar, baz}}; no other > principals are allowed to register with static reservation roles {{bar, baz}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-8306) Restrict which agents can statically reserve resources for which roles
[ https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286509#comment-16286509 ] James Peach edited comment on MESOS-8306 at 12/11/17 9:26 PM: -- Can you be more specific about the proposal? I can't match your description up to the ACLs docs. was (Author: jamespeach): That generally sounds reasonable to me. I expect you want to mirror this into {{UnreserveResources}} for consistency. Think about how this could be extended, e.g. reserve only {{disk}} or {{cpu}} resources. > Restrict which agents can statically reserve resources for which roles > -- > > Key: MESOS-8306 > URL: https://issues.apache.org/jira/browse/MESOS-8306 > Project: Mesos > Issue Type: Improvement >Reporter: Yan Xu >Assignee: Yan Xu > > In some use cases part of a Mesos cluster could be reserved for certain > frameworks/roles. A common approach is to use static reservation so the > resources of an agent are only offered to frameworks of the designated roles. > However without proper authorization any (compromised) agent can register > with these special roles and accept workload from these frameworks. > We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} > is allowed to register with static reservation roles {{bar, baz}}; no other > principals are allowed to register with static reservation roles {{bar, baz}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8306) Restrict which agents can statically reserve resources for which roles
[ https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286509#comment-16286509 ] James Peach commented on MESOS-8306: That generally sounds reasonable to me. I expect you want to mirror this into {{UnreserveResources}} for consistency. Think about how this could be extended, e.g. reserve only {{disk}} or {{cpu}} resources. > Restrict which agents can statically reserve resources for which roles > -- > > Key: MESOS-8306 > URL: https://issues.apache.org/jira/browse/MESOS-8306 > Project: Mesos > Issue Type: Improvement >Reporter: Yan Xu >Assignee: Yan Xu > > In some use cases part of a Mesos cluster could be reserved for certain > frameworks/roles. A common approach is to use static reservation so the > resources of an agent are only offered to frameworks of the designated roles. > However without proper authorization any (compromised) agent can register > with these special roles and accept workload from these frameworks. > We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} > is allowed to register with static reservation roles {{bar, baz}}; no other > principals are allowed to register with static reservation roles {{bar, baz}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.
[ https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284945#comment-16284945 ] James Peach commented on MESOS-8317: The executor failed because it had older protobufs than the scheduler. It was using the JSON content type and the Go jsonpb package pukes if it receives a field that it doesn't know about. The field in question was the {{protocol}} field in the {{HealthCheck}} message. > Check failed when newly registered executor has launched tasks. > --- > > Key: MESOS-8317 > URL: https://issues.apache.org/jira/browse/MESOS-8317 > Project: Mesos > Issue Type: Bug >Reporter: James Peach > > This check in {{slave/slave.cpp}} can fail: > {code} >4105 if (state != RECOVERING && >4106 executor->queuedTasks.empty() && >4107 executor->queuedTaskGroups.empty()) { >4108 CHECK(executor->launchedTasks.empty()) >4109 << " Newly registered executor '" << executor->id >4110 << "' has launched tasks"; >4111 >4112 LOG(WARNING) << "Shutting down the executor " << *executor >4113 << " because it has no tasks to run"; >4114 >4115 _shutdownExecutor(framework, executor); >4116 >4117 return; >4118 } > {code} > This happens with the following sequence of events: > 1. HTTP executor subscribes > 2. Agent sends a LAUNCH message that the executor can't decode > 3. HTTP executor closes the channel and re-subscribes > 4. Agent hits the above check because the executor sends an empty task list > (it never understood the LAUNCH message), but the agent thinks that a task > should have been launched. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8317) Check failed when newly registered executor has launched tasks.
James Peach created MESOS-8317: -- Summary: Check failed when newly registered executor has launched tasks. Key: MESOS-8317 URL: https://issues.apache.org/jira/browse/MESOS-8317 Project: Mesos Issue Type: Bug Reporter: James Peach This check in {{slave/slave.cpp}} can fail: {code} 4105 if (state != RECOVERING && 4106 executor->queuedTasks.empty() && 4107 executor->queuedTaskGroups.empty()) { 4108 CHECK(executor->launchedTasks.empty()) 4109 << " Newly registered executor '" << executor->id 4110 << "' has launched tasks"; 4111 4112 LOG(WARNING) << "Shutting down the executor " << *executor 4113 << " because it has no tasks to run"; 4114 4115 _shutdownExecutor(framework, executor); 4116 4117 return; 4118 } {code} This happens with the following sequence of events: 1. HTTP executor subscribes 2. Agent sends a LAUNCH message that the executor can't decode 3. HTTP executor closes the channel and re-subscribes 4. Agent hits the above check because the executor sends an empty task list (it never understood the LAUNCH message), but the agent thinks that a task should have been launched. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.
[ https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284378#comment-16284378 ] James Peach commented on MESOS-8317: /cc [~vinodkone] > Check failed when newly registered executor has launched tasks. > --- > > Key: MESOS-8317 > URL: https://issues.apache.org/jira/browse/MESOS-8317 > Project: Mesos > Issue Type: Bug >Reporter: James Peach > > This check in {{slave/slave.cpp}} can fail: > {code} >4105 if (state != RECOVERING && >4106 executor->queuedTasks.empty() && >4107 executor->queuedTaskGroups.empty()) { >4108 CHECK(executor->launchedTasks.empty()) >4109 << " Newly registered executor '" << executor->id >4110 << "' has launched tasks"; >4111 >4112 LOG(WARNING) << "Shutting down the executor " << *executor >4113 << " because it has no tasks to run"; >4114 >4115 _shutdownExecutor(framework, executor); >4116 >4117 return; >4118 } > {code} > This happens with the following sequence of events: > 1. HTTP executor subscribes > 2. Agent sends a LAUNCH message that the executor can't decode > 3. HTTP executor closes the channel and re-subscribes > 4. Agent hits the above check because the executor sends an empty task list > (it never understood the LAUNCH message), but the agent thinks that a task > should have been launched. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8313) Provide a host namespace container supervisor.
[ https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282757#comment-16282757 ] James Peach commented on MESOS-8313: {quote} The other drawback is that we created another nanny process in addition to the one that'll perform pid 1 reaping. {quote} Right. Currently, the supervisor is optional and inside the container. In this proposal, there would always be a supervisor outside the container, though I think that the one inside the container would remain optional. > Provide a host namespace container supervisor. > -- > > Key: MESOS-8313 > URL: https://issues.apache.org/jira/browse/MESOS-8313 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach > Attachments: IMG_2629.JPG > > > After more investigation on user namespaces, the current implementation of > creating the container namespaces needs some adjustment before we can > implement user namespaces in a usable fashion. > The problems we need to address are: > 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace > to mount {{procfs}}. Currently, this prevents containers joining the host PID > namespace. The workaround is to always create a new container PID namespace > (as a child of the user namespace) with the {{namespaces/pid}} isolator. > 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network > namespace to mount {{sysfs}}. There's no general workaround for this since we > can't generally require containers to not join the host network namespace. > 3. The containerizer can't enter a user namespace after entering the > {{chroot}}. This restriction makes the existing order of containerizer > operations impossible to retain in the case where we want the executor to be > in a new user namespace that has no children (i.e. to protect the container > from a privileged task). 
> After some discussion with [~jieyu], we believe that we can solve most or all > of these issues by creating a new container supervisor that runs fully > outside the container and is responsible for constructing the rootfs mount > namespace, launching the containerizer to enter the rest of the container, > and waiting on the entered process. > Since this new supervisor process is not running in the user namespace, it > will be able to construct the container rootfs in a new mount namespace > without user namespace restrictions. We can then clone a child to fully > create and enter container namespaces along with the prefabricated rootfs > mount namespace. > The only drawback to this approach is that the container's mount namespace > will be owned by the root user namespace rather than the container user > namespace. We are OK with this for now. > The plan here is to retain the existing {{mesos-containerizer launch}} > subcommand and add a new {{mesos-containerizer supervise}} subcommand, which > will be its parent process. This new subcommand will be used for the default > executor and custom executor code paths. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8142) Improve container security with user namespaces.
[ https://issues.apache.org/jira/browse/MESOS-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8142: --- Summary: Improve container security with user namespaces. (was: Improve container security with user namespaces) > Improve container security with user namespaces. > > > Key: MESOS-8142 > URL: https://issues.apache.org/jira/browse/MESOS-8142 > Project: Mesos > Issue Type: Improvement > Components: containerization, security >Reporter: James Peach >Assignee: James Peach > > As a first pass at supporting user namespaces, figure out how we can use them > to improve container security when running untrusted tasks. > This ticket is specifically targeting how to build a user namespace hierarchy > and excluding any sort of ID mapping for the container images. -- This message was sent by Atlassian JIRA (v6.4.14#64029)