[jira] [Created] (MESOS-9041) Break agent dependencies out of libmesos.

2018-06-30 Thread James Peach (JIRA)
James Peach created MESOS-9041:
--

 Summary: Break agent dependencies out of libmesos.
 Key: MESOS-9041
 URL: https://issues.apache.org/jira/browse/MESOS-9041
 Project: Mesos
  Issue Type: Task
  Components: agent, build
Reporter: James Peach


{{libmesos.so}} includes all the dependencies for both the master and the 
agent. This means that it has far more symbols than necessary (causing inflated 
build times), and drags in dependencies (e.g. libnl.so, libblkid.so) that are 
only necessary on the agent. We should attempt to separate the agent code out 
of {{libmesos.so}}, which would improve build cleanliness and hopefully build 
performance.





[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-06-30 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528651#comment-16528651
 ] 

James Peach commented on MESOS-9040:


/cc [~benjaminhindman]

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.





[jira] [Created] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-06-30 Thread James Peach (JIRA)
James Peach created MESOS-9040:
--

 Summary: Break scheduler driver dependency on mesos-local.
 Key: MESOS-9040
 URL: https://issues.apache.org/jira/browse/MESOS-9040
 Project: Mesos
  Issue Type: Task
  Components: build, scheduler driver
Reporter: James Peach


The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
on the {{mesos-local}} code. This seems fairly hacky, but it also causes binary 
dependencies on {{src/local/local.cpp}} to be dragged into {{libmesos.so}}. 
{{libmesos.so}} would not otherwise require this code, which could be isolated 
in the {{mesos-local}} command.
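A hedged sketch (not the actual Mesos code) of one way to isolate the
dependency: if {{mesos-local}} registered a launcher callback at startup, the
scheduler driver would depend only on a function pointer rather than linking
{{src/local/local.cpp}} into {{libmesos.so}}. All names below are illustrative.

{code}
#include <functional>
#include <string>
#include <utility>

// Signature of an in-process cluster launcher; the real one would take the
// master flags and return the master's PID.
using LocalLauncher = std::function<std::string()>;

// Holds whatever launcher was registered; empty unless mesos-local set one.
static LocalLauncher& localLauncher()
{
  static LocalLauncher launcher;
  return launcher;
}

// Called from the mesos-local command at startup, so the dependency on the
// local cluster code stays in the command, not in libmesos.so.
void registerLocalLauncher(LocalLauncher launcher)
{
  localLauncher() = std::move(launcher);
}

// What the scheduler driver would do instead of calling into the local
// cluster code directly: use the registered launcher if one exists.
bool launchLocalCluster(std::string* masterPid)
{
  if (!localLauncher()) {
    return false; // mesos-local is not linked in
  }

  *masterPid = localLauncher()();
  return true;
}
{code}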





[jira] [Commented] (MESOS-9030) mock_slave.cpp fails to build with GCC 8.

2018-06-26 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524610#comment-16524610
 ] 

James Peach commented on MESOS-9030:


Verified that using googletest master doesn't fix this.
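For context, a minimal sketch (heavily simplified, not the real libprocess
code) of the pattern GCC 8 trips over: the converting constructor in
{{process/future.hpp}} is unconstrained, so the stricter {{std::tuple}}
overload checks in libstdc++ 8 ({{std::is_constructible}}, visible in the
trace below) admit it for gmock's {{Matcher}} type, and instantiating the
constructor body then fails outside the immediate SFINAE context because
{{set(u)}} has no viable overload.

{code}
// Hedged, simplified sketch of the shape of the problem.
struct Nothing {};

template <typename T>
class Future
{
public:
  Future() {}
  Future(const T& t) { set(t); }

  // Unconstrained converting constructor: it is considered for *any* U, so
  // type traits report Future<T> as constructible from unrelated types, and
  // the set(u) call only fails once the constructor body is instantiated.
  template <typename U>
  Future(const U& u) { set(u); }

private:
  bool set(const T&) { return true; }
};

// With libstdc++ 8, std::tuple's constructor dispatch can instantiate
// Future<Nothing>(const Matcher&), producing the "no matching function for
// call to set(...)" error quoted below.
{code}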

> mock_slave.cpp fails to build with GCC 8.
> -
>
> Key: MESOS-9030
> URL: https://issues.apache.org/jira/browse/MESOS-9030
> Project: Mesos
>  Issue Type: Task
>  Components: build, test
>Reporter: James Peach
>Priority: Major
>
> {noformat}
> In file included from 
> ../../include/mesos/authentication/secret_generator.hpp:22,
>  from ../../src/tests/mock_slave.cpp:19:
> ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
> ‘process::Future::Future(const U&) [with U = testing::Matcher std::tuple&>&>; T = Nothing]’:
> /usr/include/c++/8/type_traits:932:12:   required from ‘struct 
> std::is_constructible&, testing::Matcher std::tuple&>&>&&>’
> /usr/include/c++/8/type_traits:138:12:   required from ‘struct 
> std::__and_&, 
> testing::Matcher&>&>&&> >’
> /usr/include/c++/8/tuple:485:68:   required from ‘static constexpr bool 
> std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements 
> = {testing::Matcher&>&>}; 
> bool  = true; _Elements = {const process::Future&}]’
> /usr/include/c++/8/tuple:641:59:   required by substitution of 
> ‘template sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const 
> process::Future&>::_NotSameTuple<_UElements ...>()), const 
> process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
> std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) 
> == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), 
> const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements 
> ...>()) && (1 >= 1)), bool>::type  > constexpr std::tuple process::Future&>::tuple(_UElements&& ...) [with _UElements = 
> {testing::Matcher&>&>}; 
> typename std::enable_if<((std::_TC<((1 == sizeof... (_UElements)) && 
> std::_TC<(sizeof... (_UElements) == 1), const 
> process::Future&>::_NotSameTuple<_UElements ...>()), const 
> process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
> std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) 
> == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), 
> const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements 
> ...>()) && (1 >= 1)), bool>::type  = 1]’
> ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:894:37:
>required from 
> ‘testing::internal::TypedExpectation::TypedExpectation(testing::internal::FunctionMockerBase*,
>  const char*, int, const string&, const ArgumentMatcherTuple&) [with F = 
> void(const process::Future&); testing::internal::string = 
> std::__cxx11::basic_string; 
> testing::internal::TypedExpectation::ArgumentMatcherTuple = 
> std::tuple&> >]’
> ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
>required from ‘testing::internal::TypedExpectation& 
> testing::internal::FunctionMockerBase::AddNewExpectation(const char*, int, 
> const string&, const ArgumentMatcherTuple&) [with F = void(const 
> process::Future&); testing::internal::string = 
> std::__cxx11::basic_string; 
> testing::internal::FunctionMockerBase::ArgumentMatcherTuple = 
> std::tuple&> >]’
> ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1273:43:
>required from ‘testing::internal::TypedExpectation& 
> testing::internal::MockSpec::InternalExpectedAt(const char*, int, const 
> char*, const char*) [with F = void(const process::Future&)]’
> ../../src/tests/mock_slave.cpp:139:3:   required from here
> ../../3rdparty/libprocess/include/process/future.hpp:1092:3: error: no 
> matching function for call to ‘process::Future::set(const 
> testing::Matcher&>&>&)’
>set(u);
>^~~
> ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: candidate: 
> ‘bool process::Future::set(const T&) [with T = Nothing]’
>  bool Future::set(const T& t)
>   ^
> ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note:   no known 
> conversion for argument 1 from ‘const testing::Matcher process::Future&>&>’ to ‘const Nothing&’
> ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: candidate: 
> ‘bool process::Future::set(T&&) [with T = Nothing]’
>  bool Future::set(T&& t)
>   ^
> ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note:   no known 
> conversion for argument 1 from ‘const testing::Matcher process::Future&>&>’ to ‘Nothing&&’
> ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
> ‘process::Future::Future(const U&) [with U = const 
> testing::MatcherInterface process::Future&>&>*; T = Nothing]’:
> 

[jira] [Commented] (MESOS-9030) mock_slave.cpp fails to build with GCC 8.

2018-06-26 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524348#comment-16524348
 ] 

James Peach commented on MESOS-9030:


{noformat}
$ gcc --version
gcc (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1)
{noformat}

> mock_slave.cpp fails to build with GCC 8.
> -
>
> Key: MESOS-9030
> URL: https://issues.apache.org/jira/browse/MESOS-9030
> Project: Mesos
>  Issue Type: Task
>  Components: build, test
>Reporter: James Peach
>Priority: Major
>
> {noformat}
> In file included from 
> ../../include/mesos/authentication/secret_generator.hpp:22,
>  from ../../src/tests/mock_slave.cpp:19:
> ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
> ‘process::Future::Future(const U&) [with U = testing::Matcher std::tuple&>&>; T = Nothing]’:
> /usr/include/c++/8/type_traits:932:12:   required from ‘struct 
> std::is_constructible&, testing::Matcher std::tuple&>&>&&>’
> /usr/include/c++/8/type_traits:138:12:   required from ‘struct 
> std::__and_&, 
> testing::Matcher&>&>&&> >’
> /usr/include/c++/8/tuple:485:68:   required from ‘static constexpr bool 
> std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements 
> = {testing::Matcher&>&>}; 
> bool  = true; _Elements = {const process::Future&}]’
> /usr/include/c++/8/tuple:641:59:   required by substitution of 
> ‘template sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const 
> process::Future&>::_NotSameTuple<_UElements ...>()), const 
> process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
> std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) 
> == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), 
> const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements 
> ...>()) && (1 >= 1)), bool>::type  > constexpr std::tuple process::Future&>::tuple(_UElements&& ...) [with _UElements = 
> {testing::Matcher&>&>}; 
> typename std::enable_if<((std::_TC<((1 == sizeof... (_UElements)) && 
> std::_TC<(sizeof... (_UElements) == 1), const 
> process::Future&>::_NotSameTuple<_UElements ...>()), const 
> process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
> std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) 
> == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), 
> const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements 
> ...>()) && (1 >= 1)), bool>::type  = 1]’
> ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:894:37:
>required from 
> ‘testing::internal::TypedExpectation::TypedExpectation(testing::internal::FunctionMockerBase*,
>  const char*, int, const string&, const ArgumentMatcherTuple&) [with F = 
> void(const process::Future&); testing::internal::string = 
> std::__cxx11::basic_string; 
> testing::internal::TypedExpectation::ArgumentMatcherTuple = 
> std::tuple&> >]’
> ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
>required from ‘testing::internal::TypedExpectation& 
> testing::internal::FunctionMockerBase::AddNewExpectation(const char*, int, 
> const string&, const ArgumentMatcherTuple&) [with F = void(const 
> process::Future&); testing::internal::string = 
> std::__cxx11::basic_string; 
> testing::internal::FunctionMockerBase::ArgumentMatcherTuple = 
> std::tuple&> >]’
> ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1273:43:
>required from ‘testing::internal::TypedExpectation& 
> testing::internal::MockSpec::InternalExpectedAt(const char*, int, const 
> char*, const char*) [with F = void(const process::Future&)]’
> ../../src/tests/mock_slave.cpp:139:3:   required from here
> ../../3rdparty/libprocess/include/process/future.hpp:1092:3: error: no 
> matching function for call to ‘process::Future::set(const 
> testing::Matcher&>&>&)’
>set(u);
>^~~
> ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: candidate: 
> ‘bool process::Future::set(const T&) [with T = Nothing]’
>  bool Future::set(const T& t)
>   ^
> ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note:   no known 
> conversion for argument 1 from ‘const testing::Matcher process::Future&>&>’ to ‘const Nothing&’
> ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: candidate: 
> ‘bool process::Future::set(T&&) [with T = Nothing]’
>  bool Future::set(T&& t)
>   ^
> ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note:   no known 
> conversion for argument 1 from ‘const testing::Matcher process::Future&>&>’ to ‘Nothing&&’
> ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
> ‘process::Future::Future(const U&) [with U = const 
> testing::MatcherInterface process::Future&>&>*; T = 

[jira] [Created] (MESOS-9030) mock_slave.cpp fails to build with GCC 8.

2018-06-26 Thread James Peach (JIRA)
James Peach created MESOS-9030:
--

 Summary: mock_slave.cpp fails to build with GCC 8.
 Key: MESOS-9030
 URL: https://issues.apache.org/jira/browse/MESOS-9030
 Project: Mesos
  Issue Type: Task
  Components: build, test
Reporter: James Peach


{noformat}
In file included from 
../../include/mesos/authentication/secret_generator.hpp:22,
 from ../../src/tests/mock_slave.cpp:19:
../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
‘process::Future::Future(const U&) [with U = testing::Matcher&>&>; T = Nothing]’:
/usr/include/c++/8/type_traits:932:12:   required from ‘struct 
std::is_constructible&, testing::Matcher&>&>&&>’
/usr/include/c++/8/type_traits:138:12:   required from ‘struct 
std::__and_&, 
testing::Matcher&>&>&&> >’
/usr/include/c++/8/tuple:485:68:   required from ‘static constexpr bool 
std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements = 
{testing::Matcher&>&>}; bool 
 = true; _Elements = {const process::Future&}]’
/usr/include/c++/8/tuple:641:59:   required by substitution of ‘template&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 
1), const process::Future&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) 
&& (1 >= 1)), bool>::type  > constexpr std::tuple&>::tuple(_UElements&& ...) [with _UElements = 
{testing::Matcher&>&>}; 
typename std::enable_if<((std::_TC<((1 == sizeof... (_UElements)) && 
std::_TC<(sizeof... (_UElements) == 1), const 
process::Future&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 
1), const process::Future&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) 
&& (1 >= 1)), bool>::type  = 1]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:894:37:
   required from 
‘testing::internal::TypedExpectation::TypedExpectation(testing::internal::FunctionMockerBase*,
 const char*, int, const string&, const ArgumentMatcherTuple&) [with F = 
void(const process::Future&); testing::internal::string = 
std::__cxx11::basic_string; 
testing::internal::TypedExpectation::ArgumentMatcherTuple = 
std::tuple&> >]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
   required from ‘testing::internal::TypedExpectation& 
testing::internal::FunctionMockerBase::AddNewExpectation(const char*, int, 
const string&, const ArgumentMatcherTuple&) [with F = void(const 
process::Future&); testing::internal::string = 
std::__cxx11::basic_string; 
testing::internal::FunctionMockerBase::ArgumentMatcherTuple = 
std::tuple&> >]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1273:43:
   required from ‘testing::internal::TypedExpectation& 
testing::internal::MockSpec::InternalExpectedAt(const char*, int, const 
char*, const char*) [with F = void(const process::Future&)]’
../../src/tests/mock_slave.cpp:139:3:   required from here
../../3rdparty/libprocess/include/process/future.hpp:1092:3: error: no matching 
function for call to ‘process::Future::set(const 
testing::Matcher&>&>&)’
   set(u);
   ^~~
../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: candidate: 
‘bool process::Future::set(const T&) [with T = Nothing]’
 bool Future::set(const T& t)
  ^
../../3rdparty/libprocess/include/process/future.hpp:1761:6: note:   no known 
conversion for argument 1 from ‘const testing::Matcher&>&>’ to ‘const Nothing&’
../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: candidate: 
‘bool process::Future::set(T&&) [with T = Nothing]’
 bool Future::set(T&& t)
  ^
../../3rdparty/libprocess/include/process/future.hpp:1754:6: note:   no known 
conversion for argument 1 from ‘const testing::Matcher&>&>’ to ‘Nothing&&’
../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
‘process::Future::Future(const U&) [with U = const 
testing::MatcherInterface&>&>*; 
T = Nothing]’:
/usr/include/c++/8/type_traits:932:12:   required from ‘struct 
std::is_constructible&, const 
testing::MatcherInterface&>&>*&>’
/usr/include/c++/8/type_traits:138:12:   required from ‘struct 
std::__and_&, const 
testing::MatcherInterface&>&>*&> >’
/usr/include/c++/8/tuple:485:68:   required from ‘static constexpr bool 
std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements = 
{const testing::MatcherInterface&>&>*&}; bool  = true; _Elements = {const 
process::Future&}]’
/usr/include/c++/8/tuple:641:59:   required by substitution of ‘template&>::_NotSameTuple<_UElements ...>()), const 

[jira] [Commented] (MESOS-9021) Specify allowed devices for tasks

2018-06-22 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520626#comment-16520626
 ] 

James Peach commented on MESOS-9021:


Added a link to the design doc. This is basically straightforward, but we need 
to think through the security implications and the mechanism by which operators 
can apply access control.

> Specify allowed devices for tasks
> -
>
> Key: MESOS-9021
> URL: https://issues.apache.org/jira/browse/MESOS-9021
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: James Peach
>Priority: Minor
>
> Container devices can be specified globally, but not for specific tasks. We 
> should extend the API to allow schedulers to specify allowed devices for 
> particular tasks.





[jira] [Created] (MESOS-9021) Specify allowed devices for tasks

2018-06-22 Thread James Peach (JIRA)
James Peach created MESOS-9021:
--

 Summary: Specify allowed devices for tasks
 Key: MESOS-9021
 URL: https://issues.apache.org/jira/browse/MESOS-9021
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: James Peach


Container devices can be specified globally, but not for specific tasks. We 
should extend the API to allow schedulers to specify allowed devices for 
particular tasks.





[jira] [Assigned] (MESOS-9002) Mem access error in os::Fork::Tree

2018-06-15 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9002:
--

 Assignee: James Peach
 Priority: Minor  (was: Major)
Fix Version/s: 1.7.0

| [r/67614|https://reviews.apache.org/r/67614] | Removed memcpy from 
os::Fork::instantiate. |

> Mem access error in os::Fork::Tree
> --
>
> Key: MESOS-9002
> URL: https://issues.apache.org/jira/browse/MESOS-9002
> Project: Mesos
>  Issue Type: Task
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.7.0
>
>
> Building Mesos with gcc 8.1 (Fedora 28)
> {noformat}
> ../../3rdparty/stout/include/stout/os/posix/fork.hpp: In member function 
> ‘pid_t os::Fork::instantiate(const os::Fork::Tree&) const’:
> ../../3rdparty/stout/include/stout/os/posix/fork.hpp:354:61: error: ‘void* 
> memcpy(void*, const void*, size_t)’ writing to an object of type ‘using 
> element_type = std::remove_extent::type’ {aka ‘struct 
> os::Fork::Tree::Memory’} with no trivial copy-assignment 
> [-Werror=class-memaccess]
>  memcpy(tree.memory.get(), , sizeof(Tree::Memory));
>  ^
> ../../3rdparty/stout/include/stout/os/posix/fork.hpp:235:12: note: ‘using 
> element_type = std::remove_extent::type’ {aka ‘struct 
> os::Fork::Tree::Memory’} declared here
>  struct Memory {
> ^~
> {noformat}





[jira] [Created] (MESOS-9002) Mem access error in os::Fork::Tree

2018-06-15 Thread James Peach (JIRA)
James Peach created MESOS-9002:
--

 Summary: Mem access error in os::Fork::Tree
 Key: MESOS-9002
 URL: https://issues.apache.org/jira/browse/MESOS-9002
 Project: Mesos
  Issue Type: Task
Reporter: James Peach


Building Mesos with gcc 8.1 (Fedora 28)

{noformat}
../../3rdparty/stout/include/stout/os/posix/fork.hpp: In member function ‘pid_t 
os::Fork::instantiate(const os::Fork::Tree&) const’:
../../3rdparty/stout/include/stout/os/posix/fork.hpp:354:61: error: ‘void* 
memcpy(void*, const void*, size_t)’ writing to an object of type ‘using 
element_type = std::remove_extent::type’ {aka ‘struct 
os::Fork::Tree::Memory’} with no trivial copy-assignment 
[-Werror=class-memaccess]
 memcpy(tree.memory.get(), , sizeof(Tree::Memory));
 ^
../../3rdparty/stout/include/stout/os/posix/fork.hpp:235:12: note: ‘using 
element_type = std::remove_extent::type’ {aka ‘struct 
os::Fork::Tree::Memory’} declared here
 struct Memory {
^~
{noformat}
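To illustrate the warning, a minimal sketch of the pattern GCC 8 rejects and,
in spirit, the fix in [r/67614|https://reviews.apache.org/r/67614] (the
{{Memory}} type below is a stand-in, not the real {{os::Fork::Tree::Memory}}):

{code}
#include <cstring>
#include <memory>

struct Memory
{
  int value = 0;

  Memory& operator=(const Memory& other) // non-trivial copy assignment
  {
    value = other.value;
    return *this;
  }
};

int main()
{
  Memory local;
  std::shared_ptr<Memory> memory(new Memory());

  // GCC 8 (-Werror=class-memaccess): memcpy writes to an object with no
  // trivial copy-assignment, bypassing the user-defined operator:
  //   memcpy(memory.get(), &local, sizeof(Memory));

  *memory = local; // well-defined: invokes the copy-assignment operator

  return 0;
}
{code}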





[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.

2018-06-14 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513175#comment-16513175
 ] 

James Peach commented on MESOS-5158:


For CSI volumes, we can assume that the CSI plugin is enforcing quota and 
ignore it in the isolator. This means that if we call 
{{getPersistentVolumePath()}}, we have to verify that it is not a CSI volume 
beforehand.

> Provide XFS quota support for persistent volumes.
> -
>
> Key: MESOS-5158
> URL: https://issues.apache.org/jira/browse/MESOS-5158
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
>
> Given that the lifecycle of persistent volumes is managed outside of the 
> isolator, we may need to further abstract out the quota management 
> functionality to do it outside the XFS isolator.





[jira] [Assigned] (MESOS-5158) Provide XFS quota support for persistent volumes.

2018-06-14 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-5158:
--

Assignee: James Peach

> Provide XFS quota support for persistent volumes.
> -
>
> Key: MESOS-5158
> URL: https://issues.apache.org/jira/browse/MESOS-5158
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
>
> Given that the lifecycle of persistent volumes is managed outside of the 
> isolator, we may need to further abstract out the quota management 
> functionality to do it outside the XFS isolator.





[jira] [Comment Edited] (MESOS-5158) Provide XFS quota support for persistent volumes.

2018-06-14 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513015#comment-16513015
 ] 

James Peach edited comment on MESOS-5158 at 6/14/18 9:27 PM:
-

Persistent volumes are managed in {{Slave::syncCheckpointedResources()}}, which 
will create new volumes and also delete old ones. The isolators are not 
notified about these changes. To support persistent volumes in the XFS 
isolators, we need to do a few things:
 # On recovery, we need to scan existing persistent volumes in order to recover 
the project IDs
 # On resources update, we need to notice any new persistent volumes and 
allocate a project ID for them
 # Periodically, we need to re-scan the persistent volumes to reclaim project 
IDs for volumes that have been deleted.
 # If we are doing active enforcement, we need to add the persistent volumes 
into the set of quotas that we are polling for usage. We need to consider which 
tasks would be killed if the volume is filled.

There's no explicit way to support the {{GROW_VOLUME}} or {{SHRINK_VOLUME}} 
operations, since we would need to know how to update the quota when that 
happens. The agent doesn't explicitly grow the volume; it just updates its 
checkpointed resources. However, updating the quota when the volume is attached 
to a task would work, since the size of shared volumes cannot be altered.


was (Author: jamespeach):
Persistent volumes are managed in {{Slave::syncCheckpointedResources()}}, which 
will create new volumes and also delete old ones. The isolators are not 
notified about these changes. To support persistent volumes in the XFS 
isolators, we need to do a few things:

# On recovery, we need to scan existing persistent volumes in order to recover 
the project IDs
# On resources update, we need to notice any new persistent volumes and 
allocate a project ID for them
# Periodically, we need to re-scan the persistent volumes to reclaim project 
IDs for volumes that have been deleted.
# If we are doing active enforcement, we need to add the persistent volumes 
into the set of quotas that we are polling for usage. We need to consider which 
tasks would be killed if the volume is filled.

There's no explicit way to support the {{GROW_VOLUME}} or {{SHRINK_VOLUME}} 
operations since we would need to know how to update the quota when that 
happens. The agent doesn't explicitly grow the volume, it just updates its 
checkpointed resources. However, updating the quota when it is attached to a 
task would work, since the size of shared volumes cannot be altered.

> Provide XFS quota support for persistent volumes.
> -
>
> Key: MESOS-5158
> URL: https://issues.apache.org/jira/browse/MESOS-5158
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Priority: Major
>
> Given that the lifecycle of persistent volumes is managed outside of the 
> isolator, we may need to further abstract out the quota management 
> functionality to do it outside the XFS isolator.





[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.

2018-06-14 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513015#comment-16513015
 ] 

James Peach commented on MESOS-5158:


Persistent volumes are managed in {{Slave::syncCheckpointedResources()}}, which 
will create new volumes and also delete old ones. The isolators are not 
notified about these changes. To support persistent volumes in the XFS 
isolators, we need to do a few things:

# On recovery, we need to scan existing persistent volumes in order to recover 
the project IDs
# On resources update, we need to notice any new persistent volumes and 
allocate a project ID for them
# Periodically, we need to re-scan the persistent volumes to reclaim project 
IDs for volumes that have been deleted.
# If we are doing active enforcement, we need to add the persistent volumes 
into the set of quotas that we are polling for usage. We need to consider which 
tasks would be killed if the volume is filled.

There's no explicit way to support the {{GROW_VOLUME}} or {{SHRINK_VOLUME}} 
operations, since we would need to know how to update the quota when that 
happens. The agent doesn't explicitly grow the volume; it just updates its 
checkpointed resources. However, updating the quota when the volume is attached 
to a task would work, since the size of shared volumes cannot be altered.
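A rough sketch of steps 1-3 above (helpers illustrative only; the real
isolator would build on its existing XFS project ID utilities and the agent's
checkpointed resources):

{code}
#include <map>
#include <set>
#include <string>
#include <vector>

using ProjectId = unsigned int; // stands in for prid_t

// Illustrative stubs: enumerate persistent volume paths on the agent and
// read the XFS project ID assigned to a directory.
std::vector<std::string> listPersistentVolumePaths() { return {}; }
ProjectId getProjectId(const std::string&) { return 0; }

// Steps 1 and 2: on recovery or on a resources update, scan the persistent
// volumes and (re)build the volume -> project ID mapping.
std::map<std::string, ProjectId> scanProjectIds()
{
  std::map<std::string, ProjectId> scanned;
  for (const std::string& path : listPersistentVolumePaths()) {
    scanned[path] = getProjectId(path);
  }
  return scanned;
}

// Step 3: periodically reclaim project IDs whose volumes were deleted by
// Slave::syncCheckpointedResources() without notifying the isolator.
std::set<ProjectId> reclaimableProjectIds(
    const std::map<std::string, ProjectId>& tracked)
{
  std::set<std::string> existing;
  for (const std::string& path : listPersistentVolumePaths()) {
    existing.insert(path);
  }

  std::set<ProjectId> reclaimable;
  for (const auto& entry : tracked) {
    if (existing.count(entry.first) == 0) {
      reclaimable.insert(entry.second);
    }
  }
  return reclaimable;
}
{code}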

> Provide XFS quota support for persistent volumes.
> -
>
> Key: MESOS-5158
> URL: https://issues.apache.org/jira/browse/MESOS-5158
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Priority: Major
>
> Given that the lifecycle of persistent volumes is managed outside of the 
> isolator, we may need to further abstract out the quota management 
> functionality to do it outside the XFS isolator.





[jira] [Assigned] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky

2018-05-24 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6823:
--

   Resolution: Fixed
 Assignee: Jie Yu
Fix Version/s: 1.7.0

{noformat}
commit 32d4305b87e79ed02cc686e0c29b027e31c6b3a4
Author: Jie Yu 
Date:   Thu May 24 10:05:17 2018 -0700

Adjusted the tests that use nobody.

Used `$SUDO_USER` instead because `nobody` sometimes cannot access
directories under `$HOME` of the current user running the tests.

Review: https://reviews.apache.org/r/67291
{noformat}

> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 
> is flaky
> --
>
> Key: MESOS-6823
> URL: https://issues.apache.org/jira/browse/MESOS-6823
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 12/14 both with/without SSL
>Reporter: Anand Mazumdar
>Assignee: Jie Yu
>Priority: Major
>  Labels: flaky, flaky-test, newbie
> Fix For: 1.7.0
>
>
> This showed up on our internal CI
> {code}
> [23:13:01] :   [Step 11/11] [ RUN  ] 
> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0
> [23:13:01] :   [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] 
> Creating default 'local' authorizer
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] 
> Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) 
> started on 172.16.10.213:45407
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" 
> --zk_session_timeout="10secs"
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] 
> Master only allowing authenticated agents to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] 
> Loading credentials for authentication from 
> '/mnt/teamcity/temp/buildTmp/ev3icd/credentials'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using 
> default 'crammd5' authenticator
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] 
> Authorization enabled
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654551 25733 
> whitelist_watcher.cpp:77] No whitelist given
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] 
> Initialized hierarchical allocator process
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] 
> Elected as the leading master!
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] 
> Recovering from registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] 
> Recovering registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] 
> Successfully fetched the registry (0B) in 

[jira] [Commented] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky

2018-05-16 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477709#comment-16477709
 ] 

James Peach commented on MESOS-6823:


Suggestion ... rather than execute as {{nobody}}, use 
{{os::getenv("SUDO_USER")}}.

> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 
> is flaky
> --
>
> Key: MESOS-6823
> URL: https://issues.apache.org/jira/browse/MESOS-6823
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 12/14 both with/without SSL
>Reporter: Anand Mazumdar
>Priority: Major
>  Labels: flaky, flaky-test, newbie
>
> This showed up on our internal CI
> {code}
> [23:13:01] :   [Step 11/11] [ RUN  ] 
> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0
> [23:13:01] :   [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] 
> Creating default 'local' authorizer
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] 
> Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) 
> started on 172.16.10.213:45407
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" 
> --zk_session_timeout="10secs"
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] 
> Master only allowing authenticated agents to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] 
> Loading credentials for authentication from 
> '/mnt/teamcity/temp/buildTmp/ev3icd/credentials'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using 
> default 'crammd5' authenticator
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] 
> Authorization enabled
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654551 25733 
> whitelist_watcher.cpp:77] No whitelist given
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] 
> Initialized hierarchical allocator process
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] 
> Elected as the leading master!
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] 
> Recovering from registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] 
> Recovering registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] 
> Successfully fetched the registry (0B) in 210944ns
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] 
> Applied 1 operations in 5006ns; attempting to update the registry
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] 
> Successfully updated the registry in 194048ns
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] 
> Successfully recovered 

[jira] [Commented] (MESOS-8897) ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky

2018-05-15 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476097#comment-16476097
 ] 

James Peach commented on MESOS-8897:


| [r67116|https://reviews.apache.org/r/67116/] | Change XFS Kill Test to use 
ASSERT_GE. |
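As the quoted description below notes, converting bytes to megabytes with
integer division rounds down, so usage is under-reported. A self-contained
sketch of the truncation (values illustrative):

{code}
#include <cstdint>
#include <iostream>

int main()
{
  const uint64_t MEGABYTES = 1024 * 1024;

  // Just under 2MB actually used.
  uint64_t used = 2 * MEGABYTES - 1;

  uint64_t roundedDown = used / MEGABYTES;                 // 1: under-reports
  uint64_t roundedUp = (used + MEGABYTES - 1) / MEGABYTES; // 2: rounds up

  std::cout << roundedDown << "MB vs " << roundedUp << "MB" << std::endl;
  return 0;
}
{code}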

> ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky
> -
>
> Key: MESOS-8897
> URL: https://issues.apache.org/jira/browse/MESOS-8897
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
>
> {noformat:title=}
> [ RUN ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill
> meta-data=/dev/loop0 isize=256 agcount=2, agsize=5120 blks
>  = sectsz=512 attr=2, projid32bit=1
>  = crc=0
> data = bsize=4096 blocks=10240, imaxpct=25
>  = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal log bsize=4096 blocks=1200, version=2
>  = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> I0508 17:55:12.353438 13453 exec.cpp:162] Version: 1.7.0
> I0508 17:55:12.370332 13451 exec.cpp:236] Executor registered on agent 
> 49668ffa-2a69-4867-b31a-4972b4ac13d2-S0
> I0508 17:55:12.376093 13447 executor.cpp:178] Received SUBSCRIBED event
> I0508 17:55:12.376771 13447 executor.cpp:182] Subscribed executor on 
> mesos.vagrant
> I0508 17:55:12.377038 13447 executor.cpp:178] Received LAUNCH event
> I0508 17:55:12.381901 13447 executor.cpp:665] Starting task 
> edb798b4-1b16-4de4-828c-0db132df70ab
> I0508 17:55:12.387936 13447 executor.cpp:485] Running 
> '/tmp/mesos-build/mesos/build/src/mesos-containerizer launch 
> '
> I0508 17:55:12.392854 13447 executor.cpp:678] Forked command at 13456
> 2+0 records in
> 2+0 records out
> 2097152 bytes (2.1 MB) copied, 0.00404074 s, 519 MB/s
> ../../src/tests/containerizer/xfs_quota_tests.cpp:618: Failure
> Expected: (limit.disk().get()) > (Megabytes(1)), actual: 1MB vs 1MB
> [ FAILED ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill (1182 ms)
> {noformat}
> [~jpe...@apache.org] mentioned that 
> {code}
> 409 // If the soft limit is exceeded the container should be killed.
> 410 if (quotaInfo->used > quotaInfo->softLimit) {
> 411   Resource resource;
> 412   resource.set_name("disk");
> 413   resource.set_type(Value::SCALAR);
> 414   resource.mutable_scalar()->set_value(
> 415 quotaInfo->used.bytes() / Bytes::MEGABYTES);
> 416
> 417   info->limitation.set(
> 418   protobuf::slave::createContainerLimitation(
> 419   Resources(resource),
> 420   "Disk usage (" + stringify(quotaInfo->used) +
> 421   ") exceeds quota (" +
> 422   stringify(quotaInfo->softLimit) + ")",
> 423   TaskStatus::REASON_CONTAINER_LIMITATION_DISK));
> 424 }
> 425   }
> {code}
> Converting to MB is rounding down, so we report less space than was actually 
> used.





[jira] [Assigned] (MESOS-8897) ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky

2018-05-15 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8897:
--

Assignee: James Peach

> ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky
> -
>
> Key: MESOS-8897
> URL: https://issues.apache.org/jira/browse/MESOS-8897
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
>
> {noformat:title=}
> [ RUN ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill
> meta-data=/dev/loop0 isize=256 agcount=2, agsize=5120 blks
>  = sectsz=512 attr=2, projid32bit=1
>  = crc=0
> data = bsize=4096 blocks=10240, imaxpct=25
>  = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal log bsize=4096 blocks=1200, version=2
>  = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> I0508 17:55:12.353438 13453 exec.cpp:162] Version: 1.7.0
> I0508 17:55:12.370332 13451 exec.cpp:236] Executor registered on agent 
> 49668ffa-2a69-4867-b31a-4972b4ac13d2-S0
> I0508 17:55:12.376093 13447 executor.cpp:178] Received SUBSCRIBED event
> I0508 17:55:12.376771 13447 executor.cpp:182] Subscribed executor on 
> mesos.vagrant
> I0508 17:55:12.377038 13447 executor.cpp:178] Received LAUNCH event
> I0508 17:55:12.381901 13447 executor.cpp:665] Starting task 
> edb798b4-1b16-4de4-828c-0db132df70ab
> I0508 17:55:12.387936 13447 executor.cpp:485] Running 
> '/tmp/mesos-build/mesos/build/src/mesos-containerizer launch 
> '
> I0508 17:55:12.392854 13447 executor.cpp:678] Forked command at 13456
> 2+0 records in
> 2+0 records out
> 2097152 bytes (2.1 MB) copied, 0.00404074 s, 519 MB/s
> ../../src/tests/containerizer/xfs_quota_tests.cpp:618: Failure
> Expected: (limit.disk().get()) > (Megabytes(1)), actual: 1MB vs 1MB
> [ FAILED ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill (1182 ms)
> {noformat}
> [~jpe...@apache.org] mentioned that 
> {code}
> 409 // If the soft limit is exceeded the container should be killed.
> 410 if (quotaInfo->used > quotaInfo->softLimit) {
> 411   Resource resource;
> 412   resource.set_name("disk");
> 413   resource.set_type(Value::SCALAR);
> 414   resource.mutable_scalar()->set_value(
> 415 quotaInfo->used.bytes() / Bytes::MEGABYTES);
> 416
> 417   info->limitation.set(
> 418   protobuf::slave::createContainerLimitation(
> 419   Resources(resource),
> 420   "Disk usage (" + stringify(quotaInfo->used) +
> 421   ") exceeds quota (" +
> 422   stringify(quotaInfo->softLimit) + ")",
> 423   TaskStatus::REASON_CONTAINER_LIMITATION_DISK));
> 424 }
> 425   }
> {code}
> Converting to MB is rounding down, so we report less space than was actually 
> used.





[jira] [Created] (MESOS-8913) Resource provider leaks file descriptors into executors.

2018-05-14 Thread James Peach (JIRA)
James Peach created MESOS-8913:
--

 Summary: Resource provider leaks file descriptors into executors.
 Key: MESOS-8913
 URL: https://issues.apache.org/jira/browse/MESOS-8913
 Project: Mesos
  Issue Type: Task
  Components: agent, security
Reporter: James Peach


I have an executor that closes unknown file descriptors when it starts up:

{noformat}
2018/05/14 20:54:43.210293 util_linux.go:65: closing extraneous fd 126 
(/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/08.log)
2018/05/14 20:54:43.210345 util_linux.go:47: unable to call fcntl() to get fd 
options for fd 3: errno bad file descriptor
2018/05/14 20:54:43.210385 util_linux.go:65: closing extraneous fd 321 
(/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/LOG)
2018/05/14 20:54:43.210438 util_linux.go:65: closing extraneous fd 322 
(/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/LOCK)
2018/05/14 20:54:43.210501 util_linux.go:65: closing extraneous fd 324 
(/srv/mesos/work/meta/slaves/30d57187-99b4-4e63-aba8-f425a80a6702-S8/resource_provider_registry/MANIFEST-06)
{noformat}

It is closing leveldb descriptors leaked by the resource provider.
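The usual fix is to mark such descriptors close-on-exec at the point where
they are opened (for leveldb, wherever the resource provider registry opens
its files). A hedged helper sketch using plain POSIX {{fcntl(2)}}:

{code}
#include <fcntl.h>

// Set FD_CLOEXEC so the descriptor is not inherited across exec().
// Returns 0 on success, -1 (with errno set) on failure.
int setCloexec(int fd)
{
  int flags = ::fcntl(fd, F_GETFD);
  if (flags == -1) {
    return -1;
  }

  return ::fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
{code}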





[jira] [Created] (MESOS-8907) curl fetcher fails with HTTP/2

2018-05-10 Thread James Peach (JIRA)
James Peach created MESOS-8907:
--

 Summary: curl fetcher fails with HTTP/2
 Key: MESOS-8907
 URL: https://issues.apache.org/jira/browse/MESOS-8907
 Project: Mesos
  Issue Type: Task
  Components: fetcher
Reporter: James Peach


{noformat}
[ RUN  ] 
ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
...
I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image 
'quay.io/coreos/alpine-sh' from 
'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to 
'/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn'
E0510 20:52:00.756072 25003 slave.cpp:6176] Container 
'5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor 
'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework 
7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode 
HTTP responses: Decoding failed
HTTP/2 200
server: nginx/1.13.12
date: Fri, 11 May 2018 03:52:00 GMT
content-type: application/vnd.docker.distribution.manifest.v1+prettyjws
content-length: 4486
docker-content-digest: 
sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663
x-frame-options: DENY
strict-transport-security: max-age=63072000; preload
...
{noformat}

Note that curl reports the HTTP version as "HTTP/2". This happens with modern 
curl, which automatically negotiates HTTP/2, but the docker fetcher isn't 
prepared to parse that status line.

{noformat}
$ curl -i --raw -L -s -S -o -  'http://quay.io/coreos/alpine-sh?latest#https'
HTTP/1.1 301 Moved Permanently
Content-Type: text/html
Date: Fri, 11 May 2018 04:07:44 GMT
Location: https://quay.io/coreos/alpine-sh?latest
Server: nginx/1.13.12
Content-Length: 186
Connection: keep-alive

HTTP/2 301
server: nginx/1.13.12
date: Fri, 11 May 2018 04:07:45 GMT
content-type: text/html; charset=utf-8
content-length: 287
location: https://quay.io/coreos/alpine-sh/?latest
x-frame-options: DENY
strict-transport-security: max-age=63072000; preload
{noformat}
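A hedged sketch (not the fetcher's actual parser) of a status-line parse that
tolerates both forms; in HTTP/2 responses the version has no minor digit and
the reason phrase ("OK") is absent:

{code}
#include <sstream>
#include <string>

// Accepts "HTTP/1.1 200 OK" as well as HTTP/2's terser "HTTP/2 200".
bool parseStatusLine(const std::string& line, int* code)
{
  std::istringstream in(line);
  std::string version;

  if (!(in >> version >> *code)) {
    return false;
  }

  return version.compare(0, 5, "HTTP/") == 0;
}
{code}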





[jira] [Commented] (MESOS-8792) Automatically create whitelisted devices.

2018-05-10 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470558#comment-16470558
 ] 

James Peach commented on MESOS-8792:


As per the design doc, the way forward on this is a new {{linux/devices}} 
isolator. The initial implementation will share the {{\-\-allowed_devices}} 
configuration flag so that it will automatically work in concert with the 
{{cgroups/devices}} isolator. However, the mechanism is general enough that we 
can later build on it to enable per-container devices.

> Automatically create whitelisted devices.
> -
>
> Key: MESOS-8792
> URL: https://issues.apache.org/jira/browse/MESOS-8792
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When the operator configures the {{\-\-allowed_devices}} agent flag, the 
> devices cgroup is configured but the task still needs to actually create the 
> device node. This is awkward because the task might not have enough 
> capabilities to {{mknod}} and even if we wanted to grant the capabilities, 
> the application may need to be modified to make the right system calls.
> We should enhance the isolator and containerizer to automatically create 
> device nodes that have been whitelisted.





[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-05-07 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466073#comment-16466073
 ] 

James Peach commented on MESOS-6575:


{noformat}
commit 081c3114fefa18c6acd1e884e6d6583232e30d5c
Author: Harold Dost 
Date:   Mon May 7 08:39:29 2018 -0700

Documented the `--xfs-kill-containers` flag.

Added a description of the `--xfs-kill-containers` flag to the
`disk/xfs` isolator page and listed it in the upgrade documentation.

Review: https://reviews.apache.org/r/66975/
{noformat}

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
> Fix For: 1.6.0
>
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} 
> protobuf when the executor exceeds the quota, the {{disk/xfs}} isolator 
> relies on XFS's internal quota enforcement and silently fails the {{write}} 
> operation that exceeds the quota limit, without surfacing the quota breach 
> information.
> This task is to change the `disk/xfs` isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature relies on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently return 
> an {{EDQUOT}} error on writes that would exceed the quota. The isolator can 
> then track disk usage via {{xfs_quota}}, much like {{disk/du}} uses {{du}}, 
> every {{container_disk_watch_interval}}, and surface quota-exceeded events 
> via a {{ContainerLimitation}} protobuf, causing the executor to be 
> terminated. This feature can then be turned on/off via the existing 
> {{enforce_container_disk_quota}} option.
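A condensed sketch of the polling check described above (names illustrative;
the snippet quoted in MESOS-8897 earlier shows the real shape in the
isolator):

{code}
#include <cstdint>
#include <string>

struct QuotaInfo
{
  uint64_t used;      // bytes currently charged to the XFS project
  uint64_t softLimit; // bytes allowed before the container is killed
};

// Polled every container_disk_watch_interval: if the soft limit is exceeded,
// the isolator surfaces a ContainerLimitation and the executor is terminated.
bool exceedsQuota(const QuotaInfo& info, std::string* message)
{
  if (info.used <= info.softLimit) {
    return false;
  }

  *message = "Disk usage exceeds quota";
  return true;
}
{code}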





[jira] [Assigned] (MESOS-8865) Suspicious enum value comparisons in scheduler Java bindings

2018-05-02 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8865:
--

Assignee: Benjamin Bannier
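The warnings quoted below all reduce to this pattern (minimal sketch with
hypothetical enums): the switch condition has one enum type while the case
labels name values from a different one, which unscoped enums let slip
through as integer comparisons.

{code}
namespace v0 { enum CallType { SUBSCRIBE, TEARDOWN }; }
namespace v1 { enum CallType { SUBSCRIBE, TEARDOWN }; }

int describe(v0::CallType type)
{
  switch (type) {
    case v1::SUBSCRIBE: // clang: comparison of two values with different
      return 1;         // enumeration types in switch statement
    case v1::TEARDOWN:
      return 2;
  }
  return 0; // the fix: use the v0:: labels that match the condition's type
}
{code}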

> Suspicious enum value comparisons in scheduler Java bindings
> 
>
> Key: MESOS-8865
> URL: https://issues.apache.org/jira/browse/MESOS-8865
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Major
>
> Clang reports suspicious comparisons of enum values in the scheduler Java 
> bindings,
> {noformat}
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:563:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::SUBSCRIBE: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:576:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::TEARDOWN: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:581:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::ACCEPT: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:601:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::ACCEPT_INVERSE_OFFERS:
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:602:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::DECLINE_INVERSE_OFFERS:
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:603:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::SHUTDOWN: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:609:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::DECLINE: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:621:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::REVIVE: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:626:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::KILL: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:631:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' (aka 'const 
> mesos::v1::scheduler::Call_Type')) [clang-diagnostic-enum-compare-switch]
> case Call::ACKNOWLEDGE: {
>  ^
> /home/bbannier/src/mesos/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp:642:10:
>  warning: comparison of two values with different enumeration types in switch 
> statement ('::mesos::scheduler::Call_Type' and 'const 
> mesos::v1::scheduler::Call::Type' 

[jira] [Commented] (MESOS-8792) Automatically create whitelisted devices.

2018-05-02 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461446#comment-16461446
 ] 

James Peach commented on MESOS-8792:


I have some preliminary patches for this and have experimented a bit. The major 
conceptual problem is that if we create the device nodes when we construct the 
chroot, the process is already running in cgroups (specifically the devices 
cgroup). This means that the devices cgroup must allow the {{mknod}} 
permission; you can't just specify read+write devices.
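For concreteness, a hedged sketch of what creating a whitelisted device node
involves (names illustrative); this {{mknod(2)}} call is exactly the operation
the devices cgroup must permit with the mknod ('m') access, beyond read/write:

{code}
#include <sys/stat.h>
#include <sys/sysmacros.h>

#include <string>

// Create a character device node inside the container's root filesystem.
// Returns 0 on success, -1 (with errno set) on failure.
int createDeviceNode(
    const std::string& rootfs,
    const std::string& path,
    unsigned int major,
    unsigned int minor)
{
  const std::string target = rootfs + path;
  return ::mknod(target.c_str(), S_IFCHR | 0666, makedev(major, minor));
}
{code}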

> Automatically create whitelisted devices.
> -
>
> Key: MESOS-8792
> URL: https://issues.apache.org/jira/browse/MESOS-8792
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When the operator configures the {{\-\-allowed_devices}} agent flag, the 
> devices cgroup is configured but the task still needs to actually create the 
> device node. This is awkward because the task might not have enough 
> capabilities to {{mknod}}, and even if we wanted to grant the capabilities, 
> the application may need to be modified to make the right system calls.
> We should enhance the isolator and containerizer to automatically create 
> device nodes that have been whitelisted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-04-30 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459045#comment-16459045
 ] 

James Peach commented on MESOS-6575:


| [r/66173|https://reviews.apache.org/r/66173/] | Added test for `disk/xfs` 
container limitation. |
| [r/66001|https://reviews.apache.org/r/66001/] | Added soft limit and kill to 
`disk/xfs`. |

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
> Fix For: 1.6.0
>
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation 
> that causes the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the `disk/xfs` isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that cause the quota to be exceeded. The 
> isolator can then track the disk quota via {{xfs_quota}}, much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, causing the 
> executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8792) Automatically create whitelisted devices.

2018-04-16 Thread James Peach (JIRA)
James Peach created MESOS-8792:
--

 Summary: Automatically create whitelisted devices.
 Key: MESOS-8792
 URL: https://issues.apache.org/jira/browse/MESOS-8792
 Project: Mesos
  Issue Type: Improvement
  Components: cgroups, containerization
Reporter: James Peach
Assignee: James Peach


When the operator configures the {{\-\-allowed_devices}} agent flag, the 
devices cgroup is configured but the task still needs to actually create the 
device node. This is awkward because the task might not have enough 
capabilities to {{mknod}}, and even if we wanted to grant the capabilities, the 
application may need to be modified to make the right system calls.

We should enhance the isolator and containerizer to automatically create device 
nodes that have been whitelisted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8531) Some task status updates sent by the default executor don't contain a REASON.

2018-04-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430944#comment-16430944
 ] 

James Peach commented on MESOS-8531:


This refers to the status updates that are sent when the default executor tears 
down a task group in response to a single task failing. In Slack, we discussed 
defining a separate reason field that would be used to make it more explicit 
that a particular task was killed because the group failed (in some sense).

> Some task status updates sent by the default executor don't contain a REASON.
> -
>
> Key: MESOS-8531
> URL: https://issues.apache.org/jira/browse/MESOS-8531
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.2.3, 1.3.1, 1.4.1, 1.5.0
>Reporter: Gastón Kleiman
>Priority: Major
>  Labels: default-executor, mesosphere
>
> The default executor doesn't set a reason when sending {{TASK_KILLING}}, 
> {{TASK_KILLED}}, and {{TASK_FAILED}} task status updates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8763) Enable -Wshadow in the build.

2018-04-06 Thread James Peach (JIRA)
James Peach created MESOS-8763:
--

 Summary: Enable -Wshadow in the build.
 Key: MESOS-8763
 URL: https://issues.apache.org/jira/browse/MESOS-8763
 Project: Mesos
  Issue Type: Improvement
  Components: build
Reporter: James Peach


Shadowed variables are a source of confusion and bugs. We should enable 
{{-Wshadow}} and eliminate these permanently. We would need to solve the 
shadowing issues that we get from our 3rd party dependencies.

{noformat}
In file included from ../../src/common/protobuf_utils.cpp:28:
In file included from ../../include/mesos/slave/isolator.hpp:27:
In file included from ../../3rdparty/libprocess/include/process/dispatch.hpp:20:
../../3rdparty/libprocess/include/process/process.hpp:242:54: error: 
declaration shadows a field of 'process::ProcessBase' [-Werror,-Wshadow]
  void delegate(const std::string& name, const UPID& pid)
 ^
../../3rdparty/libprocess/include/process/process.hpp:488:8: note: previous 
declaration is here
  UPID pid;
   ^
In file included from ../../src/common/protobuf_utils.cpp:53:
In file included from ../../src/master/master.hpp:51:
../../3rdparty/libprocess/include/process/protobuf.hpp:460:12: error: 
declaration shadows a local variable [-Werror,-Wshadow]
{ Req* req = nullptr; google::protobuf::Message* m = req; (void)m; }
   ^
../../3rdparty/libprocess/include/process/protobuf.hpp:457:18: note: previous 
declaration is here
  const Req& req) const
 ^
In file included from ../../src/common/protobuf_utils.cpp:53:
In file included from ../../src/master/master.hpp:54:
In file included from 
../../3rdparty/libprocess/include/process/metrics/counter.hpp:19:
In file included from 
../../3rdparty/libprocess/include/process/metrics/metric.hpp:22:
In file included from 
../../3rdparty/libprocess/include/process/statistics.hpp:21:
../../3rdparty/libprocess/include/process/timeseries.hpp:106:24: error: 
declaration shadows a field of 'TimeSeries' [-Werror,-Wshadow]
std::vector values;
   ^
../../3rdparty/libprocess/include/process/timeseries.hpp:242:21: note: previous 
declaration is here
  std::map values;
^
In file included from ../../src/common/protobuf_utils.cpp:53:
In file included from ../../src/master/master.hpp:79:
In file included from ../../src/master/flags.hpp:36:
In file included from ../../src/messages/flags.hpp:30:
../../src/common/parse.hpp:119:35: error: declaration shadows a local variable 
[-Werror,-Wshadow]
   const JSON::Value& value,
  ^
../../src/common/parse.hpp:108:72: note: previous declaration is here
inline Try> parse(const std::string& value)
   ^
In file included from ../../src/common/protobuf_utils.cpp:53:
../../src/master/master.hpp:2983:46: error: declaration shadows a field of 
'mesos::internal::master::Role' [-Werror,-Wshadow]
auto allocatedTo = [](const std::string& role) {
 ^
../../src/master/master.hpp:2998:21: note: previous declaration is here
  const std::string role;
^
{noformat}
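A contrived standalone example (not from the Mesos tree) of the class of bug this catches:

{noformat}
struct Timer
{
  int value;

  // With -Wshadow the parameter below warns: it shadows 'Timer::value'.
  void set(int value)
  {
    value = value;  // Bug: assigns the parameter to itself; the field is
                    // never updated. 'this->value = value' was intended.
  }
};

int main()
{
  Timer t{0};
  t.set(42);
  return t.value;  // Still 0 because of the shadowing bug.
}
{noformat}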



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8716) Freezer controller is not returned to thaw if task termination fails

2018-03-21 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408669#comment-16408669
 ] 

James Peach commented on MESOS-8716:


Here's a stack trace that is symptomatic of this problem:
{noformat}
2018-03-21T04:31:49.272492+00:00 mslave1218 kernel: [3969040.584460] Call Trace:
2018-03-21T04:31:49.272494+00:00 mslave1218 kernel: [3969040.587253]  
[] schedule+0x39/0x90
2018-03-21T04:31:49.283684+00:00 mslave1218 kernel: [3969040.592551]  
[] __refrigerator+0x4d/0x140
2018-03-21T04:31:49.283689+00:00 mslave1218 kernel: [3969040.598458]  
[] get_signal+0x36d/0x390
2018-03-21T04:31:49.294814+00:00 mslave1218 kernel: [3969040.604103]  
[] do_signal+0x20/0x130
2018-03-21T04:31:49.294820+00:00 mslave1218 kernel: [3969040.609576]  
[] ? freezing_slow_path+0x4d/0x80
2018-03-21T04:31:49.306702+00:00 mslave1218 kernel: [3969040.615939]  
[] ? SyS_wait4+0xa9/0xf0
2018-03-21T04:31:49.306706+00:00 mslave1218 kernel: [3969040.621495]  
[] ? is_current_pgrp_orphaned+0xe0/0xe0
2018-03-21T04:31:49.319554+00:00 mslave1218 kernel: [3969040.628358]  
[] do_notify_resume+0x58/0x70
2018-03-21T04:31:49.319559+00:00 mslave1218 kernel: [3969040.634351]  
[] int_signal+0x12/0x17
{noformat}
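For reference, a rough sketch of the recovery sequence proposed in the description below (cgroup paths are illustrative, the pids and freezer controllers are assumed to be available, and error handling is elided):

{noformat}
#include <signal.h>
#include <sys/types.h>

#include <fstream>
#include <string>

// Sketch: stop new forks, SIGKILL everything in the cgroup, then thaw so
// the pending signals are actually delivered and the container can exit.
void forceDestroy(const std::string& freezer, const std::string& pids)
{
  // 1. Prevent further processes from being created.
  std::ofstream(pids + "/pids.max") << "0";

  // 2. Walk the cgroup and SIGKILL every process in it.
  std::ifstream procs(freezer + "/cgroup.procs");
  pid_t pid;
  while (procs >> pid) {
    kill(pid, SIGKILL);
  }

  // 3. Thaw: frozen processes resume and take the pending SIGKILL.
  std::ofstream(freezer + "/freezer.state") << "THAWED";
}
{noformat}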

> Freezer controller is not returned to thaw if task termination fails
> 
>
> Key: MESOS-8716
> URL: https://issues.apache.org/jira/browse/MESOS-8716
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.3.2
>Reporter: Sargun Dhillon
>Priority: Major
>
> This issue is related to https://issues.apache.org/jira/browse/MESOS-8004. A 
> container may fail to terminate for a variety of reasons. One common reason 
> in our system is that containers which rely on external storage run fsync 
> before exiting (fsync on SIGTERM), which means the termination can time out. 
>  
> Even though Mesos has sent the requisite kill signals, the task will never 
> terminate because the cgroup stays frozen. 
>  
> The intended behaviour should be that on failure to terminate, if the pids 
> isolator is running, pids.max should be set to 0 to prevent further 
> processes from being created; the cgroup should then be walked and SIGKILLed, 
> and finally thawed. Once the processes finish thawing, the kill signal will 
> be delivered and processed, resulting in the container finally finishing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-6555) Namespace 'mnt' is not supported

2018-03-20 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388251#comment-16388251
 ] 

James Peach edited comment on MESOS-6555 at 3/20/18 4:53 PM:
-

| [r/66175|https://reviews.apache.org/r/66175] | Added isolator checks for 
namespaces support. |


was (Author: jamespeach):
| [r/65932|https://reviews.apache.org/r/65932] | Added a generic mechanism to 
check for isolator requirements. |

> Namespace 'mnt' is not supported
> 
>
> Key: MESOS-6555
> URL: https://issues.apache.org/jira/browse/MESOS-6555
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Affects Versions: 1.0.0, 1.2.3, 1.3.1, 1.4.1, 1.5.0
> Environment: SUSE 11 SP3, kernel: 3.0.101-0.47.71-default #1 SMP Thu 
> Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux 
>Reporter: AndyPang
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.6.0
>
>
> The same code runs fine on Debian (kernel version '4.1.0-0'), while on 
> SUSE 11 SP3 it fails with the error below.
> {code:title=mesos-execute|borderStyle=solid}
> ./mesos-execute --command="sleep 100" --master=:xxx  --name=sleep 
> --docker_image=ubuntu
> I1105 11:26:21.090703 194814 scheduler.cpp:172] Version: 1.0.0
> I1105 11:26:21.092821 194837 scheduler.cpp:461] New master detected at 
> master@:xxx
> Subscribed with ID 'fdb8546d-ca11-4a51-a297-8401e53b7692-'
> Submitted task 'sleep' to agent 'fdb8546d-ca11-4a51-a297-8401e53b7692-S0'
> Received status update TASK_FAILED for task 'sleep'
>   message: 'Failed to launch container: Collect failed: Failed to setup 
> hostname and network files: Failed to enter the mount namespace of pid 
> 194976: Namespace 'mnt' is not supported
> ; Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-03-14 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8609:
--

Assignee: James Peach  (was: Zhitao Li)

> Create a metric to indicate how long agent takes to recover executors
> -
>
> Key: MESOS-8609
> URL: https://issues.apache.org/jira/browse/MESOS-8609
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>Assignee: James Peach
>Priority: Minor
>  Labels: Metrics, agent
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-03-14 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8609:
--

Assignee: Zhitao Li  (was: James Peach)

> Create a metric to indicate how long agent takes to recover executors
> -
>
> Key: MESOS-8609
> URL: https://issues.apache.org/jira/browse/MESOS-8609
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>  Labels: Metrics, agent
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-03-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393177#comment-16393177
 ] 

James Peach commented on MESOS-6575:


{quote}
I guess I don't understand the opposition to having the soft limit as in the 
current implementation the soft limit is being set, but it happens to be set to 
the exact amount as the hard limit. The advantage of the soft limit is that we 
don't have to keep track of how long has something been over the soft limit, we 
perform the system call which provides us a time when the grace period is over 
and once that occurs we can kill the application.
{quote}

My reasoning is that it doesn't matter how long the task has exceeded the 
allocated limit for. The `disk/du` isolator doesn't wait for you to be over the 
quota for any length of time - the task is terminated as soon as the violation 
is detected. It's certainly possible to set a different soft limit, but I can't 
see how it helps. The isolator still needs to poll on an interval and verify 
the used space.
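To make the polling concrete, here is a hedged sketch of checking project quota usage directly (the device path and project ID are made up; this only illustrates the usage-vs-limit comparison, not the actual patch):

{noformat}
#include <linux/dqblk_xfs.h>
#include <sys/quota.h>
#include <sys/types.h>

#include <cstdint>
#include <cstdio>

#ifndef PRJQUOTA
#define PRJQUOTA 2  // Project quota type, from <linux/quota.h>.
#endif

int main()
{
  fs_disk_quota_t quota = {};

  const char* device = "/dev/sdb1";  // Hypothetical XFS-backed work_dir device.
  int projectId = 42;                // Hypothetical sandbox project ID.

  if (quotactl(QCMD(Q_XGETQUOTA, PRJQUOTA), device, projectId,
               reinterpret_cast<caddr_t>(&quota)) != 0) {
    perror("quotactl");
    return 1;
  }

  // XFS reports usage and limits in 512-byte basic blocks.
  uint64_t used = quota.d_bcount * 512ULL;
  uint64_t limit = quota.d_blk_hardlimit * 512ULL;

  if (used > limit) {
    printf("over quota: raise a ContainerLimitation\n");
  }

  return 0;
}
{noformat}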

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation 
> that causes the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the `disk/xfs` isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that cause the quota to be exceeded. The 
> isolator can then track the disk quota via {{xfs_quota}}, much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, causing the 
> executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-03-08 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391804#comment-16391804
 ] 

James Peach commented on MESOS-6575:


> James Peach Would you be able to act as the shepherd for getting this patch 
> in?

Yes I can shepherd. However, I don't think that setting the soft limit is the 
right approach. I can't see a scenario where it is actually needed. If the 
isolator needs to poll (and it almost certainly does), then all it needs to do 
is to compare the actual disk usage against the allocated disk resource.

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation 
> that causes the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the `disk/xfs` isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that cause the quota to be exceeded. The 
> isolator can then track the disk quota via {{xfs_quota}}, much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, causing the 
> executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389066#comment-16389066
 ] 

James Peach edited comment on MESOS-6918 at 3/7/18 6:01 AM:


{quote}
[~jamespeach], do you think it's feasible to target some of this work for 1.6?
{quote}


Yes I think it's doable.


was (Author: jamespeach):
> [~jamespeach], do you think it's feasible to target some of this work for 1.6?

Yes I think it's doable.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389066#comment-16389066
 ] 

James Peach commented on MESOS-6918:


> [~jamespeach], do you think it's feasible to target some of this work for 1.6?

Yes I think it's doable.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195412#comment-16195412
 ] 

James Peach edited comment on MESOS-6918 at 3/7/18 5:55 AM:


Summary from our discussion:
 - retain the existing {{Timer}} value that holds the duration of the last 
sample
 - capture total duration (monotonic sum) for {{Timers}} in their time series
 - capture total sample count for {{Timers}} in their time series
 - replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or 
something)


was (Author: jamespeach):
Summary from our discussion:

- retain the existing {{Timer}} value that holds the duration of the last sample
- capture total duration (monotonic sum) for {{Timer}}s in their time series
- capture total sample count for {{Timer}}s in their time series
- replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or 
something)

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6128) Make "re-register" vs. "reregister" consistent in the master

2018-03-06 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6128:
--

Assignee: James Peach

> Make "re-register" vs. "reregister" consistent in the master
> 
>
> Key: MESOS-6128
> URL: https://issues.apache.org/jira/browse/MESOS-6128
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: James Peach
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> Per discussion in https://reviews.apache.org/r/50705/, we sometimes use 
> "re-register" in comments and elsewhere we use "reregister". We should pick 
> one form and use it consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-03-01 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382948#comment-16382948
 ] 

James Peach commented on MESOS-6575:


{quote}
When the resource is updated in the xfs handler they are not tracked, but 
instead are added up.
{quote}

This is because the XFS isolator doesn't support path volumes, so there's no 
need to track any paths. It might be interesting to refactor a unified way of 
tracking disk resources as a prerequisite to any other XFS changes, but AFAICT 
that's not actually required here.

{quote}
The idea behind the "diff_bytes" would be that you'd take the hard limit of any 
given task and subtract that amount of bytes to create a soft_limit below the 
hard limit.
{quote}

Thinking about this some more, I'm not sure that we need to do anything with 
soft limits at all. Let's assume that we implement this for task sandboxes by 
applying a hard limit that is "disk_resource + some_constant_slop". We still 
need to have the isolator periodically check the usage in order to raise the 
limitation, so it doesn't really matter whether we have a soft limit. All we 
really need to do is check the current usage against the resource limit.

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation 
> that causes the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the `disk/xfs` isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that cause the quota to be exceeded. The 
> isolator can then track the disk quota via {{xfs_quota}}, much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, causing the 
> executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8610) NsTest.SupportedNamespaces fails on CentOS7

2018-02-26 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8610:
--

   Assignee: James Peach
Component/s: test

| [r/65804|https://reviews.apache.org/r/65804] | Fixed a typo in the 
NsTest.SupportedNamespaces test. |

> NsTest.SupportedNamespaces fails on CentOS7
> ---
>
> Key: MESOS-8610
> URL: https://issues.apache.org/jira/browse/MESOS-8610
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 7
>Reporter: Jan Schlicht
>Assignee: James Peach
>Priority: Major
>  Labels: flaky-test
>
> Failed on a {{GLOG_v=1 src/mesos-tests --verbose}} run with
> {noformat}
> [ RUN  ] NsTest.SupportedNamespaces
> ../../src/tests/containerizer/ns_tests.cpp:119: Failure
> Value of: (ns::supported(n)).get()
>   Actual: false
> Expected: true
> Which is: true
> CLONE_NEWUSER
> ../../src/tests/containerizer/ns_tests.cpp:124: Failure
> Value of: (ns::supported(allNamespaces)).get()
>   Actual: false
> Expected: true
> Which is: true
> CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER
> [  FAILED  ] NsTest.SupportedNamespaces (0 ms)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8559) Add a default disk resource flag option.

2018-02-21 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8559:
--

Assignee: (was: James Peach)

> Add a default disk resource flag option.
> 
>
> Key: MESOS-8559
> URL: https://issues.apache.org/jira/browse/MESOS-8559
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Minor
>
> Since in MESOS-8558 we are documenting the current semantics that an absent 
> disk resource means that the task has no disk usage restrictions, consider 
> adding a new agent flag that would let operators specify a default disk usage 
> amount for tasks that are launched without any disk resource. Alternatively, 
> we could validate (on the master) that tasks always have a minimum resource 
> profile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8559) Add a default disk resource flag option.

2018-02-21 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8559:
--

Assignee: James Peach

> Add a default disk resource flag option.
> 
>
> Key: MESOS-8559
> URL: https://issues.apache.org/jira/browse/MESOS-8559
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> Since in MESOS-8558 we are documenting the current semantics that an absent 
> disk resource means that the task has no disk usage restrictions, consider 
> adding a new agent flag that would let operators specify a default disk usage 
> amount for tasks that are launched without any disk resource. Alternatively, 
> we could validate (on the master) that tasks always have a minimum resource 
> profile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8585) Agent Crashes When Ask to Start Task with Unknown User

2018-02-15 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365805#comment-16365805
 ] 

James Peach commented on MESOS-8585:


Yeah, crashing in this case seems pretty unfortunate. Probably 
{{createExecutorDirectory}} should return an error, and we should refactor the 
callers to be able to propagate that correctly.
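Something along these lines, as a standalone sketch (the real Mesos code uses stout's {{Try}}; signatures here are approximate):

{noformat}
#include <pwd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <optional>
#include <string>

// Returns an error message on failure instead of CHECK-failing (and thus
// aborting the agent), so the caller can surface e.g. TASK_FAILED.
std::optional<std::string> createExecutorDirectory(
    const std::string& path,
    const std::string& user)
{
  if (::mkdir(path.c_str(), 0750) != 0) {
    return "Failed to create '" + path + "'";
  }

  passwd* pw = ::getpwnam(user.c_str());
  if (pw == nullptr) {
    // The crash in this bug: 'No such user'.
    return "Failed to chown '" + path + "': no such user '" + user + "'";
  }

  if (::chown(path.c_str(), pw->pw_uid, pw->pw_gid) != 0) {
    return "Failed to chown '" + path + "' to '" + user + "'";
  }

  return std::nullopt;
}

int main()
{
  // The caller propagates the error instead of crashing.
  std::optional<std::string> error =
    createExecutorDirectory("/tmp/executor-sandbox", "bad");
  return error ? 1 : 0;
}
{noformat}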

> Agent Crashes When Ask to Start Task with Unknown User
> --
>
> Key: MESOS-8585
> URL: https://issues.apache.org/jira/browse/MESOS-8585
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Karsten
>Priority: Major
> Attachments: dcos-mesos-slave.service.1.gz, 
> dcos-mesos-slave.service.2.gz
>
>
> The Marathon team has an integration test that tries to start a task with an 
> unknown user. The test expects a \{{TASK_FAILED}}. However, we see 
> \{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent 
> crashes and restarts.
>  
> {code}
>  783 2018-02-14 14:55:45: I0214 14:55:45.319974  6213 slave.cpp:2542] 
> Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for 
> framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
> 784 2018-02-14 14:55:45: I0214 14:55:45.320605  6213 paths.cpp:727] 
> Creating sandbox 
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05
> 784 
> a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
>  for user 'bad'
> 785 2018-02-14 14:55:45: F0214 14:55:45.321131  6213 paths.cpp:735] 
> CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' 
> Failed to create executor directory '/var/lib/mesos/slave/
> 785 
> slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6
> 785 66d4acc88'
> 786 2018-02-14 14:55:45: *** Check failure stack trace: ***
> 787 2018-02-14 14:55:45: @ 0x7f72033444ad  
> google::LogMessage::Fail()
> 788 2018-02-14 14:55:45: @ 0x7f72033462dd  
> google::LogMessage::SendToLog()
> 789 2018-02-14 14:55:45: @ 0x7f720334409c  
> google::LogMessage::Flush()
> 790 2018-02-14 14:55:45: @ 0x7f7203346bd9  
> google::LogMessageFatal::~LogMessageFatal()
> 791 2018-02-14 14:55:45: @ 0x56544ca378f9  
> _CheckFatal::~_CheckFatal()
> 792 2018-02-14 14:55:45: @ 0x7f720270f30d  
> mesos::internal::slave::paths::createExecutorDirectory()
> 793 2018-02-14 14:55:45: @ 0x7f720273812c  
> mesos::internal::slave::Framework::addExecutor()
> 794 2018-02-14 14:55:45: @ 0x7f7202753e35  
> mesos::internal::slave::Slave::__run()
> 795 2018-02-14 14:55:45: @ 0x7f7202764292  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4
> 795 
> listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1
> 795 
> 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_
> 796 2018-02-14 14:55:45: @ 0x7f72032a2b11  
> process::ProcessBase::consume()
> 797 2018-02-14 14:55:45: @ 0x7f72032b183c  
> process::ProcessManager::resume()
> 798 2018-02-14 14:55:45: @ 0x7f72032b6da6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 799 2018-02-14 14:55:45: @ 0x7f72005ced73  (unknown)
> 800 2018-02-14 14:55:45: @ 0x7f72000cf52c  (unknown)
> 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd  (unknown)
> 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, 
> code=killed, status=6/ABRT
> 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed 
> state.
> 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result 
> 'signal'.
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time 
> over, scheduling restart.
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel 
> agent.
> 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel 
> agent...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8558) Document semantics of absent disk resources

2018-02-09 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8558:
--

Assignee: James Peach

> Document semantics of absent disk resources
> ---
>
> Key: MESOS-8558
> URL: https://issues.apache.org/jira/browse/MESOS-8558
> Project: Mesos
>  Issue Type: Documentation
>  Components: containerization, documentation
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> In the Containerizer Working Group, we decided that we should simply document 
> the current semantics of how disk resources are enforced when schedulers 
> don't specify any disk resource for their tasks: such a task runs with no 
> disk usage restrictions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8559) Add a default disk resource flag option.

2018-02-09 Thread James Peach (JIRA)
James Peach created MESOS-8559:
--

 Summary: Add a default disk resource flag option.
 Key: MESOS-8559
 URL: https://issues.apache.org/jira/browse/MESOS-8559
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


Since in MESOS-8558 we are documenting the current semantics that an absent 
disk resource means that the task has no disk usage restrictions, consider 
adding a new agent flag that would let operators specify a default disk usage 
amount for tasks that are launched without any disk resource. Alternatively, 
we could validate (on the master) that tasks always have a minimum resource 
profile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8558) Document semantics of absent disk resources

2018-02-09 Thread James Peach (JIRA)
James Peach created MESOS-8558:
--

 Summary: Document semantics of absent disk resources
 Key: MESOS-8558
 URL: https://issues.apache.org/jira/browse/MESOS-8558
 Project: Mesos
  Issue Type: Documentation
  Components: containerization, documentation
Reporter: James Peach


In the Containerizer Working Group, we decided that we should simply document 
the current semantics of how disk resources are enforced when schedulers don't 
specify any disk resource for their tasks: such a task runs with no disk usage 
restrictions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8313) Provide a host namespace container supervisor.

2018-02-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358625#comment-16358625
 ] 

James Peach commented on MESOS-8313:


Note: this supervisor needs to reap all its children, as per MESOS-5893.

> Provide a host namespace container supervisor.
> --
>
> Key: MESOS-8313
> URL: https://issues.apache.org/jira/browse/MESOS-8313
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Attachments: IMG_2629.JPG
>
>
> After more investigation on user namespaces, the current implementation of 
> creating the container namespaces needs some adjustment before we can 
> implement user namespaces in a useable fashion.
> The problems we need to address are:
> 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace 
> to mount {{procfs}}. Currently, this prevents containers joining the host PID 
> namespace. The workaround is to always create a new container PID namespace 
> (as a child of the user namespace) with the {{namespaces/pid}} isolator.
> 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network 
> namespace to mount {{sysfs}}. There's no general workaround for this since we 
> can't generally require containers to not join the host network namespace.
> 3. The containerizer can't enter a user namespace after entering the 
> {{chroot}}. This restriction makes it impossible to retain the existing order 
> of containerizer operations in the case where we want the executor to be in a 
> new user namespace that has no children (i.e. to protect the container from a 
> privileged task).
> After some discussion with [~jieyu], we believe that we can solve most or all 
> of these issues by creating a new container supervisor that runs fully 
> outside the container and is responsible for constructing the rootfs mount 
> namespace, launching the containerizer to enter the rest of the container, 
> and waiting on the entered process.
> Since this new supervisor process is not running in the user namespace, it 
> will be able to construct the container rootfs in a new mount namespace 
> without user namespace restrictions. We can then clone a child to fully 
> create and enter container namespaces along with the prefabricated rootfs 
> mount namespace.
> The only drawback to this approach is that the container's mount namespace 
> will be owned by the root user namespace rather than the container user 
> namespace. We are OK with this for now.
> The plan here is to retain the existing {{mesos-containerizer launch}} 
> subcommand and add a new {{mesos-containerizer supervise}} subcommand, which 
> will be its parent process. This new subcommand will be used for the default 
> executor and custom executor code paths.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5893) mesos-executor should adopt and reap orphan child processes

2018-02-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358622#comment-16358622
 ] 

James Peach commented on MESOS-5893:


The host namespace supervisor tracked in MESOS-8313 will make itself a reaper 
and reap all container processes.
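A minimal sketch of that pattern (requires Linux >= 3.4; error handling elided):

{noformat}
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

int main()
{
  // Orphaned descendants are reparented to this process instead of init.
  prctl(PR_SET_CHILD_SUBREAPER, 1);

  // ... fork/exec the container workload here ...

  // Reap every child, including adopted orphans, so no zombies accumulate.
  for (;;) {
    pid_t pid = waitpid(-1, nullptr, 0);
    if (pid == -1 && errno == ECHILD) {
      break;  // No children remain.
    }
  }

  return 0;
}
{noformat}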

> mesos-executor should adopt and reap orphan child processes
> ---
>
> Key: MESOS-5893
> URL: https://issues.apache.org/jira/browse/MESOS-5893
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
> Environment: mesos compiled from git master ( 1.1.0 ) 
> {{../configure --enable-ssl --enable-libevent --prefix=/usr --enable-optimize 
> --enable-silent-rules --enable-xfs-disk-isolator}}
> isolators : 
> {{namespaces/pid,cgroups/cpu,cgroups/mem,filesystem/linux,docker/runtime,network/cni,docker/volume}}
>Reporter: Stéphane Cottin
>Priority: Major
>  Labels: containerizer
>
> The Mesos containerizer does not properly handle the death of child processes.
> Discovered using marathon-lb: each topology update forks another haproxy. The 
> old haproxy process should properly die after its last client connection is 
> terminated, but instead turns into a zombie.
> {noformat}
>  7716 ?Ssl0:00  |   \_ mesos-executor 
> --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox 
> --user=root --working_directory=/marathon-lb 
> --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491
>  7813 ?Ss 0:00  |   |   \_ sh -c /marathon-lb/run sse 
> --marathon https://marathon:8443 --auth-credentials user:pass --group 
> 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050
>  7823 ?S  0:00  |   |   |   \_ /bin/bash /marathon-lb/run sse 
> --marathon https://marathon:8443 --auth-credentials user:pass --group 
> external --ssl-certs /certs --max-serv-port-ip-per-task 20050
>  7827 ?S  0:00  |   |   |   \_ /usr/bin/runsv 
> /marathon-lb/service/haproxy
>  7829 ?S  0:00  |   |   |   |   \_ /bin/bash ./run
>  8879 ?S  0:00  |   |   |   |   \_ sleep 0.5
>  7828 ?Sl 0:00  |   |   |   \_ python3 
> /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config 
> /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload 
> /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 
> --auth-credentials user:pass --group external --max-serv-port-ip-per-task 
> 20050
>  7906 ?Zs 0:00  |   |   \_ [haproxy] 
>  8628 ?Zs 0:00  |   |   \_ [haproxy] 
>  8722 ?Ss 0:00  |   |   \_ haproxy -p /tmp/haproxy.pid -f 
> /marathon-lb/haproxy.cfg -D -sf 144 52
> {noformat}
> update: mesos-executor should be registered as a subreaper ( 
> http://man7.org/linux/man-pages/man2/prctl.2.html ) and propagate signals. 
> code sample: https://github.com/krallin/tini/blob/master/src/tini.c



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8547) Mount devpts with compatible defaults.

2018-02-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354330#comment-16354330
 ] 

James Peach commented on MESOS-8547:


Note to self - we should also set something like {{max=1024}} since otherwise 
the default max for devpts is 2^20, which seems unreasonably high for an 
untrusted container.
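For illustration, the resulting mount would look something like this (gid=5 assumes the conventional tty group ID; max=1024 is the suggested cap):

{noformat}
#include <sys/mount.h>

int main()
{
  // Docker-compatible devpts defaults plus a sane pty instance limit.
  return mount(
      "devpts",
      "/dev/pts",
      "devpts",
      MS_NOSUID | MS_NOEXEC,
      "newinstance,ptmxmode=0666,mode=0620,gid=5,max=1024");
}
{noformat}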

> Mount devpts with compatible defaults.
> --
>
> Key: MESOS-8547
> URL: https://issues.apache.org/jira/browse/MESOS-8547
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The Mesos containerizer mounts {{devpts}} with the following options:
> {noformat}
> newinstance,ptmxmode=0666
> {noformat}
> Some versions of glibc (e.g. 
> [2.17|https://github.com/bminor/glibc/blob/glibc-2.17/sysdeps/unix/grantpt.c#L158]
>  from CentOS 7) are hard-coded to expect that terminal devices are owned by 
> the {{tty}} group, so containers that allocate TTYs end up trying to chown 
> the TTY (see the grantpt code in glibc).
> Docker uses the following {{devpts}} default:
> {noformat}
> Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", 
> "mode=0620", "gid=5"},
> {noformat}
> I can think of a number of options
> # hard-code the "gid=5" option
> # look up the "tty" group from the host
> # propagate the devpts mount options from the host
> # look up the "tty" group from the container
> # make it the operator's problem (i.e. add configuration)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8549) Notification program for manual intervention.

2018-02-06 Thread James Peach (JIRA)
James Peach created MESOS-8549:
--

 Summary: Notification program for manual intervention.
 Key: MESOS-8549
 URL: https://issues.apache.org/jira/browse/MESOS-8549
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: James Peach


If the Mesos agent needs manual intervention to start (e.g. because the 
resources or attributes changed), the agent will refuse to start. However, this 
is not obvious to operational systems, because mostly they will just observe 
that the agent is down without being able to describe why. One way to address 
this is for the agent to execute a program when this happens. Operators could 
then specify a program that updates the agent state in any relevant systems, 
which would make it easier to take the appropriate actions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8547) Mount devpts with compatible defaults.

2018-02-05 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353190#comment-16353190
 ] 

James Peach commented on MESOS-8547:


[This LWN article|https://lwn.net/Articles/688809/] explains the background 
pretty well.

> Mount devpts with compatible defaults.
> --
>
> Key: MESOS-8547
> URL: https://issues.apache.org/jira/browse/MESOS-8547
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The Mesos containerizer mounts {{devpts}} with the following options:
> {noformat}
> newinstance,ptmxmode=0666
> {noformat}
> Some versions of glibc (e.g. 
> [2.17|https://github.com/bminor/glibc/blob/glibc-2.17/sysdeps/unix/grantpt.c#L158]
>  from CentOS 7) are hard-coded to expect that terminal devices are owned by 
> the {{tty}} group, so this causes containers that allocate TTYs to expect to 
> have to chown the TTY (see grantpt code in glibc).
> Docker uses the following {{devpts}} default:
> {noformat}
> Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", 
> "mode=0620", "gid=5"},
> {noformat}
> I can think of a number of options
> # hard-code the "gid=5" option
> # look up the "tty" group from the host
> # propagate the devpts mount options from the host
> # look up the "tty" group from the container
> # make it the operator's problem (i.e. add configuration)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8547) Mount devpts with compatible defaults.

2018-02-05 Thread James Peach (JIRA)
James Peach created MESOS-8547:
--

 Summary: Mount devpts with compatible defaults.
 Key: MESOS-8547
 URL: https://issues.apache.org/jira/browse/MESOS-8547
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: James Peach
Assignee: James Peach


The Mesos containerizer mounts {{devpts}} with the following options:

{noformat}
newinstance,ptmxmode=0666
{noformat}

Some versions of glibc (e.g. 
[2.17|https://github.com/bminor/glibc/blob/glibc-2.17/sysdeps/unix/grantpt.c#L158]
 from CentOS 7) are hard-coded to expect that terminal devices are owned by the 
{{tty}} group, so containers that allocate TTYs end up trying to chown the 
TTY (see the grantpt code in glibc).

Docker uses the following {{devpts}} default:
{noformat}
Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", 
"mode=0620", "gid=5"},
{noformat}

I can think of a number of options

# hard-code the "gid=5" option
# look up the "tty" group from the host
# propagate the devpts mount options from the host
# look up the "tty" group from the container
# make it the operator's problem (i.e. add configuration)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8313) Provide a host namespace container supervisor.

2018-02-01 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8313:
---
Description: 
After more investigation on user namespaces, the current implementation of 
creating the container namespaces needs some adjustment before we can implement 
user namespaces in a useable fashion.

The problems we need to address are:

1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace to 
mount {{procfs}}. Currently, this prevents containers joining the host PID 
namespace. The workaround is to always create a new container PID namespace (as 
a child of the user namespace) with the {{namespaces/pid}} isolator.

2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network namespace 
to mount {{sysfs}}. There's no general workaround for this since we can't 
generally require containers to not join the host network namespace.

3. The containerizer can't enter a user namespace after entering the 
{{chroot}}. This restriction makes it impossible to retain the existing order of 
containerizer operations in the case where we want the executor to be in 
a new user namespace that has no children (i.e. to protect the container from a 
privileged task).

After some discussion with [~jieyu], we believe that we can solve most or all of 
these issues by creating a new container supervisor that runs fully outside 
the container and is responsible for constructing the rootfs mount namespace, 
launching the containerizer to enter the rest of the container, and waiting on 
the entered process.

Since this new supervisor process is not running in the user namespace, it will 
be able to construct the container rootfs in a new mount namespace without user 
namespace restrictions. We can then clone a child to fully create and enter 
container namespaces along with the prefabricated rootfs mount namespace.

The only drawback to this approach is that the container's mount namespace will 
be owned by the root user namespace rather than the container user namespace. 
We are OK with this for now.

The plan here is to retain the existing {{mesos-containerizer launch}} 
subcommand and add a new {{mesos-containerizer supervise}} subcommand, which 
will be its parent process. This new subcommand will be used for the default 
executor and custom executor code paths.
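A rough sketch of the process structure this implies (illustrative only; rootfs construction, the full namespace flag set, and all error handling are elided):

{noformat}
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int enter(void*)
{
  // Child: new user/pid namespaces, inheriting the mount namespace the
  // supervisor prepared. chroot(2) and exec of the container entry point
  // would happen here.
  return 0;
}

int main()
{
  // Supervisor: a fresh mount namespace owned by the *root* user namespace,
  // so rootfs construction is not subject to user namespace restrictions.
  unshare(CLONE_NEWNS);
  // ... construct the container rootfs here ...

  pid_t pid = clone(
      enter,
      stack + sizeof(stack),
      CLONE_NEWUSER | CLONE_NEWPID | SIGCHLD,
      nullptr);

  // The supervisor stays outside the container and waits on it.
  int status = 0;
  waitpid(pid, &status, 0);
  return status;
}
{noformat}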

  was:
After more investigation on user namespaces, the current implementation of 
creating the container namespaces needs some adjustment before we can implement 
user namespaces in a useable fashion.

The problems we need to address are:

1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace to 
mount {{procfs}}. Currently, this prevents containers joining the host PID 
namespace. The workaround is to always create a new container PID namespace (as 
a child of the user namespace) with the {{namespaces/pid}} isolator.

2. The containerized needs to hold {{CAP_SYS_ADMIN}} over the network namespace 
to mount {{sysfs}}. There's no general workaround for this since we can't 
generally require containers to not join the host network namespace.

3. The containerizer can't enter a user namespace after entering the 
{{chroot}}. This restriction makes it impossible to retain the existing order of 
containerizer operations in the case where we want the executor to be in 
a new user namespace that has no children (i.e. to protect the container from a 
privileged task).

After some discussion with [~jieyu], we believe that we can solve most or all of 
these issues by creating a new container supervisor that runs fully outside 
the container and is responsible for constructing the rootfs mount namespace, 
launching the containerizer to enter the rest of the container, and waiting on 
the entered process.

Since this new supervisor process is not running in the user namespace, it will 
be able to construct the container rootfs in a new mount namespace without user 
namespace restrictions. We can then clone a child to fully create and enter 
container namespaces along with the prefabricated rootfs mount namespace.

The only drawback to this approach is that the container's mount namespace will 
be owned by the root user namespace rather than the container user namespace. 
We are OK with this for now.

The plan here is to retain the existing {{mesos-containerizer launch}} 
subcommand and add a new {{mesos-containerizer supervise}} subcommand, which 
will be its parent process. This new subcommand will be used for the default 
executor and custom executor code paths.


> Provide a host namespace container supervisor.
> --
>
> Key: MESOS-8313
> URL: https://issues.apache.org/jira/browse/MESOS-8313
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: 

[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking

2018-02-01 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348971#comment-16348971
 ] 

James Peach commented on MESOS-7605:


{quote}
Qian Zhang That is exactly not the point of this change. CNI already supports 
setting the container hostname for all containers that have an image. The 
point of this isolator is to guarantee that the host's UTS namespace is 
protected from containers (case 1) above. I kept it explicitly out of scope for 
this isolator to actually set the hostname, since last time I did that, we 
ended up moving that feature to the CNI isolator.
{quote}

I believed that the CNI isolator did set up the hostname correctly when joining 
the host network; however, [~qianzhang] is right that the CNI isolator doesn't 
clone the UTS namespace unless you join a named network.

So I agree with [~qianzhang] that we should make the CNI isolator clone the UTS 
namespace (and set the hostname) when it joins the host network and has a 
container image. We will still need the UTS isolator for the case where there 
is no container image or the CNI isolator isn't used, however.

IIRC [~avinash.mesos]'s original concern about this was that the specified 
hostname would not be consistent with DNS. There are two things we can do about 
this: (1) just accept it, or (2) resolve the host's hostname and 
use that IP address to populate the container {{resolv.conf}}. AFAICT, Docker 
just does (1).
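For reference, the UTS namespace step amounts to something like this (the hostname is made up; cloning the namespace requires CAP_SYS_ADMIN):

{noformat}
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <sched.h>
#include <string.h>
#include <unistd.h>

int main()
{
  // New UTS namespace: sethostname() below no longer affects the host.
  if (unshare(CLONE_NEWUTS) != 0) {
    return 1;
  }

  const char* hostname = "example-container";  // Hypothetical.
  return sethostname(hostname, strlen(hostname));
}
{noformat}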

> UCR doesn't isolate uts namespace w/ host networking
> 
>
> Key: MESOS-7605
> URL: https://issues.apache.org/jira/browse/MESOS-7605
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James DeFelice
>Assignee: James Peach
>Priority: Major
>  Labels: mesosphere
>
> Docker's {{run}} command supports a {{--hostname}} parameter which impacts 
> container isolation, even in {{host}} network mode: (via 
> https://docs.docker.com/engine/reference/run/)
> {quote}
> Even in host network mode a container has its own UTS namespace by default. 
> As such --hostname is allowed in host network mode and will only change the 
> hostname inside the container. Similar to --hostname, the --add-host, --dns, 
> --dns-search, and --dns-option options can be used in host network mode.
> {quote}
> I see no evidence that UCR offers a similar isolation capability.
> Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was 
> initially added to support the Docker containerizer's use of the 
> {{--hostname}} Docker {{run}} flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking

2018-02-01 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348786#comment-16348786
 ] 

James Peach commented on MESOS-7605:


[~qianzhang] That is exactly not the point of this change. CNI already supports 
setting the container hostname for all containers that have an image. The 
point of this isolator is to guarantee that the host's UTS namespace is 
protected from containers (case 1) above. I kept it explicitly out of scope for 
this isolator to actually set the hostname, since last time I did that, we 
ended up moving that feature to the CNI isolator.

> UCR doesn't isolate uts namespace w/ host networking
> 
>
> Key: MESOS-7605
> URL: https://issues.apache.org/jira/browse/MESOS-7605
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James DeFelice
>Assignee: James Peach
>Priority: Major
>  Labels: mesosphere
>
> Docker's {{run}} command supports a {{--hostname}} parameter which impacts 
> container isolation, even in {{host}} network mode: (via 
> https://docs.docker.com/engine/reference/run/)
> {quote}
> Even in host network mode a container has its own UTS namespace by default. 
> As such --hostname is allowed in host network mode and will only change the 
> hostname inside the container. Similar to --hostname, the --add-host, --dns, 
> --dns-search, and --dns-option options can be used in host network mode.
> {quote}
> I see no evidence that UCR offers a similar isolation capability.
> Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was 
> initially added to support the Docker containerizer's use of the 
> {{--hostname}} Docker {{run}} flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8518) Make lost agent notifications optional for frameworks.

2018-01-31 Thread James Peach (JIRA)
James Peach created MESOS-8518:
--

 Summary: Make lost agent notifications optional for frameworks.
 Key: MESOS-8518
 URL: https://issues.apache.org/jira/browse/MESOS-8518
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: James Peach


When an agent is lost, not all frameworks really care, but there can be 
undesirable performance effects from suddenly sending a ton of messages all at 
once. Consider some mechanism for a framework to express that it doesn't care 
about the agent states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking

2018-01-29 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343766#comment-16343766
 ] 

James Peach commented on MESOS-7605:


[~jdef], [~qianzhang], [~avinash.mesos] Can any of you help review?

> UCR doesn't isolate uts namespace w/ host networking
> 
>
> Key: MESOS-7605
> URL: https://issues.apache.org/jira/browse/MESOS-7605
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James DeFelice
>Assignee: James Peach
>Priority: Major
>  Labels: mesosphere
>
> Docker's {{run}} command supports a {{--hostname}} parameter which impacts 
> container isolation, even in {{host}} network mode: (via 
> https://docs.docker.com/engine/reference/run/)
> {quote}
> Even in host network mode a container has its own UTS namespace by default. 
> As such --hostname is allowed in host network mode and will only change the 
> hostname inside the container. Similar to --hostname, the --add-host, --dns, 
> --dns-search, and --dns-option options can be used in host network mode.
> {quote}
> I see no evidence that UCR offers a similar isolation capability.
> Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was 
> initially added to support the Docker containerizer's use of the 
> {{--hostname}} Docker {{run}} flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8479) Document agent SIGUSR1 behavior.

2018-01-23 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8479:
---
Summary: Document agent SIGUSR1 behavior.  (was: Document agne SIGUSR1 
behavior.)

> Document agent SIGUSR1 behavior.
> 
>
> Key: MESOS-8479
> URL: https://issues.apache.org/jira/browse/MESOS-8479
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, documentation
>Reporter: James Peach
>Priority: Major
>
> The agent enters shutdown when it receives {{SIGUSR1}}. We should document 
> what this means, the corresponding behavior and how operators are intended to 
> use this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8479) Document agne SIGUSR1 behavior.

2018-01-23 Thread James Peach (JIRA)
James Peach created MESOS-8479:
--

 Summary: Document agne SIGUSR1 behavior.
 Key: MESOS-8479
 URL: https://issues.apache.org/jira/browse/MESOS-8479
 Project: Mesos
  Issue Type: Bug
  Components: agent, documentation
Reporter: James Peach


The agent enters shutdown when it receives {{SIGUSR1}}. We should document what 
this means, the corresponding behavior and how operators are intended to use 
this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7016) Make default AWAIT_* duration configurable

2018-01-22 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329715#comment-16329715
 ] 

James Peach edited comment on MESOS-7016 at 1/22/18 4:08 PM:
-

| [r/65201|https://reviews.apache.org/r/65201] | Added a global 
DEFAULT_TEST_TIMEOUT variable. |
| [r/65202|https://reviews.apache.org/r/65202] | Adopted the libprocess 
`DEFAULT_TEST_TIMEOUT`. |


was (Author: jamespeach):
| [r/65201|https://reviews.apache.org/r/65201] | Added a global 
DEFAULT_TEST_TIMEOUT variable. |
| [*r/65202|https://reviews.apache.org/*r/65202] | Adopted the libprocess 
`DEFAULT_TEST_TIMEOUT`. |

> Make default AWAIT_* duration configurable
> --
>
> Key: MESOS-7016
> URL: https://issues.apache.org/jira/browse/MESOS-7016
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>Assignee: James Peach
>Priority: Major
> Fix For: 1.6.0
>
>
> libprocess defines a number of helpers {{AWAIT_*}} to wait for a 
> {{process::Future}} reaching terminal states. These helpers are used in tests.
> Currently the default duration to wait before triggering an assertion failure 
> is 15s. This value was chosen as a compromise between failing fast on likely 
> fast developer machines, but also allowing enough time for tests to pass in 
> high-contention environments (e.g., overbooked CI machines).
> If a machine is more overloaded than expected, {{Futures}} might take longer 
> to reach the desired state, and tests could fail. Ultimately we should 
> consider running tests with a paused clock to eliminate this source of test 
> flakiness, see MESOS-4101, but as an intermediate measure we should make the 
> default timeout duration configurable.
> A simple approach might be to expose a build variable allowing users to set 
> at configure/cmake time a desired timeout duration for the setup they are 
> building for. This would allow us to define longer timeouts in the CI build 
> scripts, while keeping default timeouts as short as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-01-17 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329711#comment-16329711
 ] 

James Peach edited comment on MESOS-6575 at 1/17/18 11:53 PM:
--

Yeah, I think that using the soft limit is a pretty good idea. We can set the 
soft limit to the task's resources and the hard limit to the resources plus a 
fudge factor. We can kill applications based on either directly observing soft 
limit breaches, or on the quota warnings (we need to check whether XFS will 
reset them if the task goes back under the soft limit).

We should think about how to make this behaviour configurable per-task, since I 
still want to support the non-destructive case for storage tasks that can 
manage their space.
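
For illustration, a minimal sketch (not the actual isolator code; the device 
path, project ID, and units here are assumptions) of setting a split soft/hard 
XFS project quota via {{quotactl(2)}}:

{noformat}
// Sketch: set soft/hard block limits for an XFS project quota on Linux.
#include <sys/quota.h>        // quotactl(), QCMD, PRJQUOTA
#include <linux/dqblk_xfs.h>  // fs_disk_quota, Q_XSETQLIM, FS_DQ_*
#include <cstdint>
#include <cstdio>
#include <cstring>

// XFS expresses block limits in 512-byte "basic blocks".
static constexpr uint64_t BASIC_BLOCKS_PER_MB = (1024 * 1024) / 512;

int setProjectQuota(const char* device, uint32_t projectId,
                    uint64_t softMB, uint64_t hardMB)
{
  fs_disk_quota quota;
  memset(&quota, 0, sizeof(quota));

  quota.d_version = FS_DQUOT_VERSION;
  quota.d_id = projectId;
  quota.d_flags = FS_PROJ_QUOTA;
  quota.d_fieldmask = FS_DQ_BSOFT | FS_DQ_BHARD;
  quota.d_blk_softlimit = softMB * BASIC_BLOCKS_PER_MB;  // the task's resources
  quota.d_blk_hardlimit = hardMB * BASIC_BLOCKS_PER_MB;  // resources + fudge factor

  if (quotactl(QCMD(Q_XSETQLIM, PRJQUOTA), device, projectId,
               reinterpret_cast<caddr_t>(&quota)) < 0) {
    perror("quotactl(Q_XSETQLIM)");
    return -1;
  }
  return 0;
}
{noformat}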


was (Author: jamespeach):
Yeah, I think that using the soft limit is a pretty good idea. We can set the 
soft limit to the task's resources and the hard limit to the resources plus a 
fudge factor. We can kill applications based on either directly observing soft 
limit breaches, or on the quota warnings (we need to check whether XFS will 
reset them if the task goes back under the soft limit).

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation that 
> would cause the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the {{disk/xfs}} isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that would cause the quota to be exceeded. The 
> isolator can then track disk usage via {{xfs_quota}}, very much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, 
> causing the executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-01-17 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329711#comment-16329711
 ] 

James Peach commented on MESOS-6575:


Yeah, I think that using the soft limit is a pretty good idea. We can set the 
soft limit to the task's resources and the hard limit to the resources plus a 
fudge factor. We can kill applications based on either directly observing soft 
limit breaches, or on the quota warnings (we need to check whether XFS will 
reset them if the task goes back under the soft limit).

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation that 
> would cause the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the {{disk/xfs}} isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that would cause the quota to be exceeded. The 
> isolator can then track disk usage via {{xfs_quota}}, very much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, 
> causing the executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-01-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6575:
--

Assignee: James Peach

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation that 
> would cause the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the {{disk/xfs}} isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that would cause the quota to be exceeded. The 
> isolator can then track disk usage via {{xfs_quota}}, very much like 
> {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface 
> the quota-exceeded event via a {{ContainerLimitation}} protobuf, 
> causing the executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7016) Make default AWAIT_* duration configurable

2018-01-17 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329588#comment-16329588
 ] 

James Peach commented on MESOS-7016:


I have most of a patch that adds a global variable for the default timeout to 
{{libprocess}} and a Mesos test suite flag to configure it.
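
Roughly the shape such a patch might take (a sketch; the 
{{DEFAULT_TEST_TIMEOUT}} name matches the review titles elsewhere in this 
thread, everything else here is illustrative):

{noformat}
// Sketch: a mutable, process-wide default that the AWAIT_* helpers read.
#include <chrono>

namespace process {
namespace test {

// Defaults to the historical 15 seconds; the test main() would overwrite
// this from a test-suite flag before any tests run.
std::chrono::milliseconds DEFAULT_TEST_TIMEOUT{15000};

} // namespace test
} // namespace process

// An AWAIT_READY(future)-style helper would then wait for
// process::test::DEFAULT_TEST_TIMEOUT instead of a hard-coded 15 seconds.
{noformat}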

> Make default AWAIT_* duration configurable
> --
>
> Key: MESOS-7016
> URL: https://issues.apache.org/jira/browse/MESOS-7016
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>Assignee: James Peach
>Priority: Major
>
> libprocess defines a number of helpers {{AWAIT_*}} to wait for a 
> {{process::Future}} reaching terminal states. These helpers are used in tests.
> Currently the default duration to wait before triggering an assertion failure 
> is 15s. This value was chosen as a compromise between failing fast on likely 
> fast developer machines, but also allowing enough time for tests to pass in 
> high-contention environments (e.g., overbooked CI machines).
> If a machine is more overloaded than expected, {{Futures}} might take longer 
> to reach the desired state, and tests could fail. Ultimately we should 
> consider running tests with a paused clock to eliminate this source of test 
> flakiness, see MESOS-4101, but as an intermediate measure we should make the 
> default timeout duration configurable.
> A simple approach might be to expose a build variable allowing users to set 
> at configure/cmake time a desired timeout duration for the setup they are 
> building for. This would allow us to define longer timeouts in the CI build 
> scripts, while keeping default timeouts as short as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7016) Make default AWAIT_* duration configurable

2018-01-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-7016:
--

Assignee: James Peach

> Make default AWAIT_* duration configurable
> --
>
> Key: MESOS-7016
> URL: https://issues.apache.org/jira/browse/MESOS-7016
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>Assignee: James Peach
>Priority: Major
>
> libprocess defines a number of helpers {{AWAIT_*}} to wait for a 
> {{process::Future}} reaching terminal states. These helpers are used in tests.
> Currently the default duration to wait before triggering an assertion failure 
> is 15s. This value was chosen as a compromise between failing fast on likely 
> fast developer machines, but also allowing enough time for tests to pass in 
> high-contention environments (e.g., overbooked CI machines).
> If a machine is more overloaded than expected, {{Futures}} might take longer 
> to reach the desired state, and tests could fail. Ultimately we should 
> consider running tests with a paused clock to eliminate this source of test 
> flakiness, see MESOS-4101, but as an intermediate measure we should make the 
> default timeout duration configurable.
> A simple approach might be to expose a build variable allowing users to set 
> at configure/cmake time a desired timeout duration for the setup they are 
> building for. This would allow us to define longer timeouts in the CI build 
> scripts, while keeping default timeouts as short as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8440) `network/ports` isolator kills legitimate tasks on recovery.

2018-01-12 Thread James Peach (JIRA)
James Peach created MESOS-8440:
--

 Summary: `network/ports` isolator kills legitimate tasks on 
recovery.
 Key: MESOS-8440
 URL: https://issues.apache.org/jira/browse/MESOS-8440
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.5.0
Reporter: James Peach
Assignee: James Peach


At recovery time, the containerizer sends all the resources *except* the ports. 
This means that the ports check will race against the subsequent resources 
update. The root cause of this is that only the executor resources are provided 
at recovery time, whereas at update time the isolator gets the whole container 
resources as calculated by {{Executor::allocatedResources()}}.

{noformat}
I0112 08:22:23.930830 28937 linux_launcher.cpp:300] Recovered container 
80a2d9dc-0492-4af5-a131-05f1cd66d672
I0112 08:22:23.931637 28933 ports.cpp:398] recovering container executor_info {
  executor_id {
value: "fff42f68-4aed-4ca6-a62f-71b7166bbd7a"
  }
  resources {
name: "cpus"
type: SCALAR
scalar {
  value: 0.1
}
allocation_info {
  role: "*"
}
  }
  resources {
name: "mem"
type: SCALAR
scalar {
  value: 32
}
allocation_info {
  role: "*"
}
  }
  command {
value: "/home/jpeach/src/mesos/build/src/mesos-executor"
shell: false
arguments: "mesos-executor"
arguments: "--launcher_dir=/home/jpeach/src/mesos/build/src"
  }
  framework_id {
value: "4ad59c30-7b1e-4991-bda2-e7f9275d3693-"
  }
  name: "Command Executor (Task: fff42f68-4aed-4ca6-a62f-71b7166bbd7a) 
(Command: sh -c \'nc -k -l 31446\')"
  source: "fff42f68-4aed-4ca6-a62f-71b7166bbd7a"
}
container_id {
  value: "80a2d9dc-0492-4af5-a131-05f1cd66d672"
}
pid: 28955
directory: 
"/tmp/NetworkPortsIsolatorTest_ROOT_NC_RecoverGoodTask_eTlVKl/slaves/4ad59c30-7b1e-4991-bda2-e7f9275d3693-S0/frameworks/4ad59c30-7b1e-4991-bda2-e7f9275d3693-/executors/fff42f68-4aed-4ca6-a62f-71b7166bbd7a/runs/80a2d9dc-0492-4af5-a131-05f1cd66d672"
I0112 08:22:23.932137 28933 ports.cpp:530] Updated ports to [] for container 
80a2d9dc-0492-4af5-a131-05f1cd66d672
I0112 08:22:23.932982 28937 provisioner.cpp:493] Provisioner recovery complete
I0112 08:22:23.933924 28928 slave.cpp:6581] Sending reconnect request to 
executor 'fff42f68-4aed-4ca6-a62f-71b7166bbd7a' of framework 
4ad59c30-7b1e-4991-bda2-e7f9275d3693- at executor(1)@17.228.224.108:42187
I0112 08:22:23.934587 28957 exec.cpp:282] Received reconnect request from agent 
4ad59c30-7b1e-4991-bda2-e7f9275d3693-S0
I0112 08:22:23.935724 28931 slave.cpp:4426] Received re-registration message 
from executor 'fff42f68-4aed-4ca6-a62f-71b7166bbd7a' of framework 
4ad59c30-7b1e-4991-bda2-e7f9275d3693-
I0112 08:22:23.936646 28967 exec.cpp:259] Executor re-registered on agent 
4ad59c30-7b1e-4991-bda2-e7f9275d3693-S0
I0112 08:22:23.936820 28929 ports.cpp:530] Updated ports to [31446-31446] for 
container 80a2d9dc-0492-4af5-a131-05f1cd66d672
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319625#comment-16319625
 ] 

James Peach commented on MESOS-8413:


There's a similar issue with URLs for the {{CommandInfo.URI}} message. IIRC, 
when I looked into that, the problem was that there was no code to crack the 
credentials out of the URL, so it wasn't even clear whether URL credentials 
worked by anything other than accident. These passwords end up in log files.
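
For illustration, a sketch (a hypothetical helper, not existing Mesos code) of 
scrubbing the userinfo portion of a {{zk://}} URL before it reaches logs or the 
{{/flags}} endpoint:

{noformat}
#include <iostream>
#include <regex>
#include <string>

// Replace "scheme://user:pass@" with "scheme://<redacted>@".
std::string redactUserinfo(const std::string& url)
{
  static const std::regex userinfo("^([a-zA-Z][a-zA-Z0-9+.-]*://)[^/@]+@");
  return std::regex_replace(url, userinfo, "$1<redacted>@");
}

int main()
{
  std::cout << redactUserinfo("zk://user:passwd@127.0.0.1:2181/mesos")
            << std::endl;  // prints: zk://<redacted>@127.0.0.1:2181/mesos
  return 0;
}
{noformat}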

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, security
>
> No matter how one configures mesos, either by passing the ZooKeeper flags in 
> the command line or using a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the results of the {{/flags}} endpoint will resolve to the 
> contents of the flags, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
> "recovery_agent_removal_limit": "100%",
> "registry": "replicated_log",
> "registry_fetch_timeout": "1mins",
> "registry_gc_interval": "15mins",
> "registry_max_agent_age": "2weeks",
> "registry_max_agent_count": "102400",
> "registry_store_timeout": "20secs",
> "registry_strict": "false",
> "require_agent_domain": "false",
> "root_submissions": "true",
> "user_sorter": "drf",
> 

[jira] [Commented] (MESOS-8348) Enable function sections in the build.

2018-01-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318864#comment-16318864
 ] 

James Peach commented on MESOS-8348:


No apparent performance difference with a quick and arbitrary benchmark.

*Without GC unused sections:*

{noformat}
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
Starting reregistration for all agents
Reregistered 2000 agents with a total of 10 running tasks and 10 
completed tasks in 28.812622779secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
 (60329 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
Starting reregistration for all agents
Reregistered 2000 agents with a total of 20 running tasks and 0 completed 
tasks in 39.378296252secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
 (98509 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
Starting reregistration for all agents
Reregistered 2 agents with a total of 10 running tasks and 0 completed 
tasks in 45.240454686secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
 (80371 ms)
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test 
(239209 ms total)
{noformat}

*With GC unused sections:*

{noformat}
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
Starting reregistration for all agents
Reregistered 2000 agents with a total of 10 running tasks and 10 
completed tasks in 28.751620417secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
 (59282 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
Starting reregistration for all agents
Reregistered 2000 agents with a total of 20 running tasks and 0 completed 
tasks in 40.010202034secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
 (96938 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
Starting reregistration for all agents
Reregistered 2 agents with a total of 10 running tasks and 0 completed 
tasks in 44.541095336secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
 (79331 ms)
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test 
(235551 ms total)
{noformat}


> Enable function sections in the build.
> --
>
> Key: MESOS-8348
> URL: https://issues.apache.org/jira/browse/MESOS-8348
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>
> Enable {{-ffunction-sections}} to improve the ability of the toolchain to 
> remove unused code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.

2018-01-05 Thread James Peach (JIRA)
James Peach created MESOS-8410:
--

 Summary: Reconfiguration policy fails to handle mount disk 
resources.
 Key: MESOS-8410
 URL: https://issues.apache.org/jira/browse/MESOS-8410
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos agents 
that had mount disk resources configured, and it looks like the agent confused 
the size of the mount disk with the size of the work directory resource:


{noformat}
E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to 
perform recovery: Configuration change not permitted under 'additive' policy: 
Value of scalar resource 'disk' decreased from 183 to 868000
{noformat}

The {{--resources}} flag is
{noformat}
--resources="[
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 868000
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/a"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/b"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/c"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/d"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/e"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/f"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/g"
}
  }
}
  }
  ,
  {
"name": "disk",
"type": "SCALAR",
"scalar": {
  "value": 183
},
"disk": {
  "source": {
"type": "MOUNT",
"mount": {
  "root" : "/srv/mesos/volumes/h"
}
  }
}
  }
]
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8404) Improve image puller error messages.

2018-01-05 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8404:
---
Description: 
Saw this error message from the local docker puller:
{noformat}
Failed to launch container: Failed to read manifest: Failed to open file: No 
such file or directory.
{noformat}

Two problems with this:
# The error message from {{os::read}} is too verbose
# The error message from the puller doesn't tell us what it failed to read


  was:
Saw this error message from the local docker puller:
{noformat}
Failed to launch container: Failed to read manifest: Failed to open file: No 
such file or directory.
{noformat}

Two problems with this
# The error message from {os::read}} is too verbose
# The error message from the puller doesn't tell it what it failed to read



> Improve image puller error messages.
> 
>
> Key: MESOS-8404
> URL: https://issues.apache.org/jira/browse/MESOS-8404
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> Saw this error message from the local docker puller:
> {noformat}
> Failed to launch container: Failed to read manifest: Failed to open file: No 
> such file or directory.
> {noformat}
> Two problems with this:
> # The error message from {{os::read}} is too verbose
> # The error message from the puller doesn't tell us what it failed to read



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8405) Update master task loss handling.

2018-01-05 Thread James Peach (JIRA)
James Peach created MESOS-8405:
--

 Summary: Update master task loss handling.
 Key: MESOS-8405
 URL: https://issues.apache.org/jira/browse/MESOS-8405
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


From [~agentvindo.dev] in [r/64940|https://reviews.apache.org/r/64940/]:

{quote}
Ideally, we want terminal but unacknowledged tasks to still be marked 
unreachable in some way, either via task state being TASK_UNREACHABLE or task 
being present in unreachableTasks. This allows, for example, the WebUI to not 
show sandbox links for unreachable tasks irrespective of whether they were 
terminal or not before going unreachable. 

But doing this is tricky for various reasons:

--> updateTask() doesn't allow a terminal state to be transitioned to 
TASK_UNREACHABLE. Right now when we call updateTask for a terminal task, it 
adds TASK_UNREACHABLE status to Task.statuses and also sends it to operator API 
stream subscribers which looks incorrect. The fact that updateTask internally 
deals with already terminal tasks is a bad design decision in retrospect. I 
think the callers shouldn't call it for terminal tasks instead.

--> It's not clear to our users what a completed task means. The intention was 
for this to hold a cache of terminal and acknowledged tasks for storing recent 
history. The users of the WebUI probably equate "Completed Tasks" to terminal 
tasks irrespective of their acknowledgement status, which is why it is 
confusing for them to see terminal but unacknowledged tasks in the "Active 
tasks" section in the WebUI.

--> When a framework reconciles the state of a task on an unreachable agent, 
master replies with TASK_UNREACHABLE irrespective of whether the task was in a 
non-terminal state or terminal but un-acknowledged state or terminal and 
acknowledged state when the agent went unreachable.  

I think the direction we want to go towards is

--> Completed tasks should consist of terminal unacknowledged and terminal 
acknowledged tasks, likely in two different data structures.
--> Unreachable tasks should consist of all non-complete tasks on an 
unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
state.
{quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8404) Improve image puller error messages.

2018-01-05 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8404:
--

Assignee: James Peach

> Improve image puller error messages.
> 
>
> Key: MESOS-8404
> URL: https://issues.apache.org/jira/browse/MESOS-8404
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> Saw this error message from the local docker puller:
> {noformat}
> Failed to launch container: Failed to read manifest: Failed to open file: No 
> such file or directory.
> {noformat}
> Two problems with this
> # The error message from {os::read}} is too verbose
> # The error message from the puller doesn't tell it what it failed to read



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8404) Improve image puller error messages.

2018-01-05 Thread James Peach (JIRA)
James Peach created MESOS-8404:
--

 Summary: Improve image puller error messages.
 Key: MESOS-8404
 URL: https://issues.apache.org/jira/browse/MESOS-8404
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: James Peach
Priority: Minor


Saw this error message from the local docker puller:
{noformat}
Failed to launch container: Failed to read manifest: Failed to open file: No 
such file or directory.
{noformat}

Two problems with this
# The error message from {os::read}} is too verbose
# The error message from the puller doesn't tell it what it failed to read




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8332) Narrow the container sandbox permissions.

2018-01-04 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312210#comment-16312210
 ] 

James Peach edited comment on MESOS-8332 at 1/4/18 11:42 PM:
-

The Mesos {{user@}} list was notified of this change in [this thread| 
https://lists.apache.org/thread.html/3a3f932e946e7b4a603e9fcd7eb218a43b5885cd1d83ffd4ca310fe9@%3Cuser.mesos.apache.org%3E].


was (Author: jamespeach):
The Mesos {{user@}} list was notified of this change in [this thread| 
https://lists.apache.org/thread.html/3a3f932e946e7b4a603e9fcd7eb218a43b5885cd1d83ffd4ca310fe9@%3Cuser.mesos.apache.org%3E]

> Narrow the container sandbox permissions.
> -
>
> Key: MESOS-8332
> URL: https://issues.apache.org/jira/browse/MESOS-8332
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> Sandboxes are currently created with 0755 permissions, which allows anyone 
> with local machine access to inspect their contents. We should make them 0750 
> to limit access to the owning user and group.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8332) Narrow the container sandbox permissions.

2018-01-04 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312210#comment-16312210
 ] 

James Peach commented on MESOS-8332:


The Mesos {{user@}} list was notified of this change in [this thread| 
https://lists.apache.org/thread.html/3a3f932e946e7b4a603e9fcd7eb218a43b5885cd1d83ffd4ca310fe9@%3Cuser.mesos.apache.org%3E]

> Narrow the container sandbox permissions.
> -
>
> Key: MESOS-8332
> URL: https://issues.apache.org/jira/browse/MESOS-8332
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> Sandboxes are currently created with 0755 permissions, which allows anyone 
> with local machine access to inspect their contents. We should make them 0750 
> to limit access to the owning user and group.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8368) Improve HTTP parser to support HTTP/2 messages.

2018-01-02 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308938#comment-16308938
 ] 

James Peach commented on MESOS-8368:


Probably we should implement [SSL_CTX_set_next_protos_advertised_cb 
|https://www.openssl.org/docs/man1.1.0/ssl/SSL_set_alpn_protos.html] and only 
advertise {{http/1.1}}. This ought to prevent HTTP/2 negotiation, though it 
seems pretty aggressive of curl to try HTTP/2 without an explicit negotiation.
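
For reference, a sketch (OpenSSL 1.0.2+; the wiring is illustrative, not 
libprocess code) of pinning negotiation to {{http/1.1}} on the server side via 
the ALPN selection callback:

{noformat}
#include <openssl/ssl.h>

// Only ever select "http/1.1"; a client offering just "h2" gets no ACK
// and falls back to HTTP/1.1 per RFC 7301.
static int alpnSelect(SSL* /*ssl*/,
                      const unsigned char** out, unsigned char* outlen,
                      const unsigned char* in, unsigned int inlen,
                      void* /*arg*/)
{
  // Length-prefixed ALPN protocol list containing just "http/1.1".
  static const unsigned char http11[] = "\x08http/1.1";

  unsigned char* selected = nullptr;
  if (SSL_select_next_proto(&selected, outlen,
                            http11, sizeof(http11) - 1,
                            in, inlen) == OPENSSL_NPN_NEGOTIATED) {
    *out = selected;
    return SSL_TLSEXT_ERR_OK;
  }
  return SSL_TLSEXT_ERR_NOACK;
}

void restrictToHttp11(SSL_CTX* ctx)
{
  SSL_CTX_set_alpn_select_cb(ctx, alpnSelect, nullptr);
}
{noformat}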

> Improve HTTP parser to support HTTP/2 messages.
> ---
>
> Key: MESOS-8368
> URL: https://issues.apache.org/jira/browse/MESOS-8368
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>
> We currently use [http-parser|https://github.com/nodejs/http-parser] to parse 
> HTTP messages. This parser does not work with HTTP/2 requests and responses, 
> which is an issue as curl enables HTTP/2 by default for HTTPS connections 
> since its version 7.47.
> The issue has been discovered in some of our tests (e.g. 
> ProvisionerDockerTest) where it crashes with the message {{Failed to decode 
> HTTP responses: Decoding failed}}. See 
> [MESOS-8335|https://issues.apache.org/jira/browse/MESOS-8335] for more 
> details.
> Possible long-term solutions:
> * Upgrade the parser to be compatible with HTTP/2 messages. 
> [http-parser|https://github.com/nodejs/http-parser] has not been updated 
> regularly this past year in favor of 
> [nghttp2|https://github.com/nghttp2/nghttp2] which has a much broader scope. 
> [There is no equivalent of http-parser for HTTP/2 
> yet|https://users.rust-lang.org/t/is-there-anything-similar-to-http-parser-but-for-http2/10721].
> * Test which version of curl is used at startup and report an error if the 
> version is >= 7.47 and the flag {{--http1.0}} is not used in curl (more 
> details regarding this flag are available 
> [here|https://curl.haxx.se/docs/manpage.html]).
> In the meantime, we are upgrading our testing machines using a recent version 
> of curl to run with the flag {{--http1.0}} 
> ([MESOS-8335|https://issues.apache.org/jira/browse/MESOS-8335]).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8366) Replace the command executor with the default executor.

2017-12-28 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305959#comment-16305959
 ] 

James Peach edited comment on MESOS-8366 at 12/29/17 4:51 AM:
--

Issues that I have found so far:

# Tests that restart the agent are now required to specify a fixed {{slaveId}}
# Tests that inspect the task sandbox need to now find the nested container 
sandbox
# Tests are likely to require additional expectations (since both the executor 
and task containers might trigger them)
# The IO Switchboard doesn't work in local mode, which breaks command checks.
# Tests that depend on manipulating or intercepting protobuf messages from the 
executor (e.g. {{MasterTest.AgentRestartNoReregister}})

I fixed the `FetcherCacheTest` suite, leaving the following non-root test 
failures:
{noformat}
[==] 310 tests from 130 test cases ran. (367254 ms total)
[  PASSED  ] 292 tests.
[  FAILED  ] 18 tests, listed below:
[  FAILED  ] CommandExecutorCheckTest.CommandCheckTimeout
[  FAILED  ] ContainerLoggerTest.DefaultToSandbox
[  FAILED  ] FetcherCacheHttpTest.HttpCachedConcurrent
[  FAILED  ] FetcherTest.Unzip_ExtractFile
[  FAILED  ] HealthCheckTest.HealthyTask
[  FAILED  ] HealthCheckTest.CheckCommandTimeout
[  FAILED  ] MasterTest.AgentRestartNoReregister
[  FAILED  ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.Reboot, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RegisterDisconnectedSlave, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.MultipleFrameworks, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveTest.ShutdownUnregisteredExecutor
[  FAILED  ] SlaveTest.GetExecutorInfoForTaskWithContainer
[  FAILED  ] ContentType/AgentAPITest.GetState/1, where GetParam() = 
application/json
[  FAILED  ] 
ContentType/AgentAPITest.LaunchNestedContainerSessionUnauthorized/1, where 
GetParam() = application/json
[  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/2, where 
GetParam() = (1, 0)
[  FAILED  ] 
DiskResource/PersistentVolumeTest.DestroyPersistentVolumeMultipleTasks/0, where 
GetParam() = (0, 0)
{noformat}


was (Author: jamespeach):
Issues that I have found so far:

# Tests that restart the agent are now required to specify a fixed {{slaveId}}
# Tests that inspect the task sandbox need to now find the nested container 
sandbox
# Tests are likely to require additional expectations (since both the executor 
and task containers might trigger them)
# The IO Switchboard doesn't work in local mode, which breaks command checks.

I fixed the `FetcherCacheTest` suite, leaving the following non-root test 
failures:
{noformat}
[==] 310 tests from 130 test cases ran. (367254 ms total)
[  PASSED  ] 292 tests.
[  FAILED  ] 18 tests, listed below:
[  FAILED  ] CommandExecutorCheckTest.CommandCheckTimeout
[  FAILED  ] ContainerLoggerTest.DefaultToSandbox
[  FAILED  ] FetcherCacheHttpTest.HttpCachedConcurrent
[  FAILED  ] FetcherTest.Unzip_ExtractFile
[  FAILED  ] HealthCheckTest.HealthyTask
[  FAILED  ] HealthCheckTest.CheckCommandTimeout
[  FAILED  ] MasterTest.AgentRestartNoReregister
[  FAILED  ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.Reboot, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RegisterDisconnectedSlave, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.MultipleFrameworks, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveTest.ShutdownUnregisteredExecutor
[  FAILED  ] SlaveTest.GetExecutorInfoForTaskWithContainer
[  FAILED  ] ContentType/AgentAPITest.GetState/1, where GetParam() = 
application/json
[  FAILED  ] 
ContentType/AgentAPITest.LaunchNestedContainerSessionUnauthorized/1, where 
GetParam() = application/json
[  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/2, where 
GetParam() = (1, 0)
[  FAILED  ] 
DiskResource/PersistentVolumeTest.DestroyPersistentVolumeMultipleTasks/0, where 
GetParam() = (0, 0)
{noformat}

> Replace the command executor with the default executor.
> ---
>
> Key: MESOS-8366
> URL: https://issues.apache.org/jira/browse/MESOS-8366
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: James Peach
>Assignee: 

[jira] [Commented] (MESOS-8366) Replace the command executor with the default executor.

2017-12-28 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305959#comment-16305959
 ] 

James Peach commented on MESOS-8366:


Issues that I have found so far:

# Tests that restart the agent are now required to specify a fixed {{slaveId}}
# Tests that inspect the task sandbox need to now find the nested container 
sandbox
# Tests are likely to require additional expectations (since both the executor 
and task containers might trigger them)
# The IO Switchboard doesn't work in local mode, which breaks command checks.

I fixed the `FetcherCacheTest` suite, leaving the following non-root test 
failures:
{noformat}
[==] 310 tests from 130 test cases ran. (367254 ms total)
[  PASSED  ] 292 tests.
[  FAILED  ] 18 tests, listed below:
[  FAILED  ] CommandExecutorCheckTest.CommandCheckTimeout
[  FAILED  ] ContainerLoggerTest.DefaultToSandbox
[  FAILED  ] FetcherCacheHttpTest.HttpCachedConcurrent
[  FAILED  ] FetcherTest.Unzip_ExtractFile
[  FAILED  ] HealthCheckTest.HealthyTask
[  FAILED  ] HealthCheckTest.CheckCommandTimeout
[  FAILED  ] MasterTest.AgentRestartNoReregister
[  FAILED  ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.Reboot, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RegisterDisconnectedSlave, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.MultipleFrameworks, where TypeParam = 
mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveTest.ShutdownUnregisteredExecutor
[  FAILED  ] SlaveTest.GetExecutorInfoForTaskWithContainer
[  FAILED  ] ContentType/AgentAPITest.GetState/1, where GetParam() = 
application/json
[  FAILED  ] 
ContentType/AgentAPITest.LaunchNestedContainerSessionUnauthorized/1, where 
GetParam() = application/json
[  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/2, where 
GetParam() = (1, 0)
[  FAILED  ] 
DiskResource/PersistentVolumeTest.DestroyPersistentVolumeMultipleTasks/0, where 
GetParam() = (0, 0)
{noformat}

> Replace the command executor with the default executor.
> ---
>
> Key: MESOS-8366
> URL: https://issues.apache.org/jira/browse/MESOS-8366
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: James Peach
>Assignee: James Peach
>
> We should use the default executor for all the cases that currently invoke 
> the command executor. This is a straightforward matter of implementing 
> `LaunchTask` in the default executor, and then fixing all the test 
> assumptions that this change will break.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8366) Replace the command executor with the default executor.

2017-12-28 Thread James Peach (JIRA)
James Peach created MESOS-8366:
--

 Summary: Replace the command executor with the default executor.
 Key: MESOS-8366
 URL: https://issues.apache.org/jira/browse/MESOS-8366
 Project: Mesos
  Issue Type: Bug
  Components: agent, executor
Reporter: James Peach
Assignee: James Peach


We should use the default executor for all the cases that currently invoke the 
command executor. This is a straightforward matter of implementing `LaunchTask` 
in the default executor, and then fixing all the test assumptions that this 
change will break.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8337) Invalid state transition attempted when agent is lost.

2017-12-24 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302898#comment-16302898
 ] 

James Peach commented on MESOS-8337:


[~jieyu] This is a blocker for 1.5. I have a wacky patch that needs some 
cleanup and analysis before I can post it.

> Invalid state transition attempted when agent is lost.
> --
>
> Key: MESOS-8337
> URL: https://issues.apache.org/jira/browse/MESOS-8337
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: James Peach
>
> The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to 
> {{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that 
> was added in the same patch.
> {noformat}
> I1214 23:42:16.507931 22396 master.cpp:10155] Removing task 
> mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; 
> cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: 
> *):[31068-31068, 31069-31069, 31072-31072] of framework 
> afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent 
> daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 
> (magent0006.xxx.com)
> F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == 
> TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7643) The order of isolators provided in '--isolation' flag is not preserved and instead sorted alphabetically

2017-12-24 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302897#comment-16302897
 ] 

James Peach commented on MESOS-7643:


[~jieyu] RFC review here https://reviews.apache.org/r/62472/

> The order of isolators provided in '--isolation' flag is not preserved and 
> instead sorted alphabetically
> 
>
> Key: MESOS-7643
> URL: https://issues.apache.org/jira/browse/MESOS-7643
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.2, 1.2.0, 1.3.0
>Reporter: Michael Cherny
>Assignee: James Peach
>Priority: Critical
>  Labels: isolation
>
> According to documentation and comments in code the order of the entries in 
> the --isolation flag should specify the ordering of the isolators. 
> Specifically, the `create` and `prepare` calls for each isolator should run 
> serially in the order in which they appear in the --isolation flag, while the 
> `cleanup` call should be serialized in reverse order (with exception of 
> filesystem isolator which is always first).
> But in fact, the isolators provided in the '--isolation' flag are sorted 
> alphabetically.
> That happens in [this line of 
> code|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L377],
> which uses a 'set' (apparently instead of a list or vector), and a set is a 
> sorted container.
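
A minimal standalone illustration (not the containerizer code itself) of how a 
'set' silently re-sorts the order given in the flag:

{noformat}
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main()
{
  // The order as it might appear in --isolation.
  const std::vector<std::string> flagOrder =
    {"network/cni", "cgroups/cpu", "disk/xfs"};

  // Copying into a set re-sorts alphabetically, losing the flag order.
  const std::set<std::string> isolators(flagOrder.begin(), flagOrder.end());

  for (const std::string& name : isolators) {
    std::cout << name << std::endl;  // cgroups/cpu, disk/xfs, network/cni
  }
  return 0;
}
{noformat}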



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8348) Enable function sections in the build.

2017-12-19 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297792#comment-16297792
 ] 

James Peach edited comment on MESOS-8348 at 12/20/17 2:18 AM:
--

Tested on a 4CPU/8G VM, building without cache, {{GTEST_FILTER="" time make -j2 
check}}.

Without any settings:
{noformat}
11517.45user 1028.58system 1:51:31elapsed 187%CPU (0avgtext+0avgdata 
4823956maxresident)k
8710392inputs+83178080outputs (10126major+275942791minor)pagefaults 0swaps
{noformat}

With CXXFLAGS={{\-ffunction-sections \-fdata-sections}} and 
LDFLAGS={{\-Wl,\--gc-sections}}:
{noformat}
9962.13user 893.62system 1:35:17elapsed 189%CPU (0avgtext+0avgdata 
3923732maxresident)k
1994920inputs+38351264outputs (3577major+239138696minor)pagefaults 0swaps
{noformat}

The build time is improved, and the final linked objects are significantly 
smaller:

|| Artifact || Normal || GC sections ||
| src/.libs/libmesos-1.5.0.so| 766M | 274M| 
| src/mesos-agent| 6.5M | 1.6M| 
| src/mesos-cni-port-mapper  | 1.8M |  65K| 
| src/mesos-containerizer| 2.7M | 477K| 
| src/mesos-default-executor |  13M | 4.6M| 
| src/mesos-docker-executor  | 9.6M | 3.6M| 
| src/mesos-execute  | 7.5M | 2.6M| 
| src/mesos-executor | 7.5M | 2.6M| 
| src/mesos-fetcher  | 6.1M | 1.9M| 
| src/mesos-io-switchboard   | 3.7M | 874K| 
| src/mesos-local| 4.8M | 1.4M| 
| src/mesos-log  | 1.8M | 348K| 
| src/mesos-logrotate-logger | 4.7M | 1.6M| 
| src/mesos-master   | 6.3M | 1.6M| 
| src/mesos-network-helper   | 4.2M | 1.2M| 
| src/mesos-resolve  | 2.7M | 642K| 
| src/mesos-tcp-connect  | 2.3M | 630K| 
| src/mesos-tests| 557M |  89M| 
| src/mesos-usage| 3.0M | 955K| 
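
For reference, these settings can be applied at configure time roughly like so 
(build layout assumed):

{noformat}
../configure CXXFLAGS="-ffunction-sections -fdata-sections" \
             LDFLAGS="-Wl,--gc-sections"
{noformat}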




was (Author: jamespeach):
Tested on a 4CPU/8G VM, building without cache, {{GTEST_FILTER="" time make -j2 
check}}.

Without any settings:
{noformat}
11517.45user 1028.58system 1:51:31elapsed 187%CPU (0avgtext+0avgdata 
4823956maxresident)k
8710392inputs+83178080outputs (10126major+275942791minor)pagefaults 0swaps
{noformat}

With CXXFLAGS={{-ffunction-sections -fdata-sections}} and 
LDFLAGS={{-Wl,--gc-sections}}:
{noformat}
9962.13user 893.62system 1:35:17elapsed 189%CPU (0avgtext+0avgdata 
3923732maxresident)k
1994920inputs+38351264outputs (3577major+239138696minor)pagefaults 0swaps
{noformat}

The build time is improved, and the final linked objects are significantly 
smaller:

|| Artifact || Normal || GC sections ||
| src/.libs/libmesos-1.5.0.so| 766M | 274M| 
| src/mesos-agent| 6.5M | 1.6M| 
| src/mesos-cni-port-mapper  | 1.8M |  65K| 
| src/mesos-containerizer| 2.7M | 477K| 
| src/mesos-default-executor |  13M | 4.6M| 
| src/mesos-docker-executor  | 9.6M | 3.6M| 
| src/mesos-execute  | 7.5M | 2.6M| 
| src/mesos-executor | 7.5M | 2.6M| 
| src/mesos-fetcher  | 6.1M | 1.9M| 
| src/mesos-io-switchboard   | 3.7M | 874K| 
| src/mesos-local| 4.8M | 1.4M| 
| src/mesos-log  | 1.8M | 348K| 
| src/mesos-logrotate-logger | 4.7M | 1.6M| 
| src/mesos-master   | 6.3M | 1.6M| 
| src/mesos-network-helper   | 4.2M | 1.2M| 
| src/mesos-resolve  | 2.7M | 642K| 
| src/mesos-tcp-connect  | 2.3M | 630K| 
| src/mesos-tests| 557M |  89M| 
| src/mesos-usage| 3.0M | 955K| 



> Enable function sections in the build.
> --
>
> Key: MESOS-8348
> URL: https://issues.apache.org/jira/browse/MESOS-8348
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>
> Enable {{-ffunction-sections}} to improve the ability of the toolchain to 
> remove unused code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8348) Enable function sections in the build.

2017-12-19 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297792#comment-16297792
 ] 

James Peach commented on MESOS-8348:


Tested on a 4CPU/8G VM, building without cache, {{GTEST_FILTER="" time make -j2 
check}}.

Without any settings:
{noformat}
11517.45user 1028.58system 1:51:31elapsed 187%CPU (0avgtext+0avgdata 
4823956maxresident)k
8710392inputs+83178080outputs (10126major+275942791minor)pagefaults 0swaps
{noformat}

With CXXFLAGS={{-ffunction-sections -fdata-sections}} and 
LDFLAGS={{-Wl,--gc-sections}}:
{noformat}
9962.13user 893.62system 1:35:17elapsed 189%CPU (0avgtext+0avgdata 
3923732maxresident)k
1994920inputs+38351264outputs (3577major+239138696minor)pagefaults 0swaps
{noformat}

The build time is improved, and the final linked objects are significantly 
smaller:

|| Artifact || Normal || GC sections ||
| src/.libs/libmesos-1.5.0.so| 766M | 274M| 
| src/mesos-agent| 6.5M | 1.6M| 
| src/mesos-cni-port-mapper  | 1.8M |  65K| 
| src/mesos-containerizer| 2.7M | 477K| 
| src/mesos-default-executor |  13M | 4.6M| 
| src/mesos-docker-executor  | 9.6M | 3.6M| 
| src/mesos-execute  | 7.5M | 2.6M| 
| src/mesos-executor | 7.5M | 2.6M| 
| src/mesos-fetcher  | 6.1M | 1.9M| 
| src/mesos-io-switchboard   | 3.7M | 874K| 
| src/mesos-local| 4.8M | 1.4M| 
| src/mesos-log  | 1.8M | 348K| 
| src/mesos-logrotate-logger | 4.7M | 1.6M| 
| src/mesos-master   | 6.3M | 1.6M| 
| src/mesos-network-helper   | 4.2M | 1.2M| 
| src/mesos-resolve  | 2.7M | 642K| 
| src/mesos-tcp-connect  | 2.3M | 630K| 
| src/mesos-tests| 557M |  89M| 
| src/mesos-usage| 3.0M | 955K| 



> Enable function sections in the build.
> --
>
> Key: MESOS-8348
> URL: https://issues.apache.org/jira/browse/MESOS-8348
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>
> Enable {{-ffunction-sections}} to improve the ability of the toolchain to 
> remove unused code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8348) Enable function sections in the build.

2017-12-19 Thread James Peach (JIRA)
James Peach created MESOS-8348:
--

 Summary: Enable function sections in the build.
 Key: MESOS-8348
 URL: https://issues.apache.org/jira/browse/MESOS-8348
 Project: Mesos
  Issue Type: Bug
  Components: build
Reporter: James Peach


Enable {{-ffunction-sections}} to improve the ability of the toolchain to 
remove unused code.
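
A quick way to confirm the flag took effect (illustrative; the object path is 
a placeholder) is to look for the per-function {{.text.*}} sections that 
{{-ffunction-sections}} emits in each compiled object:

{noformat}
$ readelf -S path/to/object.o | grep '\.text\.'
{noformat}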



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8348) Enable function sections in the build.

2017-12-19 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8348:
--

Assignee: James Peach

> Enable function sections in the build.
> --
>
> Key: MESOS-8348
> URL: https://issues.apache.org/jira/browse/MESOS-8348
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>
> Enable {{-ffunction-sections}} to improve the ability of the toolchain to 
> remove unused code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8340) Add a no-enforce isolation option.

2017-12-15 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293466#comment-16293466
 ] 

James Peach commented on MESOS-8340:


[~jieyu] Do you think this is reasonable?

> Add a no-enforce isolation option.
> --
>
> Key: MESOS-8340
> URL: https://issues.apache.org/jira/browse/MESOS-8340
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>
> Some resource isolators ({{disk/du}}, {{disk/xfs}} and {{network/ports}}) 
> have the ability to run in a no-enforce mode where they report resource usage 
> but do not enforce the allocated resource limit. Rather than a separate flag 
> for each possibility, we could add an agent flag named 
> {{\-\-noenforce-isolation}} that causes the agent to log any limitation 
> raised by the given list of isolators, but would not cause the container to 
> be killed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8340) Add a no-enforce isolation option.

2017-12-15 Thread James Peach (JIRA)
James Peach created MESOS-8340:
--

 Summary: Add a no-enforce isolation option.
 Key: MESOS-8340
 URL: https://issues.apache.org/jira/browse/MESOS-8340
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: James Peach


Some resource isolators ({{disk/du}}, {{disk/xfs}} and {{network/ports}}) have 
the ability to run in a no-enforce mode where they report resource usage but do 
not enforce the allocated resource limit. Rather than a separate flag for each 
possibility, we could add an agent flag named {{\-\-noenforce-isolation}} that 
causes the agent to log any limitation raised by the given list of isolators, 
but would not cause the container to be killed.
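
For illustration, an agent invocation under this proposal might look like the 
following ({{--noenforce-isolation}} is the proposed flag and does not exist 
today; the comma-separated value syntax is an assumption):

{noformat}
mesos-agent --isolation=disk/du,network/ports \
            --noenforce-isolation=disk/du,network/ports \
            ...
{noformat}

Both isolators would keep reporting usage and raising limitations, but the 
agent would only log the limitations instead of killing the container.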



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8337) Invalid state transition attempted when agent is lost.

2017-12-15 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8337:
---
Summary: Invalid state transition attempted when agent is lost.  (was: 
Invalid state transitions when agent is lost)

> Invalid state transition attempted when agent is lost.
> --
>
> Key: MESOS-8337
> URL: https://issues.apache.org/jira/browse/MESOS-8337
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: James Peach
>
> The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to 
> {{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that 
> was added in the same patch.
> {noformat}
> I1214 23:42:16.507931 22396 master.cpp:10155] Removing task 
> mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; 
> cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: 
> *):[31068-31068, 31069-31069, 31072-31072] of framework 
> afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent 
> daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 
> (magent0006.xxx.com)
> F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == 
> TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8337) Invalid state transitions when agent is lost

2017-12-15 Thread James Peach (JIRA)
James Peach created MESOS-8337:
--

 Summary: Invalid state transitions when agent is lost
 Key: MESOS-8337
 URL: https://issues.apache.org/jira/browse/MESOS-8337
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: James Peach


The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to 
{{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that 
was added in the same patch.

{noformat}
I1214 23:42:16.507931 22396 master.cpp:10155] Removing task 
mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; 
cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: 
*):[31068-31068, 31069-31069, 31072-31072] of framework 
afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent 
daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 
(magent0006.xxx.com)
F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == 
TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED
{noformat}
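
A minimal sketch of the kind of guard that would avoid the bad transition 
(simplified names, not the actual master code):

{code}
// Sketch only: when removing a lost agent, leave tasks that already
// reached a terminal state alone instead of forcing them to TASK_LOST.
enum TaskState { TASK_RUNNING, TASK_UNREACHABLE, TASK_FAILED, TASK_LOST };

bool isTerminal(TaskState state)
{
  // Other terminal states (FINISHED, KILLED, ...) elided for brevity.
  return state == TASK_FAILED || state == TASK_LOST;
}

void transitionToLost(TaskState& state)
{
  if (!isTerminal(state)) {
    state = TASK_LOST;
  }
}
{code}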



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8332) Narrow the container sandbox permissions.

2017-12-13 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290181#comment-16290181
 ] 

James Peach commented on MESOS-8332:


In tests, I notice that {{chown}} on the executor sandbox path logs a warning 
but doesn't cause a failure, whereas {{chown}} on nested and standalone 
container paths fails the container. There might be a compatibility concern 
with making this behavior consistent, since frameworks can currently be sloppy 
with their user names without failing.

> Narrow the container sandbox permissions.
> -
>
> Key: MESOS-8332
> URL: https://issues.apache.org/jira/browse/MESOS-8332
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> Sandboxes are currently created with 0755 permissions, which allows anyone 
> with local machine access to inspect their contents. We should make them 0750 
> to limit access to the owning user and group.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8332) Narrow the container sandbox permissions.

2017-12-13 Thread James Peach (JIRA)
James Peach created MESOS-8332:
--

 Summary: Narrow the container sandbox permissions.
 Key: MESOS-8332
 URL: https://issues.apache.org/jira/browse/MESOS-8332
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach
Assignee: James Peach
Priority: Minor


Sandboxes are currently created with 0755 permissions, which allows anyone with 
local machine access to inspect their contents. We should make them 0750 to 
limit access to the owning user and group.
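
Sketched in shell terms (the path and ownership variables are placeholders):

{noformat}
# Today: the sandbox is world readable/traversable.
install -d -m 0755 -o "$USER" -g "$GROUP" "$SANDBOX"

# Proposed: drop the world bits so only the owning user and group
# can inspect the sandbox contents.
install -d -m 0750 -o "$USER" -g "$GROUP" "$SANDBOX"
{noformat}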



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8330) Document nested container ACLs

2017-12-13 Thread James Peach (JIRA)
James Peach created MESOS-8330:
--

 Summary: Document nested container ACLs
 Key: MESOS-8330
 URL: https://issues.apache.org/jira/browse/MESOS-8330
 Project: Mesos
  Issue Type: Bug
  Components: containerization, documentation
Reporter: James Peach


None of the nested container ACLs are documented. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-13 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289761#comment-16289761
 ] 

James Peach commented on MESOS-8306:


This approach depends on all the agents in a specific class registering with 
the same principal, right? That seems like a bad idea.

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.
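
To make that concrete, a hypothetical sketch of the enhanced ACL (the 
{{reserved_roles}} object is invented for illustration; the existing 
{{register_agents}} ACL only matches an {{agents}} object):

{noformat}
{
  "register_agents": [
    {
      "principals": { "values": ["foo"] },
      "reserved_roles": { "values": ["bar", "baz"] }
    }
  ]
}
{noformat}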



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-11 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286509#comment-16286509
 ] 

James Peach edited comment on MESOS-8306 at 12/11/17 9:26 PM:
--

Can you be more specific about the proposal? I can't match your description up 
to the ACLs docs.


was (Author: jamespeach):
That generally sounds reasonable to me. I expect you want to mirror this into 
{{UnreserveResources}} for consistency. Think about how this could be extended, 
e.g. reserve only {{disk}} or {{cpu}} resources.

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-11 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286509#comment-16286509
 ] 

James Peach commented on MESOS-8306:


That generally sounds reasonable to me. I expect you want to mirror this into 
{{UnreserveResources}} for consistency. Think about how this could be extended, 
e.g. reserve only {{disk}} or {{cpu}} resources.

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.

2017-12-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284945#comment-16284945
 ] 

James Peach commented on MESOS-8317:


The executor failed because it had older protobufs than the scheduler. It was 
using the JSON content type, and the Go jsonpb package rejects any message 
containing a field that it doesn't know about. The field in question was the 
{{protocol}} field in the {{HealthCheck}} message.

> Check failed when newly registered executor has launched tasks.
> ---
>
> Key: MESOS-8317
> URL: https://issues.apache.org/jira/browse/MESOS-8317
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>
> This check in {{slave/slave.cpp}} can fail:
> {code}
> if (state != RECOVERING &&
>     executor->queuedTasks.empty() &&
>     executor->queuedTaskGroups.empty()) {
>   CHECK(executor->launchedTasks.empty())
>     << " Newly registered executor '" << executor->id
>     << "' has launched tasks";
>
>   LOG(WARNING) << "Shutting down the executor " << *executor
>                << " because it has no tasks to run";
>
>   _shutdownExecutor(framework, executor);
>
>   return;
> }
> {code}
> This happens with the following sequence of events:
> 1. HTTP executor subscribes
> 2. Agent sends a LAUNCH message that the executor can't decode
> 3. HTTP executor closes the channel and re-subscribes
> 4. Agent hits the above check because the executor sends an empty task list 
> (it never understood the LAUNCH message), but the agent thinks that a task 
> should have been launched.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8317) Check failed when newly registered executor has launched tasks.

2017-12-08 Thread James Peach (JIRA)
James Peach created MESOS-8317:
--

 Summary: Check failed when newly registered executor has launched 
tasks.
 Key: MESOS-8317
 URL: https://issues.apache.org/jira/browse/MESOS-8317
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


This check in {{slave/slave.cpp}} can fail:

{code}
if (state != RECOVERING &&
    executor->queuedTasks.empty() &&
    executor->queuedTaskGroups.empty()) {
  CHECK(executor->launchedTasks.empty())
    << " Newly registered executor '" << executor->id
    << "' has launched tasks";

  LOG(WARNING) << "Shutting down the executor " << *executor
               << " because it has no tasks to run";

  _shutdownExecutor(framework, executor);

  return;
}
{code}

This happens with the following sequence of events:

1. HTTP executor subscribes
2. Agent sends a LAUNCH message that the executor can't decode
3. HTTP executor closes the channel and re-subscribes
4. Agent hits the above check because the executor sends an empty task list 
(it never understood the LAUNCH message), but the agent thinks that a task 
should have been launched.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.

2017-12-08 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284378#comment-16284378
 ] 

James Peach commented on MESOS-8317:


/cc [~vinodkone]

> Check failed when newly registered executor has launched tasks.
> ---
>
> Key: MESOS-8317
> URL: https://issues.apache.org/jira/browse/MESOS-8317
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>
> This check in {{slave/slave.cpp}} can fail:
> {code}
> if (state != RECOVERING &&
>     executor->queuedTasks.empty() &&
>     executor->queuedTaskGroups.empty()) {
>   CHECK(executor->launchedTasks.empty())
>     << " Newly registered executor '" << executor->id
>     << "' has launched tasks";
>
>   LOG(WARNING) << "Shutting down the executor " << *executor
>                << " because it has no tasks to run";
>
>   _shutdownExecutor(framework, executor);
>
>   return;
> }
> {code}
> This happens with the following sequence of events:
> 1. HTTP executor subscribes
> 2. Agent sends a LAUNCH message that the executor can't decode
> 3. HTTP executor closes the channel and re-subscribes
> 4. Agent hits the above check because the executor sends an empty task list 
> (it never understood the LAUNCH message), but the agent thinks that a task 
> should have been launched.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8313) Provide a host namespace container supervisor.

2017-12-07 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282757#comment-16282757
 ] 

James Peach commented on MESOS-8313:


{quote}
The other drawback is that we created another nanny process in addition to the 
one that'll perform pid 1 reaping.
{quote}

Right. Currently, the supervisor is optional and inside the container. In this 
proposal, there would always be a supervisor outside the container, though I 
think that the one inside the container would remain optional.

> Provide a host namespace container supervisor.
> --
>
> Key: MESOS-8313
> URL: https://issues.apache.org/jira/browse/MESOS-8313
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
> Attachments: IMG_2629.JPG
>
>
> After more investigation into user namespaces, the current implementation of 
> creating the container namespaces needs some adjustment before we can 
> implement user namespaces in a usable fashion.
> The problems we need to address are:
> 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace 
> to mount {{procfs}}. Currently, this prevents containers from joining the 
> host PID namespace. The workaround is to always create a new container PID 
> namespace (as a child of the user namespace) with the {{namespaces/pid}} 
> isolator.
> 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network 
> namespace to mount {{sysfs}}. There's no general workaround for this since we 
> can't generally require containers to not join the host network namespace.
> 3. The containerizer can't enter a user namespace after entering the 
> {{chroot}}. This restriction makes it impossible to retain the existing order 
> of containerizer operations in the case where we want the executor to be in 
> a new user namespace that has no children (i.e. to protect the container 
> from a privileged task).
> After some discussion with [~jieyu], we believe that we can solve most or all 
> of these issues by creating a new containerizer supervisor that runs fully 
> outside the container and is responsible for constructing the rootfs mount 
> namespace, launching the containerizer to enter the rest of the container, 
> and waiting on the entered process.
> Since this new supervisor process is not running in the user namespace, it 
> will be able to construct the container rootfs in a new mount namespace 
> without user namespace restrictions. We can then clone a child to fully 
> create and enter container namespaces along with the prefabricated rootfs 
> mount namespace.
> The only drawback to this approach is that the container's mount namespace 
> will be owned by the root user namespace rather than the container user 
> namespace. We are OK with this for now.
> The plan here is to retain the existing {{mesos-containerizer launch}} 
> subcommand and add a new {{mesos-containerizer supervise}} subcommand, which 
> will be its parent process. This new subcommand will be used for the default 
> executor and custom executor code paths.
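
The resulting process tree would look roughly like this (a sketch; only the 
{{supervise}}/{{launch}} split is part of the plan above, the annotations are 
my reading of it):

{noformat}
mesos-agent
 └─ mesos-containerizer supervise   # outside the container; prepares the
                                    # rootfs mount namespace, waits on child
     └─ mesos-containerizer launch  # creates/enters remaining namespaces
         └─ executor (and tasks)
{noformat}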



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8142) Improve container security with user namespaces.

2017-12-07 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8142:
---
Summary: Improve container security with user namespaces.  (was: Improve 
container security with user namespaces)

> Improve container security with user namespaces.
> 
>
> Key: MESOS-8142
> URL: https://issues.apache.org/jira/browse/MESOS-8142
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, security
>Reporter: James Peach
>Assignee: James Peach
>
> As a first pass at supporting user namespaces, figure out how we can use them 
> to improve container security when running untrusted tasks.
> This ticket is specifically targeting how to build a user namespace hierarchy 
> and excluding any sort of ID mapping for the container images.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

