[jira] [Assigned] (MESOS-7029) FaultToleranceTest.FrameworkReregister is flaky

2017-03-06 Thread Jay Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Guo reassigned MESOS-7029:
--

Assignee: Jay Guo

> FaultToleranceTest.FrameworkReregister is flaky
> ---
>
> Key: MESOS-7029
> URL: https://issues.apache.org/jira/browse/MESOS-7029
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, libevent/SSL enabled
>Reporter: Greg Mann
>Assignee: Jay Guo
>  Labels: flaky, flaky-test
> Attachments: FaultToleranceTest.FrameworkReregister.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/fault_tolerance_tests.cpp:903: Failure
> The difference between registerTime.secs() and
> framework.values["registered_time"].as<JSON::Number>().as<double>() is
> 1.0100052356719971, which exceeds 1, where
> registerTime.secs() evaluates to 1485732879.7673652,
> framework.values["registered_time"].as<JSON::Number>().as<double>() evaluates
> to 1485732878.75736, and
> 1 evaluates to 1.
> {code}
> Find the full log attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7029) FaultToleranceTest.FrameworkReregister is flaky

2017-03-06 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898840#comment-15898840
 ] 

Jay Guo commented on MESOS-7029:


RR: https://reviews.apache.org/r/57364/

> FaultToleranceTest.FrameworkReregister is flaky
> ---
>
> Key: MESOS-7029
> URL: https://issues.apache.org/jira/browse/MESOS-7029
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, libevent/SSL enabled
>Reporter: Greg Mann
>  Labels: flaky, flaky-test
> Attachments: FaultToleranceTest.FrameworkReregister.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/fault_tolerance_tests.cpp:903: Failure
> The difference between registerTime.secs() and
> framework.values["registered_time"].as<JSON::Number>().as<double>() is
> 1.0100052356719971, which exceeds 1, where
> registerTime.secs() evaluates to 1485732879.7673652,
> framework.values["registered_time"].as<JSON::Number>().as<double>() evaluates
> to 1485732878.75736, and
> 1 evaluates to 1.
> {code}
> Find the full log attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7029) FaultToleranceTest.FrameworkReregister is flaky

2017-03-06 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898835#comment-15898835
 ] 

Jay Guo commented on MESOS-7029:


[~neilc] I think it is due to the intentional delays here: 
https://github.com/apache/mesos/blob/master/src/tests/fault_tolerance_tests.cpp#L824-L826
where their sum may exceed 1 sec.
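A minimal sketch of one way to harden the assertion (an assumption on my part,
not necessarily what the linked review does); {{registerTime}} and
{{framework}} are the objects from the quoted failure:
{code}
// The test intentionally delays re-registration; with a bound of 1 second the
// injected delays alone can push the observed difference past the tolerance.
// Widening the bound removes the race while keeping the check meaningful.
EXPECT_NEAR(
    registerTime.secs(),
    framework.values["registered_time"].as<JSON::Number>().as<double>(),
    2);  // Previously 1.
{code}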

> FaultToleranceTest.FrameworkReregister is flaky
> ---
>
> Key: MESOS-7029
> URL: https://issues.apache.org/jira/browse/MESOS-7029
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, libevent/SSL enabled
>Reporter: Greg Mann
>  Labels: flaky, flaky-test
> Attachments: FaultToleranceTest.FrameworkReregister.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/fault_tolerance_tests.cpp:903: Failure
> The difference between registerTime.secs() and
> framework.values["registered_time"].as<JSON::Number>().as<double>() is
> 1.0100052356719971, which exceeds 1, where
> registerTime.secs() evaluates to 1485732879.7673652,
> framework.values["registered_time"].as<JSON::Number>().as<double>() evaluates
> to 1485732878.75736, and
> 1 evaluates to 1.
> {code}
> Find the full log attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7209) Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on windows

2017-03-06 Thread Karen Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898800#comment-15898800
 ] 

Karen Huang commented on MESOS-7209:


Hi Joseph,

The code in the cmake file "CompilationConfigure.cmake" is as below:
{code}
  ADD_CUSTOM_TARGET(
    ${ENSURE_TOOL_ARCH} ALL
    COMMAND
      IF NOT "%PreferredToolArchitecture%"=="x64" (
        echo "ERROR: Environment variable 'PreferredToolArchitecture' must be
        set to 'x64', see MESOS-6720 for details" 1>&2 && EXIT 1
      )
  )
{code}
But after we generated the project file ensure_tool_arch.vcxproj using cmake,
the generated file has no quotes around the variable
%PreferredToolArchitecture%. It seems that "%PreferredToolArchitecture%"=="x64"
is converted to %PreferredToolArchitecture%=="x64".

I tried to change the cmake file as below, and it works.
From:
{code}
  ADD_CUSTOM_TARGET(
    ${ENSURE_TOOL_ARCH} ALL
    COMMAND
      IF NOT "%PreferredToolArchitecture%"=="x64" (
        echo "ERROR: Environment variable 'PreferredToolArchitecture' must be
        set to 'x64', see MESOS-6720 for details" 1>&2 && EXIT 1
      )
  )
{code}
changed to:
{code}
  ADD_CUSTOM_TARGET(
    ${ENSURE_TOOL_ARCH} ALL
    COMMAND
      IF NOT "'%PreferredToolArchitecture%'"=='x64' (
        echo "ERROR: Environment variable 'PreferredToolArchitecture' must be
        set to 'x64', see MESOS-6720 for details" 1>&2 && EXIT 1
      )
  )
{code}



> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on 
> windows
> -
>
> Key: MESOS-7209
> URL: https://issues.apache.org/jira/browse/MESOS-7209
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10 (64bit) + VS2015 Update 3
>Reporter: Karen Huang
>
> I tried to build mesos with the Debug|x64 configuration on Windows. It failed to
> build due to error MSB6006: "cmd.exe" exited with code
> 255 [F:\mesos\build_x64\ensure_tool_arch.vcxproj]. This error is reported
> when building the ensure_tool_arch.vcxproj project.
> Here are the repro steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\mesos\src
> 2. Open a VS amd64 command prompt as admin and browse to F:\mesos\src
> 3. set PreferredToolArchitecture=x64
> 4. bootstrap.bat
> 5. mkdir build_x64 && pushd build_x64
> 6. cmake ..\src -G "Visual Studio 14 2015 Win64" -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin"
> 7. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /m /t:Rebuild
> Error message:
>  CustomBuild:
>  Building Custom Rule F:/mesos/src/CMakeLists.txt
>  CMake does not need to re-run because 
> F:\mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
>  ( was unexpected at this time.
> 43>C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 255. 
> [F:\mesos\build_x64\ensure_tool_arch.vcxproj]
> If you build the ensure_tool_arch.vcxproj project separately in the VS IDE, the
> error info is as below:
> 2>-- Rebuild All started: Project: ensure_tool_arch, Configuration: Debug 
> x64 --
> 2>  Building Custom Rule D:/Mesos/src/CMakeLists.txt
> 2>  CMake does not need to re-run because 
> D:\Mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
> 2>  ( was unexpected at this time.
> 2>C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 255.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7149) Support reservations for role subtrees

2017-03-06 Thread Jay Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Guo reassigned MESOS-7149:
--

Assignee: Jay Guo

> Support reservations for role subtrees
> --
>
> Key: MESOS-7149
> URL: https://issues.apache.org/jira/browse/MESOS-7149
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Neil Conway
>Assignee: Jay Guo
>  Labels: mesosphere
>
> When a reservation is made for a role path {{x}}, the reserved resource 
> should be offered to all frameworks registered in {{x}} _or any nested role 
> in the sub-tree under x_. For example, if a reservation is made for {{eng}}, 
> the reserved resource should be a candidate to appear in resource offers to 
> frameworks in any of the roles {{eng}}, {{eng/dev}}, and {{eng/prod}}.
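For illustration, a hedged, standalone sketch of the subtree check this
implies (not the Mesos implementation; the helper name is made up):
{code}
#include <string>

// A reservation for role `reserved` should be offerable to a framework in
// role `candidate` when `candidate` equals `reserved` or lies in the subtree
// under it ("eng/dev" is under "eng", but "engineering" is not).
static bool inSubtree(const std::string& reserved, const std::string& candidate)
{
  if (candidate == reserved) {
    return true;
  }

  return candidate.size() > reserved.size() &&
         candidate.compare(0, reserved.size(), reserved) == 0 &&
         candidate[reserved.size()] == '/';
}

// inSubtree("eng", "eng")         -> true
// inSubtree("eng", "eng/dev")     -> true
// inSubtree("eng", "engineering") -> false
{code}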



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run

2017-03-06 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898743#comment-15898743
 ] 

Michael Park edited comment on MESOS-7195 at 3/7/17 4:50 AM:
-

[~xujyan] Here's a small example that captures the limitation of variadic 
templates in this context:
{code}
struct S {
  void f(int) const {}
  void f(int, int) const {}
};

template <typename R, typename T, typename P, typename A>
void macro(R (T::*)(P) const, const T&, A) {}

template <typename R, typename T,
          typename P1, typename P2, typename A1, typename A2>
void macro(R (T::*)(P1, P2) const, const T&, A1, A2) {}

template <typename R, typename T, typename... Ps, typename... As>
void variadic(R (T::*)(Ps...) const, const T&, As...) {}

int main() {
  S s;
  macro(&S::f, s, 42);  // selects `void S::f(int)`
  macro(&S::f, s, 101, 202);  // selects `void S::f(int, int)`
  // variadic(&S::f, s, 42);  // error.
  // variadic(&S::f, s, 101, 202);  // error.
}
{code}

We have situations where there are overloaded member functions, and we happen 
to use
the # of arguments provided to narrow down the # of parameters we need to match.

The same trick doesn't work for variadic templates since the parameters and 
arguments are
both free-form. As far as I know, there's no way to express the same with 
variadic templates.

The macro form, of course, isn't good enough anyway, since it wouldn't work if 
{{f}} were to be
overloaded with different types and the same # of parameters, but we haven't 
run into that just yet.

By API changes, I mean that to make {{variadic}} work, we'll need to require 
the user to
pass something like:
{code}
variadic([](const S& s, auto... args) { s.f(args...); }, s, 101, 202);
{code}

This is not as generic as it needs to be, since it'll only call {{const}} 
functions.
To get the cv/ref qualifiers correct, we'd have to provide the proper overloads,
and maybe try to "hide" it with a macro... but it gets ugly...
{code}
variadic(MEM_FN(S, f), s, 101, 202);
{code}

where {{MEM_FN}} produces an overloaded function object.

Here's a rough sketch of how this could look: 
http://melpon.org/wandbox/permlink/BO8mf7r0CVr3akbu
Note that the sketch is written in C++14.
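For illustration, a minimal C++14 sketch (an assumption on my part, not the
wandbox sketch verbatim) of a {{MEM_FN}}-style macro producing such an
overloaded function object, with a {{variadic()}} that takes a callable rather
than a member-function pointer:
{code}
#include <utility>

// Hypothetical MEM_FN: wraps the overloaded member function `f` of `T` in a
// generic lambda, so overload resolution happens at the call site instead of
// when a member-function pointer is formed. (The type argument is unused in
// this sketch; it is kept only to match the `MEM_FN(S, f)` call form above.)
#define MEM_FN(T, f)                                        \
  [](auto&& self, auto&&... args) -> decltype(auto) {       \
    return std::forward<decltype(self)>(self).f(            \
        std::forward<decltype(args)>(args)...);             \
  }

// A variadic() that accepts any callable instead of a member-function pointer.
template <typename F, typename T, typename... As>
void variadic(F&& f, T&& t, As&&... as) {
  std::forward<F>(f)(std::forward<T>(t), std::forward<As>(as)...);
}

int main() {
  struct S {
    void f(int) const {}
    void f(int, int) const {}
  };

  S s;
  variadic(MEM_FN(S, f), s, 101, 202);  // Resolves to S::f(int, int).
}
{code}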


was (Author: mcypark):
[~xujyan] Here's a small example that captures the limitation of variadic 
templates in this context:
{code}
#include 

struct S {
  void f(int) {}
  void f(int, int) {}
};

template <typename R, typename T, typename P, typename A>
void macro(R (T::*)(P), A) {}

template <typename R, typename T,
          typename P1, typename P2, typename A1, typename A2>
void macro(R (T::*)(P1, P2), A1, A2) {}

template <typename R, typename T, typename... Ps, typename... As>
void variadic(R (T::*)(Ps...), As...) {}

int main() {
  macro(&S::f, 42);  // selects `void S::f(int)`
  macro(&S::f, 101, 202);  // selects `void S::f(int, int)`
  // variadic(&S::f, 42);  // error.
  // variadic(&S::f, 101, 202);  // error.
}
{code}

We have situations where there are overloaded member functions, and we happen 
to use
the # of arguments provided to narrow down the # of parameters we need to match.

The same trick doesn't work for variadic templates since the parameters and 
arguments are
both free-form. As far as I know, there's no way to express the same with 
variadic templates.

The macro form, of course, isn't good enough anyway, since it wouldn't work if 
{{f}} were to be
overloaded with different types and the same # of parameters, but we haven't 
run into that just yet.

By API changes, I mean that to make {{variadic}} work, we'll need to require 
the user to
pass something like:
{code}
variadic([](const S& s, auto... args) { s.f(args...); }, 101, 202);
{code}

This is not as generic as it needs to be, since it'll only call {{const}} 
functions.
To get the cv/ref qualifiers correct, we'd have to provide the proper overloads,
and maybe try to "hide" it with a macro... but it gets ugly...
{code}
variadic(MEM_FN(S, f), 101, 202);
{code}

where {{MEM_FN}} produces an overloaded function object.

Here's a rough sketch of how this could look: 
http://melpon.org/wandbox/permlink/BO8mf7r0CVr3akbu
Note that the sketch is written in C++14.

> Use C++11 variadic templates for process::dispatch/defer/delay/async/run
> 
>
> Key: MESOS-7195
> URL: https://issues.apache.org/jira/browse/MESOS-7195
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Yan Xu
>
> These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e., 
> {{BOOST_PP_REPEAT_FROM_TO}}):
> {code:title=}
> REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9.
> {code}
> This means we have to bump up the number of repetition whenever we have a new 
> method with more args.
> Seems like we can replace this with C++11 variadic templates.
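For illustration, a hedged, self-contained contrast between the repetition
style and a variadic template (simplified stand-ins, not the actual libprocess
definitions):
{code}
#include <iostream>

// Repetition style (simplified): one overload per arity has to be written out
// (or generated with BOOST_PP_REPEAT_FROM_TO), and supporting one more
// argument means bumping the repetition count.
template <typename A0>
void run(void (*f)(A0), A0 a0) { f(a0); }

template <typename A0, typename A1>
void run(void (*f)(A0, A1), A0 a0, A1 a1) { f(a0, a1); }

// Variadic-template style: a single definition covers every arity.
template <typename... As>
void runVariadic(void (*f)(As...), As... as) { f(as...); }

void print(int i, int j) { std::cout << i + j << std::endl; }

int main() {
  run(&print, 1, 2);          // Uses the hand-written two-argument overload.
  runVariadic(&print, 3, 4);  // Same behavior from the single variadic form.
}
{code}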



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run

2017-03-06 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898743#comment-15898743
 ] 

Michael Park commented on MESOS-7195:
-

[~xujyan] Here's a small example that captures the limitation of variadic 
templates in this context:
{code}
#include 

struct S {
  void f(int) {}
  void f(int, int) {}
};

template <typename R, typename T, typename P, typename A>
void macro(R (T::*)(P), A) {}

template <typename R, typename T,
          typename P1, typename P2, typename A1, typename A2>
void macro(R (T::*)(P1, P2), A1, A2) {}

template <typename R, typename T, typename... Ps, typename... As>
void variadic(R (T::*)(Ps...), As...) {}

int main() {
  macro(&S::f, 42);  // selects `void S::f(int)`
  macro(&S::f, 101, 202);  // selects `void S::f(int, int)`
  // variadic(&S::f, 42);  // error.
  // variadic(&S::f, 101, 202);  // error.
}
{code}

We have situations where there are overloaded member functions, and we happen 
to use
the # of arguments provided to narrow down the # of parameters we need to match.

The same trick doesn't work for variadic templates since the parameters and 
arguments are
both free-form. As far as I know, there's no way to express the same with 
variadic templates.

The macro form, of course, isn't good enough anyway, since it wouldn't work if 
{{f}} were to be
overloaded with different types and the same # of parameters, but we haven't 
run into that just yet.

By API changes, I mean that to make {{variadic}} work, we'll need to require 
the user to
pass something like:
{code}
variadic([](const S& s, auto... args) { s.f(args...); }, 101, 202);
{code}

This is not as generic as it needs to be, since it'll only call {{const}} 
functions.
To get the cv/ref qualifiers correct, we'd have to provide the proper overloads,
and maybe try to "hide" it with a macro... but it gets ugly...
{code}
variadic(MEM_FN(S, f), 101, 202);
{code}

where {{MEM_FN}} produces an overloaded function object.

Here's a rough sketch of how this could look: 
http://melpon.org/wandbox/permlink/BO8mf7r0CVr3akbu
Note that the sketch is written in C++14.

> Use C++11 variadic templates for process::dispatch/defer/delay/async/run
> 
>
> Key: MESOS-7195
> URL: https://issues.apache.org/jira/browse/MESOS-7195
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Yan Xu
>
> These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e., 
> {{BOOST_PP_REPEAT_FROM_TO}}):
> {code:title=}
> REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9.
> {code}
> This means we have to bump up the number of repetition whenever we have a new 
> method with more args.
> Seems like we can replace this with C++11 variadic templates.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898721#comment-15898721
 ] 

Vinod Kone commented on MESOS-7215:
---

Not sure if [~neilc] has cycles. [~xujyan] is this something you can take up?

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898716#comment-15898716
 ] 

Avinash Sridharan edited comment on MESOS-7215 at 3/7/17 4:16 AM:
--

[~vi...@twitter.com] whom should this ticket be assigned to? [~neilc]


was (Author: avin...@mesosphere.io):
[~vi...@twitter.com] whom should this ticket be assigned to?

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898716#comment-15898716
 ] 

Avinash Sridharan commented on MESOS-7215:
--

[~vi...@twitter.com] whom should this ticket be assigned to?

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-03-06 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898712#comment-15898712
 ] 

Avinash Sridharan edited comment on MESOS-7210 at 3/7/17 4:13 AM:
--

[~alexr] ^^ [~gkleiman]


was (Author: avin...@mesosphere.io):
[~alexr] ^^ @gaston kleiman

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Gastón Kleiman
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and example marathon job, that use MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> Looks like the docker_mesos_image option means that the newly started mesos job is
> not using the "pid host" option the mother container was started with, but has its
> own PID namespace (so it doesn't matter whether the mother container was started
> with "pid host" or not, it will never be able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-03-06 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan reassigned MESOS-7210:


Assignee: Gastón Kleiman

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Gastón Kleiman
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and example marathon job, that use MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> Looks like the docker_mesos_image option means that the newly started mesos job is
> not using the "pid host" option the mother container was started with, but has its
> own PID namespace (so it doesn't matter whether the mother container was started
> with "pid host" or not, it will never be able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-03-06 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898712#comment-15898712
 ] 

Avinash Sridharan edited comment on MESOS-7210 at 3/7/17 4:13 AM:
--

[~alexr] ^^ @gaston kleiman


was (Author: avin...@mesosphere.io):
[~alexr] ^^

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Gastón Kleiman
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and example marathon job, that use MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> Looks like the docker_mesos_image option means that the newly started mesos job is
> not using the "pid host" option the mother container was started with, but has its
> own PID namespace (so it doesn't matter whether the mother container was started
> with "pid host" or not, it will never be able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-03-06 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898712#comment-15898712
 ] 

Avinash Sridharan commented on MESOS-7210:
--

[~alexr] ^^

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and example marathon job, that use MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> Looks like the docker_mesos_image option means that the newly started mesos job is
> not using the "pid host" option the mother container was started with, but has its
> own PID namespace (so it doesn't matter whether the mother container was started
> with "pid host" or not, it will never be able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-6480) Support for docker live-restore option in Mesos

2017-03-06 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898649#comment-15898649
 ] 

haosdent edited comment on MESOS-6480 at 3/7/17 2:59 AM:
-

After checking, all docker commands fail when {{--live-restore}} is used and
{{service docker stop}} is run, including {{docker logs}}, no matter which
log-driver we use. After chatting with [~jieyu], the possible ways to resolve
this are:

1. 
* {{docker run -d}} to start the program.
* {{docker logs --since xxx --follow}} to read the log.
* If {{docker logs}} fails, check whether {{/proc/$taskPid}} exists; if the
task process still exists, keep retrying {{docker logs}} until
{{/proc/$taskPid}} disappears or {{docker logs}} succeeds again (a sketch
follows below).

The problem with this approach is that it is a bit tricky to find the right
timestamp parameter for {{docker logs --since}}, and some logs may be missed.

2. 
* Read {{/run/docker/libcontainerd/$container_id/init-stdout}} and
{{/run/docker/libcontainerd/$container_id/init-stderr}} directly. This is
tricky as well, because it depends on the implementation of docker across
different versions. It also doesn't allow multiple consumers, which means that
if we read these files directly, other consumers of {{docker logs}} would not
see the log we got from them.

In short, I think we don't have a perfect solution for this problem unless we
accept some log loss.
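A rough sketch of the retry loop in approach 1 (hedged: {{followLogs}} and the
plain POSIX/std calls are illustrative, not an existing Mesos API):
{code}
#include <sys/stat.h>
#include <unistd.h>

#include <cstdlib>
#include <string>

// Returns true while /proc/<pid> still exists, i.e. the task process is alive.
static bool taskAlive(pid_t pid)
{
  struct stat s;
  return ::stat(("/proc/" + std::to_string(pid)).c_str(), &s) == 0;
}

// Keep re-attaching `docker logs` while the task process is still alive.
static void followLogs(
    const std::string& containerId,
    const std::string& since,  // The tricky --since value noted above.
    pid_t taskPid)
{
  const std::string command =
    "docker logs --since " + since + " --follow " + containerId;

  while (true) {
    int status = std::system(command.c_str());

    if (status == 0) {
      return;  // `docker logs` exited normally, e.g. the container exited.
    }

    // `docker logs` failed, e.g. because the daemon is stopped under
    // --live-restore. Give up once the task process is gone; otherwise back
    // off and retry (some lines may be duplicated or missed, as noted above).
    if (!taskAlive(taskPid)) {
      return;
    }

    ::sleep(1);
  }
}
{code}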


was (Author: haosd...@gmail.com):
After checking, all docker commands fail when {{--live-restore}} is used and
{{service docker stop}} is run, including {{docker logs}}, no matter which
log-driver we use. After chatting with Jie Yu, the possible ways to resolve
this are:

1. 
* {{docker run -d}} to start the program.
* {{docker logs --since xxx --follow}} to read the log.
* If {{docker logs}} fails, check whether {{/proc/$taskPid}} exists; if the
task process still exists, keep retrying {{docker logs}} until
{{/proc/$taskPid}} disappears or {{docker logs}} succeeds again.

The problem with this approach is that it is a bit tricky to find the right
timestamp parameter for {{docker logs --since}}, and some logs may be missed.

2. 
* Read {{/run/docker/libcontainerd/$container_id/init-stdout}} and
{{/run/docker/libcontainerd/$container_id/init-stderr}} directly. This is
tricky as well, because it depends on the implementation of docker across
different versions. It also doesn't allow multiple consumers, which means that
if we read these files directly, other consumers of {{docker logs}} would not
see the log we got from them.

In short, I think we don't have a perfect solution for this problem unless we
accept some log loss.

> Support for docker live-restore option in Mesos
> ---
>
> Key: MESOS-6480
> URL: https://issues.apache.org/jira/browse/MESOS-6480
> Project: Mesos
>  Issue Type: Task
>Reporter: Milind Chawre
>
> Docker-1.12 supports live-restore option which keeps containers alive during 
> docker daemon downtime https://docs.docker.com/engine/admin/live-restore/
> I tried to use this option in my Mesos setup and observed this:
> 1. On the mesos worker node, stop the docker daemon.
> 2. After some time, start the docker daemon. All the containers running on
> that node are still visible using "docker ps". This is the expected behaviour
> of the live-restore option.
> 3. When I check the mesos and marathon UIs, they show no active tasks running
> on that node. The containers which are still running on that node are now
> scheduled on different mesos nodes, which is not right since I can see the
> containers in the "docker ps" output because of the live-restore option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6480) Support for docker live-restore option in Mesos

2017-03-06 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898649#comment-15898649
 ] 

haosdent commented on MESOS-6480:
-

After checking, all docker commands fail when {{--live-restore}} is used and
{{service docker stop}} is run, including {{docker logs}}, no matter which
log-driver we use. After chatting with Jie Yu, the possible ways to resolve
this are:

1. 
* {{docker run -d}} to start the program.
* {{docker logs --since xxx --follow}} to read the log.
* If {{docker logs}} fails, check whether {{/proc/$taskPid}} exists; if the
task process still exists, keep retrying {{docker logs}} until
{{/proc/$taskPid}} disappears or {{docker logs}} succeeds again.

The problem with this approach is that it is a bit tricky to find the right
timestamp parameter for {{docker logs --since}}, and some logs may be missed.

2. 
* Read {{/run/docker/libcontainerd/$container_id/init-stdout}} and
{{/run/docker/libcontainerd/$container_id/init-stderr}} directly. This is
tricky as well, because it depends on the implementation of docker across
different versions. It also doesn't allow multiple consumers, which means that
if we read these files directly, other consumers of {{docker logs}} would not
see the log we got from them.

In short, I think we don't have a perfect solution for this problem unless we
accept some log loss.

> Support for docker live-restore option in Mesos
> ---
>
> Key: MESOS-6480
> URL: https://issues.apache.org/jira/browse/MESOS-6480
> Project: Mesos
>  Issue Type: Task
>Reporter: Milind Chawre
>
> Docker-1.12 supports live-restore option which keeps containers alive during 
> docker daemon downtime https://docs.docker.com/engine/admin/live-restore/
> I tried to use this option in my Mesos setup and observed this:
> 1. On the mesos worker node, stop the docker daemon.
> 2. After some time, start the docker daemon. All the containers running on
> that node are still visible using "docker ps". This is the expected behaviour
> of the live-restore option.
> 3. When I check the mesos and marathon UIs, they show no active tasks running
> on that node. The containers which are still running on that node are now
> scheduled on different mesos nodes, which is not right since I can see the
> containers in the "docker ps" output because of the live-restore option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-03-06 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-6919:


Assignee: Joseph Wu

> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Joseph Wu
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself 
> down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-03-06 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898484#comment-15898484
 ] 

Joseph Wu commented on MESOS-6919:
--

Looks like this affects Unix sockets too:
{code}
  while (true) {
    Try<unix::Socket> create = unix::Socket::create();
    ASSERT_SOME(create);

    Try<unix::Address> address =
      unix::Address::create(os::mkdtemp().get() + "/a");
    ASSERT_SOME(address);

    Try<unix::Address> bind = create->bind(address.get());
    ASSERT_SOME(bind);

    Try<Nothing> listen = create->listen(10);
    ASSERT_SOME(listen);

create->accept().discard();
  }
{code}
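For reference, a hedged helper one could drop into such a loop to watch the
descriptor count grow (Linux-only, assuming /proc is available; not part of the
repro above):
{code}
#include <dirent.h>

// Counts the entries in /proc/self/fd, i.e. the process's open file
// descriptors (Linux-only). Handy for asserting that the loop above leaks one
// descriptor per iteration.
static int openFdCount()
{
  DIR* dir = ::opendir("/proc/self/fd");
  if (dir == nullptr) {
    return -1;
  }

  int count = 0;
  while (::readdir(dir) != nullptr) {
    count++;
  }

  ::closedir(dir);

  // Subtract ".", ".." and the descriptor used by opendir() itself.
  return count - 3;
}
{code}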

> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself 
> down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-03-06 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898414#comment-15898414
 ] 

Joseph Wu edited comment on MESOS-6919 at 3/7/17 12:02 AM:
---

This leak is not strictly limited to the reinitialization logic.  Here is an 
even smaller repro (assuming libprocess is started with SSL):
{code}
  while (true) {
    Try<Socket> create = Socket::create();
    ASSERT_SOME(create);

    Socket* __s__ = new Socket(create.get());

    Try<Address> bind = __s__->bind(Address::ANY_ANY());
    ASSERT_SOME(bind);

    Try<Nothing> listen = __s__->listen(10);
    ASSERT_SOME(listen);

__s__->accept().discard();

delete __s__;
__s__ = nullptr;
  }
{code}


was (Author: kaysoky):
This leak is not strictly limited to the reinitialization logic.  Here is an 
even smaller repro (assuming libprocess is started with SSL):
{code}
  while (true) {
    Try<Socket> create = Socket::create();
    ASSERT_SOME(create);

    Socket* __s__ = new Socket(create.get());

    std::cout << "Test socket == " << __s__->get() << std::endl;

    Try<Address> bind = __s__->bind(Address::ANY_ANY());
    ASSERT_SOME(bind);

    Try<Nothing> listen = __s__->listen(10);
    ASSERT_SOME(listen);

__s__->accept().discard();

delete __s__;
__s__ = nullptr;
  }
{code}

> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself 
> down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-03-06 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898414#comment-15898414
 ] 

Joseph Wu commented on MESOS-6919:
--

This leak is not strictly limited to the reinitialization logic.  Here is an 
even smaller repro (assuming libprocess is started with SSL):
{code}
  while (true) {
    Try<Socket> create = Socket::create();
    ASSERT_SOME(create);

    Socket* __s__ = new Socket(create.get());

    std::cout << "Test socket == " << __s__->get() << std::endl;

    Try<Address> bind = __s__->bind(Address::ANY_ANY());
    ASSERT_SOME(bind);

    Try<Nothing> listen = __s__->listen(10);
    ASSERT_SOME(listen);

__s__->accept().discard();

delete __s__;
__s__ = nullptr;
  }
{code}

> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself 
> down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run

2017-03-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898378#comment-15898378
 ] 

Yan Xu commented on MESOS-7195:
---

[~mcypark] I am thinking of investigating this. Just would like to solicit some 
feedback first to help me get started.

In the slack channel you mentioned:
{quote}
it’ll be some work to implement the variadic template versions
because it’ll involve some API changes
{quote}

Could you elaborate a bit further?

> Use C++11 variadic templates for process::dispatch/defer/delay/async/run
> 
>
> Key: MESOS-7195
> URL: https://issues.apache.org/jira/browse/MESOS-7195
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Yan Xu
>
> These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e., 
> {{BOOST_PP_REPEAT_FROM_TO}}):
> {code:title=}
> REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9.
> {code}
> This means we have to bump up the number of repetition whenever we have a new 
> method with more args.
> Seems like we can replace this with C++11 variadic templates.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run

2017-03-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898378#comment-15898378
 ] 

Yan Xu edited comment on MESOS-7195 at 3/6/17 11:30 PM:


[~mcypark] I am thinking of investigating this. Just would like to solicit some 
feedback first to help me get started.

In the slack channel you mentioned:
{quote}
it’ll be some work to implement the variadic template versions
because it’ll involve some API changes
{quote}

Could you elaborate a bit further (plus other suggestions)?


was (Author: xujyan):
[~mcypark] I am thinking of investigating this. Just would like to solicit some 
feedback first to help me get started.

In the slack channel you mentioned:
{quote}
it’ll be some work to implement the variadic template versions
because it’ll involve some API changes
{quote}

Could you elaborate a bit further?

> Use C++11 variadic templates for process::dispatch/defer/delay/async/run
> 
>
> Key: MESOS-7195
> URL: https://issues.apache.org/jira/browse/MESOS-7195
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Yan Xu
>
> These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e., 
> {{BOOST_PP_REPEAT_FROM_TO}}):
> {code:title=}
> REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9.
> {code}
> This means we have to bump up the number of repetition whenever we have a new 
> method with more args.
> Seems like we can replace this with C++11 variadic templates.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.

2017-03-06 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7214:
--
Issue Type: Improvement  (was: Bug)

+1 to fix this. Will be happy to shepherd.

> StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.
> --
>
> Key: MESOS-7214
> URL: https://issues.apache.org/jira/browse/MESOS-7214
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>
> Therefore resume() gets called repeatedly for each {{UpdateFrameworkMessage}} 
> and all status updates for ALL frameworks are resent unnecessarily.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7215:
--
Priority: Critical  (was: Major)

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898358#comment-15898358
 ] 

Vinod Kone commented on MESOS-7215:
---

Interesting.

I guess we never explicitly called out that `ShutdownFrameworkMessage` should
only be sent when a framework is being torn down. But I'm surprised to hear that,
as a consequence of the recent changes, the task stays in STAGING forever. I'm
assuming this is because the agent doesn't send a TASK_DROPPED status update since
it thinks the framework is shutting down.

Sending a `KillTaskMessage` instead of `ShutdownFrameworkMessage` sounds good 
to me.
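
A rough sketch of that distinction; the types and the {{reconcile()}} hook below are hypothetical simplifications for illustration, not the actual master code:

{code}
// Hypothetical, simplified sketch -- not the actual Mesos master code.
#include <iostream>
#include <string>
#include <vector>

struct Task { std::string taskId; };  // Simplified stand-in.

struct Agent  // Stand-in for the master's connection to an agent.
{
  void shutdownFramework(const std::string& frameworkId)
  {
    std::cout << "ShutdownFrameworkMessage for " << frameworkId << std::endl;
  }

  void killTask(const std::string& frameworkId, const std::string& taskId)
  {
    std::cout << "KillTaskMessage for " << frameworkId << "/" << taskId
              << std::endl;
  }
};

// Hypothetical hook: a previously removed agent reregisters with tasks of a
// non-partition-aware framework.
void reconcile(Agent& agent,
               const std::string& frameworkId,
               const std::vector<Task>& tasks,
               bool frameworkStillRegistered)
{
  if (!frameworkStillRegistered) {
    // The framework really was torn down; shutting it down is correct.
    agent.shutdownFramework(frameworkId);
    return;
  }

  // The framework is still registered: kill only its stale tasks, so the
  // agent does not wrongly conclude that the framework is going away.
  for (const Task& task : tasks) {
    agent.killTask(frameworkId, task.taskId);
  }
}

int main()
{
  Agent agent;
  reconcile(agent, "framework-1", {{"task-1"}, {"task-2"}}, true);
  return 0;
}
{code}

This keeps `ShutdownFrameworkMessage` reserved for frameworks that were actually torn down.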

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> Prior to the partition-awareness work (MESOS-5344), upon reregistration of an 
> agent that had been removed, the master only sent ShutdownFrameworkMessages to 
> the agent for frameworks that it knew had been torn down. 
> With the new logic in MESOS-5344, Mesos now sends 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered).
> This is problematic. An offer from this agent can still go to the same 
> framework, which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is of course not shutting down, so from the 
> master's and the scheduler's perspective the task is pending in STAGING 
> forever until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent assumes the framework is going away (and acts accordingly) when it's 
> not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898344#comment-15898344
 ] 

Yan Xu commented on MESOS-7215:
---

/cc [~neilc] [~vinodkone] 

Perhaps we should keep the logic that transitions the tasks to {{TASK_LOST}} on 
the master and have the master kill these tasks on the agent without sending 
{{ShutdownFrameworkMessage}}?

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> Prior to the partition-awareness work (MESOS-5344), upon reregistration of an 
> agent that had been removed, the master only sent ShutdownFrameworkMessages to 
> the agent for frameworks that it knew had been torn down. 
> With the new logic in MESOS-5344, Mesos now sends 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered).
> This is problematic. An offer from this agent can still go to the same 
> framework, which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is of course not shutting down, so from the 
> master's and the scheduler's perspective the task is pending in STAGING 
> forever until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent assumes the framework is going away (and acts accordingly) when it's 
> not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks

2017-03-06 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7215:
--
Summary: Master sends ShutdownFrameworkMessage for all non-partition-aware 
frameworks  (was: Master sends ShutdownFrameworkMessage for all partition-aware 
frameworks)

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> 
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> Prior to the partition-awareness work (MESOS-5344), upon reregistration of an 
> agent that had been removed, the master only sent ShutdownFrameworkMessages to 
> the agent for frameworks that it knew had been torn down. 
> With the new logic in MESOS-5344, Mesos now sends 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered).
> This is problematic. An offer from this agent can still go to the same 
> framework, which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is of course not shutting down, so from the 
> master's and the scheduler's perspective the task is pending in STAGING 
> forever until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent assumes the framework is going away (and acts accordingly) when it's 
> not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7215) Master sends ShutdownFrameworkMessage for all partition-aware frameworks

2017-03-06 Thread Yan Xu (JIRA)
Yan Xu created MESOS-7215:
-

 Summary: Master sends ShutdownFrameworkMessage for all 
partition-aware frameworks
 Key: MESOS-7215
 URL: https://issues.apache.org/jira/browse/MESOS-7215
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


Prior to the partition-awareness work (MESOS-5344), upon reregistration of an 
agent that had been removed, the master only sent ShutdownFrameworkMessages to 
the agent for frameworks that it knew had been torn down. 
With the new logic in MESOS-5344, Mesos now sends 
{{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
frameworks (including the ones that are still registered).

This is problematic. An offer from this agent can still go to the same 
framework, which can then launch new tasks. The agent then receives tasks of 
the same framework and ignores them because it thinks the framework is 
shutting down. The framework is of course not shutting down, so from the 
master's and the scheduler's perspective the task is pending in STAGING 
forever until the next agent reregistration, which could happen much later.

This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
agent assumes the framework is going away (and acts accordingly) when it's not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.

2017-03-06 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7214:
--
Description: Therefore resume() gets called repeatedly for each 
{{UpdateFrameworkMessage}} and all status updates for ALL frameworks are resent 
unnecessarily.  (was: Therefore when resume() gets called repeatedly it 
re-flushes all messages unnecessarily.)

> StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.
> --
>
> Key: MESOS-7214
> URL: https://issues.apache.org/jira/browse/MESOS-7214
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> Therefore resume() gets called repeatedly for each {{UpdateFrameworkMessage}} 
> and all status updates for ALL frameworks are resent unnecessarily.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.

2017-03-06 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7214:
--
Summary: StatusUpdateManagerProcess::resume() doesn't support resuming a 
single stream.  (was: StatusUpdateManagerProcess::resume() doesn')

> StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.
> --
>
> Key: MESOS-7214
> URL: https://issues.apache.org/jira/browse/MESOS-7214
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> Therefore when resume() gets called repeatedly it re-flushes all messages 
> unnecessarily.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn'

2017-03-06 Thread Yan Xu (JIRA)
Yan Xu created MESOS-7214:
-

 Summary: StatusUpdateManagerProcess::resume() doesn'
 Key: MESOS-7214
 URL: https://issues.apache.org/jira/browse/MESOS-7214
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


Therefore when resume() gets called repeatedly it re-flushes all messages 
unnecessarily.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5689) `PortMappingIsolatorTest.ROOT_ContainerICMPExternal` fails on Fedora 23.

2017-03-06 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898275#comment-15898275
 ] 

Till Toenshoff commented on MESOS-5689:
---

We need to clarify whether this is simply a test failure or an actual bug in 
conjunction with Fedora 23. I am still seeing this when testing 1.1.1-rc2.

> `PortMappingIsolatorTest.ROOT_ContainerICMPExternal` fails on Fedora 23.
> 
>
> Key: MESOS-5689
> URL: https://issues.apache.org/jira/browse/MESOS-5689
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network
> Environment: Fedora 23 with network isolation
>Reporter: Gilbert Song
>  Labels: isolation, mesosphere, networking, tests
>
> Here is the log:
> {noformat}
> [20:17:53] :   [Step 10/10] [ RUN  ] 
> PortMappingIsolatorTest.ROOT_ContainerICMPExternal
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.890225 28395 
> port_mapping_tests.cpp:229] Using eth0 as the public interface
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.890532 28395 
> port_mapping_tests.cpp:237] Using lo as the loopback interface
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.904742 28395 resources.cpp:572] 
> Parsing resources as JSON failed: 
> cpus:2;mem:1024;disk:1024;ephemeral_ports:[30001-30999];ports:[31000-32000]
> [20:17:53]W:   [Step 10/10] Trying semicolon-delimited string format instead
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.905855 28395 
> port_mapping.cpp:1557] Using eth0 as the public interface
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.906159 28395 
> port_mapping.cpp:1582] Using lo as the loopback interface
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907315 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/neigh/default/gc_thresh3 = '1024'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907362 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/neigh/default/gc_thresh1 = '128'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907418 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_wmem = '409616384   4194304'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907454 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_synack_retries = '5'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907491 28395 
> port_mapping.cpp:1869] /proc/sys/net/core/rmem_max = '212992'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907524 28395 
> port_mapping.cpp:1869] /proc/sys/net/core/somaxconn = '128'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907557 28395 
> port_mapping.cpp:1869] /proc/sys/net/core/wmem_max = '212992'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907588 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_rmem = '409687380   6291456'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907618 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_keepalive_time = '7200'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907649 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/neigh/default/gc_thresh2 = '512'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907680 28395 
> port_mapping.cpp:1869] /proc/sys/net/core/netdev_max_backlog = '1000'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907711 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_keepalive_intvl = '75'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907742 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_keepalive_probes = '9'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907773 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_max_syn_backlog = '512'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.907802 28395 
> port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_retries2 = '15'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.916348 28395 
> linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.916575 28395 resources.cpp:572] 
> Parsing resources as JSON failed: ports:[31000-31499]
> [20:17:53]W:   [Step 10/10] Trying semicolon-delimited string format instead
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.917032 28412 
> port_mapping.cpp:2512] Using non-ephemeral ports {[31000,31500)} and 
> ephemeral ports [30016,30032) for container container1 of executor ''
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.918092 28395 
> linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNS | 
> CLONE_NEWNET
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.951756 28410 
> port_mapping.cpp:2576] Bind mounted '/proc/15611/ns/net' to 
> '/run/netns/15611' for container container1
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.951918 28410 
> port_mapping.cpp:2607] Created network namespace handle symlink 
> '/var/run/mesos/netns/container1' -> '/run/netns/15611'
> [20:17:53]W:   [Step 10/10] I0622 20:17:53.952893 28410 
> port_mapping.cpp:2667] Adding IP packet filters with ports [30016,30031] for 
> container container1
> 

[jira] [Assigned] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-03-06 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6919:


Assignee: (was: Greg Mann)

> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself 
> down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky

2017-03-06 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897910#comment-15897910
 ] 

Greg Mann commented on MESOS-7082:
--

I just observed this failure again on our internal CI.

> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is 
> flaky
> 
>
> Key: MESOS-7082
> URL: https://issues.apache.org/jira/browse/MESOS-7082
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04 with/without SSL
>Reporter: Anand Mazumdar
>  Labels: flaky, flaky-test, mesosphere
>
> Showed up on our internal CI
> {noformat}
> 07:00:17 [ RUN  ] 
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> 07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 
> 'local' authorizer
> 07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
> fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started 
> on 10.153.254.29:38570
> 07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
> --zk_session_timeout="10secs"
> 07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> 07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
> authenticated agents to register
> 07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> 07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/ZROfJk/credentials'
> 07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
> authenticator
> 07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
> 07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> 07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
> given
> 07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
> master!
> 07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from 
> registrar
> 07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
> 07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
> the registry (0B) in 176384ns
> 07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations 
> in 7573ns; attempting to update the registry
> 07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
> the registry in 210944ns
> 07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully 
> recovered registrar
> 07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
> the registry (172B); allowing 10mins for agents to re-register
> 07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery 
> of hierarchical allocator: nothing to recover
> 07:00:17 I0207 07:00:17.869381  2952 containerizer.cpp:220] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> 07:00:17 I0207 07:00:17.872557  2952 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as 

[jira] [Issue Comment Deleted] (MESOS-6792) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky

2017-03-06 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-6792:
--
Comment: was deleted

(was: Seeing this fail on ubuntu-14.04 while testing 1.1.1-rc2 -- not a crash 
though!

{noformat}
../../src/tests/containerizer/cgroups_isolator_tests.cpp:438
Expected: (0.05) <= (cpuTime), actual: 0.05 vs 0.04
{noformat})

> MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
> -
>
> Key: MESOS-6792
> URL: https://issues.apache.org/jira/browse/MESOS-6792
> Project: Mesos
>  Issue Type: Bug
>  Components: technical debt, test
> Environment: Fedora 25, clang, w/ optimizations, SSL build
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> The test {{MasterSlaveReconciliationTest.ReconcileLostTask}} is flaky for me 
> as of {{e99ea9ce8b1de01dd8b3cac6675337edb6320f38}},
> {code}
> Repeating all tests (iteration 912) . . .
> Note: Google Test filter = 
> 

[jira] [Comment Edited] (MESOS-6792) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky

2017-03-06 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897894#comment-15897894
 ] 

Till Toenshoff edited comment on MESOS-6792 at 3/6/17 7:36 PM:
---

Seeing this fail on ubuntu-14.04 while testing 1.1.1-rc2 -- not a crash though!

{noformat}
../../src/tests/containerizer/cgroups_isolator_tests.cpp:438
Expected: (0.05) <= (cpuTime), actual: 0.05 vs 0.04
{noformat}


was (Author: tillt):
Seeing the same on ubuntu-14.04 while testing 1.1.1-rc2.

> MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
> -
>
> Key: MESOS-6792
> URL: https://issues.apache.org/jira/browse/MESOS-6792
> Project: Mesos
>  Issue Type: Bug
>  Components: technical debt, test
> Environment: Fedora 25, clang, w/ optimizations, SSL build
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> The test {{MasterSlaveReconciliationTest.ReconcileLostTask}} is flaky for me 
> as of {{e99ea9ce8b1de01dd8b3cac6675337edb6320f38}},
> {code}
> Repeating all tests (iteration 912) . . .
> Note: Google Test filter = 
> 

[jira] [Commented] (MESOS-6792) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky

2017-03-06 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897894#comment-15897894
 ] 

Till Toenshoff commented on MESOS-6792:
---

Seeing the same on ubuntu-14.04 while testing 1.1.1-rc2.

> MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
> -
>
> Key: MESOS-6792
> URL: https://issues.apache.org/jira/browse/MESOS-6792
> Project: Mesos
>  Issue Type: Bug
>  Components: technical debt, test
> Environment: Fedora 25, clang, w/ optimizations, SSL build
>Reporter: Benjamin Bannier
>  Labels: mesosphere
>
> The test {{MasterSlaveReconciliationTest.ReconcileLostTask}} is flaky for me 
> as of {{e99ea9ce8b1de01dd8b3cac6675337edb6320f38}},
> {code}
> Repeating all tests (iteration 912) . . .
> Note: Google Test filter = 
> 

[jira] [Created] (MESOS-7213) SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor fails.

2017-03-06 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-7213:
-

 Summary: SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor fails.
 Key: MESOS-7213
 URL: https://issues.apache.org/jira/browse/MESOS-7213
 Project: Mesos
  Issue Type: Bug
  Components: tests
Affects Versions: 1.1.1
 Environment: Debian 8, SSL/libevent build
Reporter: Till Toenshoff


The following happened while testing 1.1.1-rc2; may be flaky.

{noformat}
[ RUN  ] SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
I0306 14:16:42.640406 27141 cluster.cpp:158] Creating default 'local' authorizer
I0306 14:16:42.648387 27141 leveldb.cpp:174] Opened db in 7.851169ms
I0306 14:16:42.649245 27141 leveldb.cpp:181] Compacted db in 832265ns
I0306 14:16:42.649266 27141 leveldb.cpp:196] Created db iterator in 4269ns
I0306 14:16:42.649271 27141 leveldb.cpp:202] Seeked to beginning of db in 840ns
I0306 14:16:42.649276 27141 leveldb.cpp:271] Iterated through 0 keys in the db 
in 448ns
I0306 14:16:42.649291 27141 replica.cpp:776] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0306 14:16:42.649471 27163 recover.cpp:451] Starting replica recovery
I0306 14:16:42.649528 27166 recover.cpp:477] Replica is in EMPTY status
I0306 14:16:42.649864 27166 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(5541)@10.99.136.60:39312
I0306 14:16:42.649952 27160 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I0306 14:16:42.650060 27164 recover.cpp:568] Updating replica status to STARTING
I0306 14:16:42.650842 27160 master.cpp:380] Master 
81fb2ed1-6c17-4dfb-a44f-160cfde9741e (ip-10-99-136-60.ec2.internal) started on 
10.99.136.60:39312
I0306 14:16:42.650862 27160 master.cpp:382] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/syBZyN/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
--registry_max_agent_count="102400" --registry_store_timeout="100secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/syBZyN/master" --zk_session_timeout="10secs"
I0306 14:16:42.651005 27160 master.cpp:432] Master only allowing authenticated 
frameworks to register
I0306 14:16:42.651010 27160 master.cpp:446] Master only allowing authenticated 
agents to register
I0306 14:16:42.651012 27160 master.cpp:459] Master only allowing authenticated 
HTTP frameworks to register
I0306 14:16:42.651016 27160 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/syBZyN/credentials'
I0306 14:16:42.651876 27160 master.cpp:504] Using default 'crammd5' 
authenticator
I0306 14:16:42.651919 27160 http.cpp:887] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0306 14:16:42.652004 27160 http.cpp:887] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0306 14:16:42.652042 27160 http.cpp:887] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0306 14:16:42.652107 27160 master.cpp:584] Authorization enabled
I0306 14:16:42.652220 27161 whitelist_watcher.cpp:77] No whitelist given
I0306 14:16:42.652228 27165 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I0306 14:16:42.652447 27162 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 2.311958ms
I0306 14:16:42.652469 27162 replica.cpp:320] Persisted replica status to 
STARTING
I0306 14:16:42.652573 27162 recover.cpp:477] Replica is in STARTING status
I0306 14:16:42.652951 27162 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(5542)@10.99.136.60:39312
I0306 14:16:42.652984 27163 master.cpp:2017] Elected as the leading master!
I0306 14:16:42.652999 27163 master.cpp:1560] Recovering from registrar
I0306 14:16:42.653074 27164 registrar.cpp:329] Recovering registrar
I0306 14:16:42.653125 27165 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I0306 14:16:42.653259 27161 recover.cpp:568] Updating replica status to VOTING

[jira] [Commented] (MESOS-4736) DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes fails on CentOS 6

2017-03-06 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897765#comment-15897765
 ] 

Till Toenshoff commented on MESOS-4736:
---

[~kaysoky] do you have any update here?

> DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes fails on 
> CentOS 6
> -
>
> Key: MESOS-4736
> URL: https://issues.apache.org/jira/browse/MESOS-4736
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: Centos6 + GCC 4.9 on AWS
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: flaky, mesosphere, test
>
> This test passes consistently on other OS's, but fails consistently on CentOS 
> 6.
> Verbose logs from test failure:
> {code}
> [ RUN  ] DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes
> I0222 18:16:12.327957 26681 leveldb.cpp:174] Opened db in 7.466102ms
> I0222 18:16:12.330528 26681 leveldb.cpp:181] Compacted db in 2.540139ms
> I0222 18:16:12.330580 26681 leveldb.cpp:196] Created db iterator in 16908ns
> I0222 18:16:12.330592 26681 leveldb.cpp:202] Seeked to beginning of db in 
> 1403ns
> I0222 18:16:12.330600 26681 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 315ns
> I0222 18:16:12.330634 26681 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0222 18:16:12.331082 26698 recover.cpp:447] Starting replica recovery
> I0222 18:16:12.331289 26698 recover.cpp:473] Replica is in EMPTY status
> I0222 18:16:12.332162 26703 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (13761)@172.30.2.148:35274
> I0222 18:16:12.332701 26701 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0222 18:16:12.333230 26699 recover.cpp:564] Updating replica status to 
> STARTING
> I0222 18:16:12.334102 26698 master.cpp:376] Master 
> 652149b4-3932-4d8b-ba6f-8c9d9045be70 (ip-172-30-2-148.mesosphere.io) started 
> on 172.30.2.148:35274
> I0222 18:16:12.334116 26698 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/QEhLBS/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/QEhLBS/master" 
> --zk_session_timeout="10secs"
> I0222 18:16:12.334354 26698 master.cpp:423] Master only allowing 
> authenticated frameworks to register
> I0222 18:16:12.334363 26698 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I0222 18:16:12.334369 26698 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/QEhLBS/credentials'
> I0222 18:16:12.335366 26698 master.cpp:468] Using default 'crammd5' 
> authenticator
> I0222 18:16:12.335492 26698 master.cpp:537] Using default 'basic' HTTP 
> authenticator
> I0222 18:16:12.335623 26698 master.cpp:571] Authorization enabled
> I0222 18:16:12.335752 26703 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 2.314693ms
> I0222 18:16:12.335769 26700 whitelist_watcher.cpp:77] No whitelist given
> I0222 18:16:12.335778 26703 replica.cpp:320] Persisted replica status to 
> STARTING
> I0222 18:16:12.335821 26697 hierarchical.cpp:144] Initialized hierarchical 
> allocator process
> I0222 18:16:12.335965 26701 recover.cpp:473] Replica is in STARTING status
> I0222 18:16:12.336771 26703 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (13763)@172.30.2.148:35274
> I0222 18:16:12.337191 26696 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0222 18:16:12.337635 26700 recover.cpp:564] Updating replica status to VOTING
> I0222 18:16:12.337671 26703 master.cpp:1712] The newly elected leader is 
> master@172.30.2.148:35274 with id 652149b4-3932-4d8b-ba6f-8c9d9045be70
> I0222 18:16:12.337698 26703 master.cpp:1725] Elected as the leading master!
> I0222 18:16:12.337713 26703 master.cpp:1470] Recovering from registrar
> I0222 18:16:12.337828 26696 registrar.cpp:307] Recovering registrar
> I0222 

[jira] [Updated] (MESOS-7208) Persistent volume ownership is set to root when task is running with non-root user

2017-03-06 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7208:
--
Affects Version/s: 1.0.2

> Persistent volume ownership is set to root when task is running with non-root 
> user
> --
>
> Key: MESOS-7208
> URL: https://issues.apache.org/jira/browse/MESOS-7208
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
>
> I'm running a docker container with the universal containerizer, Mesos 1.1.0, 
> switch_user=true, isolator=filesystem/linux,docker/runtime. The container is 
> launched with Marathon, "user":"someappuser". I want to use a persistent 
> volume, but it's exposed to the container with root user permissions even 
> though its root folder is created with someappuser ownership (it looks like 
> Mesos does a chown on this folder). 
> Here are the logs for my container:
> {code}
> I0305 22:51:36.414655 10175 slave.cpp:1701] Launching task 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92-
> I0305 22:51:36.415118 10175 paths.cpp:536] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a'
>  to user 'root'
> I0305 22:51:36.422992 10175 slave.cpp:6179] Launching executor 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a'
> I0305 22:51:36.424278 10175 slave.cpp:1987] Queued task 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for executor 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92-
> I0305 22:51:36.424347 10158 docker.cpp:1000] Skipping non-docker container
> I0305 22:51:36.425639 10142 containerizer.cpp:938] Starting container 
> e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a for executor 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92-
> I0305 22:51:36.428725 10166 provisioner.cpp:294] Provisioning image rootfs 
> '/export/intssd/mesos-slave/workdir/provisioner/containers/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/backends/copy/rootfses/0e2181e9-1bf2-42d4-8cb0-ee70e466c3ae'
>  for container e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a
> I0305 22:51:42.981240 10149 linux.cpp:695] Changing the ownership of the 
> persistent volume at 
> '/export/intssd/mesos-slave/data/volumes/roles/general_marathon_service_role/md_hdfs_journal#data#23f813aa-01dd-11e7-a012-0242ce94d92a'
>  with uid 0 and gid 0
> I0305 22:51:42.986593 10136 linux_launcher.cpp:421] Launching container 
> e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a and cloning with namespaces CLONE_NEWNS
> {code}
> {code}
> ls -la 
> /export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/
> drwxr-xr-x 3 someappuser someappgroup   4096 22:51 .
> drwxr-xr-x 3 root root4096 22:51 ..
> drwxr-xr-x 2 root root4096 22:51 data
> -rw-r--r-- 1 root root 169 22:51 stderr
> -rw-r--r-- 1 root root  183012 23:00 stdout
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7208) Persistent volume ownership is set to root when task is running with non-root user

2017-03-06 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7208:
--
Target Version/s: 1.2.0

> Persistent volume ownership is set to root when task is running with non-root 
> user
> --
>
> Key: MESOS-7208
> URL: https://issues.apache.org/jira/browse/MESOS-7208
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
>
> I'm running a docker container with the universal containerizer, Mesos 1.1.0, 
> switch_user=true, isolator=filesystem/linux,docker/runtime. The container is 
> launched with Marathon, "user":"someappuser". I want to use a persistent 
> volume, but it's exposed to the container with root user permissions even 
> though its root folder is created with someappuser ownership (it looks like 
> Mesos does a chown on this folder). 
> Here are the logs for my container:
> {code}
> I0305 22:51:36.414655 10175 slave.cpp:1701] Launching task 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92-
> I0305 22:51:36.415118 10175 paths.cpp:536] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a'
>  to user 'root'
> I0305 22:51:36.422992 10175 slave.cpp:6179] Launching executor 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a'
> I0305 22:51:36.424278 10175 slave.cpp:1987] Queued task 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for executor 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92-
> I0305 22:51:36.424347 10158 docker.cpp:1000] Skipping non-docker container
> I0305 22:51:36.425639 10142 containerizer.cpp:938] Starting container 
> e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a for executor 
> 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework 
> e9d0e39e-b67d-4142-b95d-b0987998eb92-
> I0305 22:51:36.428725 10166 provisioner.cpp:294] Provisioning image rootfs 
> '/export/intssd/mesos-slave/workdir/provisioner/containers/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/backends/copy/rootfses/0e2181e9-1bf2-42d4-8cb0-ee70e466c3ae'
>  for container e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a
> I0305 22:51:42.981240 10149 linux.cpp:695] Changing the ownership of the 
> persistent volume at 
> '/export/intssd/mesos-slave/data/volumes/roles/general_marathon_service_role/md_hdfs_journal#data#23f813aa-01dd-11e7-a012-0242ce94d92a'
>  with uid 0 and gid 0
> I0305 22:51:42.986593 10136 linux_launcher.cpp:421] Launching container 
> e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a and cloning with namespaces CLONE_NEWNS
> {code}
> {code}
> ls -la 
> /export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/
> drwxr-xr-x 3 someappuser someappgroup   4096 22:51 .
> drwxr-xr-x 3 root root4096 22:51 ..
> drwxr-xr-x 2 root root4096 22:51 data
> -rw-r--r-- 1 root root 169 22:51 stderr
> -rw-r--r-- 1 root root  183012 23:00 stdout
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7211) Document SUPPRESS HTTP call

2017-03-06 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7211:
--
Labels: newbie  (was: )

> Document SUPPRESS HTTP call
> ---
>
> Key: MESOS-7211
> URL: https://issues.apache.org/jira/browse/MESOS-7211
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 1.1.0
>Reporter: Bruce Merry
>Priority: Minor
>  Labels: newbie
>
> The documentation at 
> http://mesos.apache.org/documentation/latest/scheduler-http-api/ doesn't list 
> the SUPPRESS call as one of the call types, but it does seem to be 
> implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7212) CommandInfo first argument is ignored

2017-03-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897516#comment-15897516
 ] 

Gastón Kleiman commented on MESOS-7212:
---

This behaviour is documented in {{mesos.proto}} 
(https://github.com/apache/mesos/blob/5ac6e156390717c34586e6e19fee4bc4cb6b01d5/include/mesos/mesos.proto#L621-L626):

{noformat}
  // 2) If 'shell == false', the command will be launched by passing
  //arguments to an executable. The 'value' specified will be
  //treated as the filename of the executable. The 'arguments'
  //will be treated as the arguments to the executable. This is
  //similar to how POSIX exec families launch processes (i.e.,
  //execlp(value, arguments(0), arguments(1), ...)).
{noformat}

The POSIX exec calls expect the first argument to be the name of the exec'd file.
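
A tiny standalone illustration of that convention (not Mesos code): the first argument passed after the executable path becomes {{argv[0]}}, which the program treats as its own name, so only the remaining arguments show up.

{code}
#include <unistd.h>  // execlp
#include <cstdio>    // perror

int main()
{
  // Mirrors the reporter's CommandInfo: value = "echo",
  // arguments = {"1", "2", "3"}. The first argument becomes argv[0]
  // (the program's own name), so echo prints only "2 3".
  execlp("echo", "1", "2", "3", static_cast<char*>(nullptr));

  std::perror("execlp");  // Reached only if the exec fails.
  return 1;
}
{code}

Passing the executable name (or any placeholder) as the first element of {{arguments}} avoids losing the first real argument.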

> CommandInfo first argument is ignored
> -
>
> Key: MESOS-7212
> URL: https://issues.apache.org/jira/browse/MESOS-7212
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: MacOS Sierra 10.12.2
>Reporter: Egor Ryashin
>
> The first argument of CommandInfo is ignored. For example, using:
> {code}
> CommandInfo commandInfo = CommandInfo.newBuilder() 
> .setShell(false) 
> .addArguments("1") 
> .addArguments("2") 
> .addArguments("3") 
> .setValue("echo")
> {code}
> I get the following in the sandbox stdout:
> {noformat}
> Starting task ta3e2-6234-4f8c-a609-e4b9064b4cf5
> /usr/local/Cellar/mesos/1.1.0/libexec/mesos/mesos-containerizer launch 
> --command="{"arguments":["1","2","3"],"shell":false,"value":"echo"}" 
> --help="false"
> Forked command at 95660
> 2 3
> Command exited with status 0 (pid: 95660)
>{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7212) CommandInfo first argument is ignored

2017-03-06 Thread Egor Ryashin (JIRA)
Egor Ryashin created MESOS-7212:
---

 Summary: CommandInfo first argument is ignored
 Key: MESOS-7212
 URL: https://issues.apache.org/jira/browse/MESOS-7212
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.1.0
 Environment: MacOS Sierra 10.12.2
Reporter: Egor Ryashin


The first argument of CommandInfo is ignored. For example, using:
{code}
CommandInfo commandInfo = CommandInfo.newBuilder() 
.setShell(false) 
.addArguments("1") 
.addArguments("2") 
.addArguments("3") 
.setValue("echo")
{code}
I get the following in the sandbox stdout:
{noformat}
Starting task ta3e2-6234-4f8c-a609-e4b9064b4cf5
/usr/local/Cellar/mesos/1.1.0/libexec/mesos/mesos-containerizer launch 
--command="{"arguments":["1","2","3"],"shell":false,"value":"echo"}" 
--help="false"
Forked command at 95660
2 3
Command exited with status 0 (pid: 95660)
   {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7095) Basic make check from getting started link fails

2017-03-06 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897364#comment-15897364
 ] 

Alexander Rukletsov commented on MESOS-7095:


The problem is likely with the {{apr}} package from {{brew}}. Here is what I 
see on my machine:
{noformat}
alex@alexr.local: /usr/local/Cellar/apr/1.5.2_3 $ lla
total 88
-rw-r--r--  1 alex  wheel   7.7K Apr 25  2015 CHANGES
-rw-r--r--  1 alex  staff   534B Feb  7 21:44 INSTALL_RECEIPT.json
-rw-r--r--  1 alex  wheel18K Apr 25  2015 LICENSE
-rw-r--r--  1 alex  wheel   527B Apr 25  2015 NOTICE
-rw-r--r--  1 alex  wheel   5.5K Apr 25  2015 README
drwxr-xr-x  3 alex  wheel   102B Apr 25  2015 bin/
drwxr-xr-x  6 alex  wheel   204B Apr 25  2015 libexec/

alex@alexr.local: /usr/local/Cellar/apr/1.5.2_3 $ lla libexec/include 
total 0
drwxr-xr-x  40 alex  wheel   1.3K Apr 25  2015 apr-1/
{noformat}
Hence the configure script cannot find {{apr}}'s includes. Try telling 
{{configure}} explicitly where {{apr}}'s includes are.

> Basic make check from getting started link fails
> 
>
> Key: MESOS-7095
> URL: https://issues.apache.org/jira/browse/MESOS-7095
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Alec Bruns
>
> {*** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are 
> using GNU date *** PC: @0x1080b7367 apr_pool_create_ex *** SIGSEGV 
> (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: ***} 
> \{@ 0x7fffb50c7bba _sigtramp 
> @\{ 0x72c0517 (unknown)\} 
> @0x107eaa13a svn_pool_create_ex 
> @0x107691d6e svn::diff() 
> @0x107691042 SVNTest_DiffPatch_Test::TestBody()
>  @0x1077026ba 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>() 
> @0x1076b3ad7 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
>  @0x1076b3985 testing::Test::Run() 
> @0x1076b54f8 testing::TestInfo::Run() 
> @0x1076b6867 testing::TestCase::Run() 
> @0x1076c65dc testing::internal::UnitTestImpl::RunAllTests() 
> @0x1077033da 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>() 
> @0x1076c6007 
> testing::internal::HandleExceptionsInMethodIfSupported<>() 
> @0x1076c5ed8 testing::UnitTest::Run() 
> @0x1074d55c1 RUN_ALL_TESTS() 
> @0x1074d5580 main 
> @ 0x7fffb4eba255 start 
> make[6]: *** [check-local] Segmentation fault: 11 
> make[5]: *** [check-am] Error 2 make[4]: *** [check-recursive] Error 1
>  make[3]: *** [check] Error 2 make[2]: *** [check-recursive] Error 1 
> make[1]: *** [check] Error 2 make: *** [check-recursive] Error 1
> make: *** [check-recursive] Error 1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7211) Document SUPPRESS HTTP call

2017-03-06 Thread Bruce Merry (JIRA)
Bruce Merry created MESOS-7211:
--

 Summary: Document SUPPRESS HTTP call
 Key: MESOS-7211
 URL: https://issues.apache.org/jira/browse/MESOS-7211
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Affects Versions: 1.1.0
Reporter: Bruce Merry
Priority: Minor


The documentation at 
http://mesos.apache.org/documentation/latest/scheduler-http-api/ doesn't list 
the SUPPRESS call as one of the call types, but it does seem to be implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-5824) Include disk source information in stringification

2017-03-06 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-5824:
--

Assignee: (was: Tim Harper)

> Include disk source information in stringification
> --
>
> Key: MESOS-5824
> URL: https://issues.apache.org/jira/browse/MESOS-5824
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Affects Versions: 0.28.2
>Reporter: Tim Harper
>Priority: Minor
>  Labels: mesosphere
>
> Some frameworks (like kafka_mesos) ignore the Source field when trying to 
> reserve an offered mount or path persistent volume; the resulting error 
> message is bewildering:
> {code:none}
> Task uses more resources
> cpus(*):4; mem(*):4096; ports(*):[31000-31000]; disk(kafka, 
> kafka)[kafka_0:data]:960679
> than available
> cpus(*):32; mem(*):256819;  ports(*):[31000-32000]; disk(kafka, 
> kafka)[kafka_0:data]:960679;   disk(*):240169;
> {code}
> The stringification of disk resources should include source information.
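
For illustration, a minimal sketch of the kind of change being asked for; the types and the output format below are hypothetical simplifications, not the actual stout/Mesos {{Resources}} code:

{code}
// Hypothetical, simplified sketch -- not the actual Mesos resource types.
#include <iostream>
#include <string>

struct DiskSource { std::string type; std::string root; };  // e.g. MOUNT, /mnt/data0

struct DiskResource
{
  std::string role;
  std::string principal;
  std::string persistence;  // e.g. "kafka_0:data"
  double megabytes;
  DiskSource source;        // Currently not stringified.
};

std::ostream& operator<<(std::ostream& out, const DiskResource& disk)
{
  out << "disk(" << disk.role << ", " << disk.principal << ")";
  if (!disk.persistence.empty()) {
    out << "[" << disk.persistence << "]";
  }
  if (!disk.source.root.empty()) {
    // The requested addition: make the source visible in error messages.
    out << "{" << disk.source.type << ":" << disk.source.root << "}";
  }
  return out << ":" << disk.megabytes;
}

int main()
{
  DiskResource disk{"kafka", "kafka", "kafka_0:data", 960679,
                    {"MOUNT", "/mnt/data0"}};
  // Prints: disk(kafka, kafka)[kafka_0:data]{MOUNT:/mnt/data0}:960679
  std::cout << disk << std::endl;
  return 0;
}
{code}

With the source included, a framework reserving the wrong kind of disk would see the mismatch directly in the "uses more resources than available" message.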



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5824) Include disk source information in stringification

2017-03-06 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896945#comment-15896945
 ] 

Alexander Rojas commented on MESOS-5824:


Review is closed due to inactivity.

> Include disk source information in stringification
> --
>
> Key: MESOS-5824
> URL: https://issues.apache.org/jira/browse/MESOS-5824
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Affects Versions: 0.28.2
>Reporter: Tim Harper
>Priority: Minor
>  Labels: mesosphere
>
> Some frameworks (like kafka_mesos) ignore the Source field when trying to 
> reserve an offered mount or path persistent volume; the resulting error 
> message is bewildering:
> {code:none}
> Task uses more resources
> cpus(*):4; mem(*):4096; ports(*):[31000-31000]; disk(kafka, 
> kafka)[kafka_0:data]:960679
> than available
> cpus(*):32; mem(*):256819;  ports(*):[31000-32000]; disk(kafka, 
> kafka)[kafka_0:data]:960679;   disk(*):240169;
> {code}
> The stringification of disk resources should include source information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-03-06 Thread Wojciech Sielski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wojciech Sielski updated MESOS-7210:

Summary: MESOS HTTP checks doesn't work when mesos runs with 
--docker_mesos_image ( pid namespace mismatch )  (was: MESOS HTTP checks 
doesn't work when mesos runs with --docker_mesos_image ( pid namespace 
missmatch ))

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>
> When running mesos-slave with the "docker_mesos_image" option, like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from a container that was started with the "pid: host" option, like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks, like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option causes the newly started mesos 
> job not to use the "pid: host" option that the parent container was started 
> with, but to have its own PID namespace (so regardless of whether the parent 
> container was started with "pid: host", it will never be able to find the 
> PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace missmatch )

2017-03-06 Thread Wojciech Sielski (JIRA)
Wojciech Sielski created MESOS-7210:
---

 Summary: MESOS HTTP checks doesn't work when mesos runs with 
--docker_mesos_image ( pid namespace missmatch )
 Key: MESOS-7210
 URL: https://issues.apache.org/jira/browse/MESOS-7210
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 1.1.0
 Environment: Ubuntu 16.04.02
Docker version 1.13.1
mesos 1.1.0, runs from container
docker containers  spawned by marathon 1.4.1
Reporter: Wojciech Sielski


When running mesos-slave with the "docker_mesos_image" option, like:
{code}
--master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
--executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
--docker_stop_timeout=5secs  --gc_delay=1days  
--docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
--work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
{code}

from a container that was started with the "pid: host" option, like:
{code}
  net:host
  privileged: true
  pid:host
{code}

and an example Marathon job that uses MESOS_HTTP checks, like:
{code}
{
 "id": "python-example-stable",
 "cmd": "python3 -m http.server 8080",
 "mem": 16,
 "cpus": 0.1,
 "instances": 2,
 "container": {
   "type": "DOCKER",
   "docker": {
 "image": "python:alpine",
 "network": "BRIDGE",
 "portMappings": [
{ "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
 ]
   }
 },
 "env": {
   "SERVICE_NAME" : "python"
 },
 "healthChecks": [
   {
 "path": "/",
 "portIndex": 0,
 "protocol": "MESOS_HTTP",
 "gracePeriodSeconds": 30,
 "intervalSeconds": 10,
 "timeoutSeconds": 30,
 "maxConsecutiveFailures": 3
   }
 ]
}
{code}

I see errors like:
{code}
F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
namespace of task (pid: '13527'): Pid 13527 does not exist
*** Check failure stack trace: ***
@ 0x7f51770b0c1d  google::LogMessage::Fail()
@ 0x7f51770b29d0  google::LogMessage::SendToLog()
@ 0x7f51770b0803  google::LogMessage::Flush()
@ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f517647ce46  
_ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
@ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
@ 0x7f5177068167  process::internal::cloneChild()
@ 0x7f5177065c32  process::subprocess()
@ 0x7f5176481a9d  
mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
@ 0x7f51764831f7  
mesos::internal::health::HealthCheckerProcess::_healthCheck()
@ 0x7f517701f38c  process::ProcessBase::visit()
@ 0x7f517702c8b3  process::ProcessManager::resume()
@ 0x7f517702fb77  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f51754ddc80  (unknown)
@ 0x7f5174cf06ba  start_thread
@ 0x7f5174a2682d  (unknown)
I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as health 
check still in grace period
{code}

It looks like the docker_mesos_image option causes the newly started mesos job 
not to use the "pid: host" option that the parent container was started with, 
but to have its own PID namespace (so regardless of whether the parent 
container was started with "pid: host", it will never be able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)