[jira] [Commented] (MESOS-4969) improve overlayfs detection

2017-08-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121035#comment-16121035
 ] 

James Peach commented on MESOS-4969:


If the filesystem is built into the kernel, it will appear in 
{{/proc/filesystems}}.
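
For illustration, a minimal sketch of such a check (a hypothetical helper, not 
the actual Mesos detection code):

{code}
// Hypothetical sketch: look a filesystem up in /proc/filesystems.
// Each line has the form "[nodev]\t<fsname>", so we match on the
// last whitespace-separated token.
#include <fstream>
#include <sstream>
#include <string>

bool listedInProcFilesystems(const std::string& fsname)
{
  std::ifstream file("/proc/filesystems");
  std::string line;
  while (std::getline(file, line)) {
    std::istringstream tokens(line);
    std::string token, name;
    while (tokens >> token) {
      name = token;  // Keep the last token: the filesystem name.
    }
    if (name == fsname) {
      return true;
    }
  }
  return false;
}

// Usage: listedInProcFilesystems("overlay"), with "overlayfs" as a
// fallback name on older kernels.
{code}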

> improve overlayfs detection
> ---
>
> Key: MESOS-4969
> URL: https://issues.apache.org/jira/browse/MESOS-4969
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, storage
>Reporter: James Peach
>Priority: Minor
>
> On my Fedora 23, overlayfs is a module that is not loaded by default 
> (attempting to mount an overlayfs automatically triggers the module loading). 
> However {{mesos-slave}} won't start until I manually load the module, since it 
> is not listed in {{/proc/filesystems}} until it is loaded.
> It would be nice if there was a more reliable way to determine overlayfs 
> support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120954#comment-16120954
 ] 

Till Toenshoff commented on MESOS-7874:
---

See also https://issues.apache.org/jira/browse/MESOS-7875 for a hacky example.

> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so the module cannot stop the task-running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for a subprocess's or 
> an RPC's result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120921#comment-16120921
 ] 

Till Toenshoff edited comment on MESOS-7874 at 8/10/17 1:28 AM:


We do indeed split this work across hooks and isolators, depending on the 
containerizer used.

The docker containerizer will, at some hopefully not-too-distant point in the 
future, be deprecated in favor of the mesos containerizer. The same is true for 
the command executor - we are working towards deprecating it in favor of the 
default executor.

Introducing new hooks is generally something that we are trying to avoid, if 
possible. We do, however, have a relatively low barrier to changing the 
signature of hooks - so changing a formerly blocking hook into a non-blocking 
one ({{Future<...>}}) is something we have done before.

All this said, it seems we should try to aim for your second option by using 
both an isolator and a hook.


was (Author: tillt):
We do indeed split this work across hooks and isolators, depending on the 
containerizer used.

The docker containerizer will, at some hopefully not-too-distant point in the 
future, be deprecated in favor of the mesos containerizer. The same is true for 
the command executor - we are working towards deprecating it in favor of the 
default executor.

Introducing new hooks is generally something that we are trying to avoid, if 
possible. We do, however, have a relatively low barrier to changing the 
signature of hooks - so changing a formerly blocking hook into a non-blocking 
one ({{Future<...>}}) is something we have done before.

All this said, it seems we should try to aim for your second option by using 
both an isolator and a hook.

> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so the module cannot stop the task-running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for a subprocess's or 
> an RPC's result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120921#comment-16120921
 ] 

Till Toenshoff commented on MESOS-7874:
---

We do indeed split this work across hooks and isolators, depending on the 
containerizer used.

The docker containerizer will, at some hopefully not-too-distant point in the 
future, be deprecated in favor of the mesos containerizer. The same is true for 
the command executor - we are working towards deprecating it in favor of the 
default executor.

Introducing new hooks is generally something that we are trying to avoid, if 
possible. We do, however, have a relatively low barrier to changing the 
signature of hooks - so changing a formerly blocking hook into a non-blocking 
one ({{Future<...>}}) is something we have done before.
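
For illustration, a minimal sketch of what such a signature change looks like 
(the hook name and parameters are hypothetical, not the actual hook.hpp API):

{code}
// Hypothetical hook, for illustration only -- not the actual Mesos
// hook API. The change described above is in the return type.

// Before: blocking; the agent waits for the module to return.
virtual Result<Labels> examplePreLaunchDecorator(
    const TaskInfo& taskInfo);

// After: non-blocking; the agent chains on the returned future, so
// the module can wait on a subprocess or an RPC before satisfying it.
virtual process::Future<Labels> examplePreLaunchDecorator(
    const TaskInfo& taskInfo);
{code}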

All this said, it seems we should try to aim for your second option by using 
both an isolator and a hook.

> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so the module cannot stop the task-running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for a subprocess's or 
> an RPC's result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-7874:
--
Shepherd: Till Toenshoff  (was: Till)

> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so the module cannot stop the task-running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for a subprocess's or 
> an RPC's result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7875) Consider offering isolator & hook modules as examples.

2017-08-09 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120911#comment-16120911
 ] 

Till Toenshoff commented on MESOS-7875:
---

The following quick hack might be a good start: 
https://github.com/tillt/module_example

> Consider offering isolator & hook modules as examples.
> --
>
> Key: MESOS-7875
> URL: https://issues.apache.org/jira/browse/MESOS-7875
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Till Toenshoff
>Priority: Minor
>  Labels: example, modules
>
> To enhance the information flow for identical tasks on both the mesos and the 
> docker containerizer, developers have to implement a hook module for the 
> docker containerizer and an isolator module for the mesos containerizer. 
> We should consider offering examples that do just this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7875) Consider offering isolator & hook modules as examples.

2017-08-09 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-7875:
-

 Summary: Consider offering isolator & hook modules as examples.
 Key: MESOS-7875
 URL: https://issues.apache.org/jira/browse/MESOS-7875
 Project: Mesos
  Issue Type: Improvement
Reporter: Till Toenshoff
Priority: Minor


To enhance the information flow for identical tasks on both the mesos and the 
docker containerizer, developers have to implement a hook module for the docker 
containerizer and an isolator module for the mesos containerizer.

We should consider offering examples that do just this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Description: 
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so the module cannot stop the task-running 
sequence;
2. It's a blocking version, which means we cannot wait for a subprocess's or an 
RPC's result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.

  was:
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveLaunchTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so the module cannot stop the task-running 
sequence;
2. It's a blocking version, which means we cannot wait for a subprocess's or an 
RPC's result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.


> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so the module cannot stop the task-running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for a subprocess's or 
> an RPC's result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Description: 
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveLaunchTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so the module cannot stop the task-running 
sequence;
2. It's a blocking version, which means we cannot wait for a subprocess's or an 
RPC's result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.

  was:
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{masterLaunchTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so the module cannot stop the task-running 
sequence;
2. It's a blocking version, which means we cannot wait for a subprocess's or an 
RPC's result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.


> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveLaunchTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so the module cannot stop the task-running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for a subprocess's or 
> an RPC's result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7874:


 Summary: Provide a consistent non-blocking preLaunch hook
 Key: MESOS-7874
 URL: https://issues.apache.org/jira/browse/MESOS-7874
 Project: Mesos
  Issue Type: Improvement
  Components: modules
Reporter: Zhitao Li
Assignee: Zhitao Li


Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{masterLaunchTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so the module cannot stop the task-running 
sequence;
2. It's a blocking version, which means we cannot wait for a subprocess's or an 
RPC's result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7871) Agent fails assertion during request to '/state'

2017-08-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120820#comment-16120820
 ] 

Greg Mann commented on MESOS-7871:
--

{code}
commit db8d097c9565e9b6f60531f9eb3f993a6c60fd72
Author: Greg Mann 
Date:   Wed Aug 9 10:00:46 2017 -0700

Added a test to verify the fix for a failed agent assertion.

This patch adds 'SlaveTest.GetStateTaskGroupPending', which confirms
the fix for MESOS-7871. The test verifies that requests to the agent's
'/state' endpoint are successful when there are pending tasks on the
agent which were launched as part of a task group.

Review: https://reviews.apache.org/r/61534
{code}
{code}
commit 4f4807394944d23d3a6f79249ce49e2494a88350
Author: Andrei Budnik 
Date:   Wed Aug 9 11:06:40 2017 -0700

Moved task validation from `getExecutorInfo` to `runTask` on agent.

Previously, `getExecutorInfo` was called only in `runTask`, so it
asserted the invariant that a task should have either CommandInfo
or ExecutorInfo set but not both. This is true for individual
tasks, but it is not necessarily true for tasks which are part of a
task group, since the master injects the task group's ExecutorInfo.

Now `getExecutorInfo` is also called to calculate allocated
resources of tasks which might be part of a task group, which could
violate this invariant, so the assertion has been moved.

Review: https://reviews.apache.org/r/61524/
{code}
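
For illustration, the invariant described in the second commit could be 
asserted roughly as follows (a hypothetical sketch, not the actual slave.cpp 
code):

{code}
// Hypothetical sketch of the invariant: a standalone task carries
// exactly one of CommandInfo or ExecutorInfo, but a task-group task
// violates this once the master injects the group's ExecutorInfo.
CHECK(task.has_command() != task.has_executor())
  << "Task should have either CommandInfo or ExecutorInfo set, but not both";
{code}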

> Agent fails assertion during request to '/state'
> 
>
> Key: MESOS-7871
> URL: https://issues.apache.org/jira/browse/MESOS-7871
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>  Labels: mesosphere
> Fix For: 1.4.0
>
>
> While processing requests to {{/state}}, the Mesos agent calls 
> {{Framework::allocatedResources()}}, which in turn calls 
> {{Slave::getExecutorInfo()}} on executors associated with the framework's 
> pending tasks.
> In the case of tasks launched as part of task groups, this leads to the 
> failure of the assertion 
> [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985].
>  This means that the check will fail if the agent processes a request to 
> {{/state}} at a time when it has pending tasks launched as part of a task 
> group.
> This assertion should be removed since this helper function is now used with 
> task groups.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7815) Add gauge for master event processing time

2017-08-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7815:
---
Labels: mesosphere metrics observability  (was: mesosphere metrics 
reliability)

> Add gauge for master event processing time
> --
>
> Key: MESOS-7815
> URL: https://issues.apache.org/jira/browse/MESOS-7815
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Bannier
>  Labels: mesosphere, metrics, observability
>
> To diagnose cases where, e.g., the master is backlogged, looking at just 
> {{event_queue_messages}} will only tell us about the size of the queue; 
> diagnosing whether this is due to a higher message arrival rate or slower 
> processing requires complicated inference from other metrics.
> We should provide metrics that characterize the time it takes to process 
> messages in the queue, ideally with statistics over some window. This would 
> allow better identification of slow requests.
> We should also consider ways of characterizing the arrival rate via some 
> metric with statistics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-1719) Master should persist active frameworks information

2017-08-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1719:
---
Labels: mesosphere reliability  (was: mesosphere)

> Master should persist active frameworks information
> ---
>
> Key: MESOS-1719
> URL: https://issues.apache.org/jira/browse/MESOS-1719
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Yongqiao Wang
>  Labels: mesosphere, reliability
>
> https://issues.apache.org/jira/browse/MESOS-1219 disallows completed 
> frameworks from re-registering with the same framework id, as long as the 
> master doesn't fail over.
> This ticket tracks the work to make this hold across master failover using 
> the registrar.
> There are some open questions that need to be addressed:
> --> Should the registry contain framework ids only, or framework infos too?
> For disallowing completed frameworks from re-registering, persisting 
> framework ids is enough. But if, in the future, we want to disallow
> frameworks from re-registering when some parts of the framework info have
> changed, then we need to persist the info too.
> --> How to update the framework info.
> Currently frameworks are allowed to update the framework info while
> re-registering, but it only takes effect on the master when the master fails
> over, and on the slave when the slave fails over. How should things change
> when we persist the framework info?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7747) Improve metrics around active subscribers.

2017-08-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7747:
---
Labels: mesosphere metrics observability  (was: mesosphere metrics 
reliability)

> Improve metrics around active subscribers.
> --
>
> Key: MESOS-7747
> URL: https://issues.apache.org/jira/browse/MESOS-7747
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere, metrics, observability
>
> Active subscribers to, e.g., the Mesos streaming API may influence Mesos 
> master performance. To improve triaging and to gain a better understanding of 
> master workload, we should add metrics to track active subscribers, send 
> queue size, and so on.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint

2017-08-09 Thread Deepak Goel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Goel updated MESOS-7873:
---
Affects Version/s: 1.4.0

> Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
> -
>
> Key: MESOS-7873
> URL: https://issues.apache.org/jira/browse/MESOS-7873
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.4.0
>Reporter: Deepak Goel
>Assignee: Deepak Goel
>
> The mesos "state" endpoint doesn't expose 
> "ExecutorInfo.ContainerInfo.NetworkInfo" which prohibits any service running 
> on mesos to make use of port mapping information in the NetworkInfo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint

2017-08-09 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7873:
--
Component/s: HTTP API

> Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
> -
>
> Key: MESOS-7873
> URL: https://issues.apache.org/jira/browse/MESOS-7873
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.4.0
>Reporter: Deepak Goel
>Assignee: Deepak Goel
>
> The mesos "state" endpoint doesn't expose 
> "ExecutorInfo.ContainerInfo.NetworkInfo" which prohibits any service running 
> on mesos to make use of port mapping information in the NetworkInfo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint

2017-08-09 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7873:
--
Component/s: (was: HTTP API)

> Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
> -
>
> Key: MESOS-7873
> URL: https://issues.apache.org/jira/browse/MESOS-7873
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.4.0
>Reporter: Deepak Goel
>Assignee: Deepak Goel
>
> The mesos "state" endpoint doesn't expose 
> "ExecutorInfo.ContainerInfo.NetworkInfo" which prohibits any service running 
> on mesos to make use of port mapping information in the NetworkInfo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint

2017-08-09 Thread Deepak Goel (JIRA)
Deepak Goel created MESOS-7873:
--

 Summary: Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos 
`state` endpoint
 Key: MESOS-7873
 URL: https://issues.apache.org/jira/browse/MESOS-7873
 Project: Mesos
  Issue Type: Bug
  Components: network
Reporter: Deepak Goel
Assignee: Deepak Goel


The mesos "state" endpoint doesn't expose 
"ExecutorInfo.ContainerInfo.NetworkInfo" which prohibits any service running on 
mesos to make use of port mapping information in the NetworkInfo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-09 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-7872:
--
Affects Version/s: 1.4.0

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>  Labels: framework, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391 73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658 73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843 73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-09 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-7872:
-

 Summary: Scheduler hang when registration fails (due to bad role)
 Key: MESOS-7872
 URL: https://issues.apache.org/jira/browse/MESOS-7872
 Project: Mesos
  Issue Type: Bug
Reporter: Till Toenshoff


I'm finding that if framework registration fails, the mesos driver client will 
hang indefinitely with the following output:
{noformat}
I0809 20:04:22.479391 73 sched.cpp:1187] Got error ''FrameworkInfo.role' is 
not a valid role: Role '/test/role/slashes' cannot start with a slash'
I0809 20:04:22.479658 73 sched.cpp:2055] Asked to abort the driver
I0809 20:04:22.479843 73 sched.cpp:1233] Aborting framework 
{noformat}

I'd have expected one or both of the following:
- SchedulerDriver.run() should have exited with a failed Proto.Status of some 
form
- Scheduler.error() should have been invoked when the "Got error" occurred

Steps to reproduce:
- Launch a scheduler instance, have it register with a known-bad framework 
info. In this case a role containing slashes was used
- Observe that the scheduler continues in a TASK_RUNNING state despite the 
failed registration. From all appearances it looks like the Scheduler 
implementation isn't invoked at all

I'd guess that because this failure happens before framework registration, 
there's some error handling that isn't fully initialized at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7675) Isolate network ports.

2017-08-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072946#comment-16072946
 ] 

James Peach edited comment on MESOS-7675 at 8/9/17 8:47 PM:


Updated review chain:

| [r/61536|https://reviews.apache.org/r/61536] | Added network ports isolator 
socket utilities tests. |
| [r/60593|https://reviews.apache.org/r/60593] | Test the `network/ports` 
isolator recovery. |
| [r/60765|https://reviews.apache.org/r/60765] | Added basic `network/ports` 
isolator tests. |
| [r/60903|https://reviews.apache.org/r/60903] | Added the `network/ports` 
isolator to the Mesos containerizer. |
| [r/60766|https://reviews.apache.org/r/60766] | Ignored containers that join 
CNI networks. |
| [r/60591|https://reviews.apache.org/r/60591] | Optionally isolate only the 
agent network ports. |
| [r/60592|https://reviews.apache.org/r/60592] | Configure the `network/ports` 
isolator watch interval. |
| [r/60496|https://reviews.apache.org/r/60496] | Added socket checking to the 
network ports isolator. |
| [r/60495|https://reviews.apache.org/r/60495] | Added network ports isolator 
listen socket utilities. |
| [r/61538|https://reviews.apache.org/r/61538] | Used common port range 
interval code in the port_mapping isolator. |
| [r/60492|https://reviews.apache.org/r/60492] | Added a `network/ports` 
isolator skeleton. |
| [r/60902|https://reviews.apache.org/r/60902] | Moved the libnl3 configure 
checks into a macro. |
| [r/60836|https://reviews.apache.org/r/60836] | Added IntervalSet to Ranges 
conversion helper declarations. |
| [r/60901|https://reviews.apache.org/r/60901] | Use a consistent preprocessor 
check for ENABLE_PORT_MAPPING_ISOLATOR. |
| [r/60764|https://reviews.apache.org/r/60764] | Refactored isolator dependency 
checking. |
| [r/60494|https://reviews.apache.org/r/60494] | Exposed LinuxLauncher cgroups 
helper. |
| [r/60493|https://reviews.apache.org/r/60493] | Removed diagnostic socket IPv4 
assumptions. |
| [r/60491|https://reviews.apache.org/r/60491] | Captured the inode when 
scanning for sockets. |
| [r/60594|https://reviews.apache.org/r/60594] | Added a `network/ports` 
isolator nested container test. |



was (Author: jamespeach):
Updated review chain:

| [r/60765|https://reviews.apache.org/r/60765] | Add basic `network/ports` 
isolator tests. |
| [r/60766|https://reviews.apache.org/r/60766] | Ignore containers that join 
CNI networks. |
| [r/60594|https://reviews.apache.org/r/60594] | Add a `network/ports` isolator 
nested container test. |
| [r/60593|https://reviews.apache.org/r/60593] | Test the `network/ports` 
isolator recovery. |
| [r/60592|https://reviews.apache.org/r/60592] | Configure the `network/ports` 
isolator watch interval. |
| [r/60591|https://reviews.apache.org/r/60591] | Optionally isolate only the 
agent network ports. |
| [r/60496|https://reviews.apache.org/r/60496] | Add socket checking to the 
network ports isolator. |
| [r/60495|https://reviews.apache.org/r/60495] | Network ports isolator listen 
socket utilities. |
| [r/60767|https://reviews.apache.org/r/60767] | Allow `network/ports` to 
co-exist with other network isolators. |
| [r/60764|https://reviews.apache.org/r/60764] | Refactor isolator dependency 
checking. |
| [r/60492|https://reviews.apache.org/r/60492] | Add network/ports isolator 
skeleton. |
| [r/60494|https://reviews.apache.org/r/60494] | Expose LinuxLauncher cgroups 
helper. |
| [r/60493|https://reviews.apache.org/r/60493] | Remove diagnostic socket IPv4 
assumptions. |
| [r/60491|https://reviews.apache.org/r/60491] | Capture the inode when 
scanning for sockets. |

> Isolate network ports.
> --
>
> Key: MESOS-7675
> URL: https://issues.apache.org/jira/browse/MESOS-7675
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> If a task uses network ports, there is no isolator that can enforce that it 
> only listens on the ports that it has resources for. Implement a ports 
> isolator that can limit tasks to listen only on allocated TCP ports.
> Roughly, the algorithm for this follows what standard tools like {{lsof}} and 
> {{ss}} do.
> * Find all the listening TCP sockets (using netlink)
> * Index the sockets by their inode (from the netlink information)
> * Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} 
> links)
> * For each open socket, check whether its inode (given in the link target) is 
> in the set of listening sockets that we scanned (a sketch of this step 
> follows below)
> * If the socket is a listening socket and the corresponding PID is in the 
> task, send a resource limitation for the task
> Matching pids to tasks depends on using cgroup isolation, otherwise we would 
> have to build a full process tree, which would be nice to avoid.
> Scanning all the open sockets 
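
As a hedged illustration of the inode-matching step above, here is a sketch of 
collecting socket inodes for one process (a hypothetical helper, not the 
isolator's actual code):

{code}
// Hypothetical sketch: resolve /proc/<pid>/fd/* links and collect
// socket inodes; these can then be matched against the listening
// socket inodes obtained from netlink.
#include <dirent.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdio>
#include <set>
#include <string>

std::set<ino_t> socketInodes(pid_t pid)
{
  std::set<ino_t> inodes;
  const std::string fdDir = "/proc/" + std::to_string(pid) + "/fd";

  DIR* dir = opendir(fdDir.c_str());
  if (dir == nullptr) {
    return inodes;  // The process may have exited; skip it.
  }

  struct dirent* entry;
  while ((entry = readdir(dir)) != nullptr) {
    char target[64] = {0};
    const std::string link = fdDir + "/" + entry->d_name;
    const ssize_t length = readlink(link.c_str(), target, sizeof(target) - 1);

    // Socket fds read back as "socket:[<inode>]".
    unsigned long inode = 0;
    if (length > 0 && sscanf(target, "socket:[%lu]", &inode) == 1) {
      inodes.insert(static_cast<ino_t>(inode));
    }
  }

  closedir(dir);
  return inodes;
}
{code}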

[jira] [Commented] (MESOS-4969) improve overlayfs detection

2017-08-09 Thread Aaron Wood (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120586#comment-16120586
 ] 

Aaron Wood commented on MESOS-4969:
---

Another thought: we are building/using our own kernel for our nodes, which will 
have this module built in. Doesn't that mean this /proc check will fail?

> improve overlayfs detection
> ---
>
> Key: MESOS-4969
> URL: https://issues.apache.org/jira/browse/MESOS-4969
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, storage
>Reporter: James Peach
>Priority: Minor
>
> On my Fedora 23, overlayfs is a module that is not loaded by default 
> (attempting to mount an overlayfs automatically triggers the module loading). 
> However {{mesos-slave}} won't start until I manually load the module, since it 
> is not listed in {{/proc/filesystems}} until it is loaded.
> It would be nice if there was a more reliable way to determine overlayfs 
> support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4969) improve overlayfs detection

2017-08-09 Thread Aaron Wood (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120582#comment-16120582
 ] 

Aaron Wood commented on MESOS-4969:
---

Why not let the OS load it on demand instead of prematurely checking and 
failing right away? That way no one needs to explicitly enable the module and 
add it to /etc/modules.
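
For illustration, an on-demand probe could look roughly like this (a 
hypothetical sketch, not Mesos code; it needs root and relies on the kernel 
loading the module during the mount attempt):

{code}
// Hypothetical sketch: probe overlayfs support by attempting a
// mount, which lets the kernel load the module on demand. A truly
// unknown filesystem type fails with ENODEV, while a present but
// misused overlay type fails differently (e.g. EINVAL because the
// lowerdir/upperdir options are missing).
#include <sys/mount.h>

#include <cerrno>

bool overlayfsProbablySupported()
{
  if (mount("overlay", "/mnt", "overlay", 0, nullptr) == 0) {
    umount("/mnt");  // Unexpectedly succeeded; clean up.
    return true;
  }
  return errno != ENODEV;  // ENODEV: filesystem type is unknown.
}
{code}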

> improve overlayfs detection
> ---
>
> Key: MESOS-4969
> URL: https://issues.apache.org/jira/browse/MESOS-4969
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, storage
>Reporter: James Peach
>Priority: Minor
>
> On my Fedora 23, overlayfs is a module that is not loaded by default 
> (attempting to mount an overlayfs automatically triggers the module loading). 
> However {{mesos-slave}} won't start until I manually load the module, since it 
> is not listed in {{/proc/filesystems}} until it is loaded.
> It would be nice if there was a more reliable way to determine overlayfs 
> support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7814) Improve the test frameworks.

2017-08-09 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-7814:
--
Sprint: Mesosphere Sprint 61

> Improve the test frameworks.
> 
>
> Key: MESOS-7814
> URL: https://issues.apache.org/jira/browse/MESOS-7814
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: mesosphere, newbie
>
> These improvements include three main points:
> * Adding a {{name}} flag to certain frameworks to distinguish between 
> instances.
> * Cleaning up the code style of the frameworks.
> * For frameworks with custom executors, such as the balloon framework, adding 
> an {{executor_extra_uris}} flag containing URIs that will be passed to the 
> {{command_info}} of the executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-08-09 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120473#comment-16120473
 ] 

Yan Xu commented on MESOS-7714:
---

[~mcypark] did you get a chance to work on this?

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading an agent correctly.
> The checkpointed resources are handled correctly, but the resources within
> the {{SlaveInfo}} message, as well as tasks and executors, also need to be
> downgraded correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5078) Document TaskStatus reasons

2017-08-09 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-5078:
---
Sprint: Mesosphere Sprint 61

> Document TaskStatus reasons
> ---
>
> Key: MESOS-5078
> URL: https://issues.apache.org/jira/browse/MESOS-5078
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Benno Evers
>  Labels: documentation, mesosphere, newbie++
>
> We should document the possible {{reason}} values that can be found in the 
> {{TaskStatus}} message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7871) Agent fails assertion during request to '/state'

2017-08-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120422#comment-16120422
 ] 

Greg Mann commented on MESOS-7871:
--

Test and comment updates:
https://reviews.apache.org/r/61534/
https://reviews.apache.org/r/61535/

> Agent fails assertion during request to '/state'
> 
>
> Key: MESOS-7871
> URL: https://issues.apache.org/jira/browse/MESOS-7871
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>  Labels: mesosphere
>
> While processing requests to {{/state}}, the Mesos agent calls 
> {{Framework::allocatedResources()}}, which in turn calls 
> {{Slave::getExecutorInfo()}} on executors associated with the framework's 
> pending tasks.
> In the case of tasks launched as part of task groups, this leads to the 
> failure of the assertion 
> [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985].
>  This means that the check will fail if the agent processes a request to 
> {{/state}} at a time when it has pending tasks launched as part of a task 
> group.
> This assertion should be removed since this helper function is now used with 
> task groups.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7869) Build fails with `--disable-zlib` or `--with-zlib=DIR`

2017-08-09 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7869:
---
Story Points: 1  (was: 2)

> Build fails with `--disable-zlib` or `--with-zlib=DIR`
> --
>
> Key: MESOS-7869
> URL: https://issues.apache.org/jira/browse/MESOS-7869
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.4.0
>
>
> ZLib has been a required library for Mesos and libprocess, and 
> {{--disable-zlib}} no longer works, so it should be removed.
> Also, when {{--with-zlib=DIR}} is specified, the protobuf build will fail 
> because protobuf does not support specifying a customized zlib path through 
> {{--with-zlib}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7023) IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky

2017-08-09 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7023:
--
Affects Version/s: 1.2.2

> IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky
> ---
>
> Key: MESOS-7023
> URL: https://issues.apache.org/jira/browse/MESOS-7023
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, test
>Affects Versions: 1.2.2
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, without libevent/SSL
>Reporter: Greg Mann
>Assignee: Kevin Klues
>  Labels: debugging, flaky
> Attachments: IOSwitchboardTest. 
> RecoverThenKillSwitchboardContainerDestroyed.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/containerizer/io_switchboard_tests.cpp:1052: Failure
> Value of: statusFailed->reason()
>   Actual: 1
> Expected: TaskStatus::REASON_IO_SWITCHBOARD_EXITED
> Which is: 27
> {code}
> Find full log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky

2017-08-09 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6135:
--
Affects Version/s: 1.2.2

> ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky
> --
>
> Key: MESOS-6135
> URL: https://issues.apache.org/jira/browse/MESOS-6135
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1, 1.2.2
> Environment: Ubuntu 14, libev, non-SSL
>Reporter: Greg Mann
>  Labels: logging, mesosphere
>
> Observed in our internal CI:
> {code}
> [19:53:51] :   [Step 10/10] [ RUN  ] 
> ContainerLoggerTest.LOGROTATE_RotateInSandbox
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.460055 23729 cluster.cpp:157] 
> Creating default 'local' authorizer
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.468907 23729 leveldb.cpp:174] 
> Opened db in 8.730166ms
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472470 23729 leveldb.cpp:181] 
> Compacted db in 3.544028ms
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472491 23729 leveldb.cpp:196] 
> Created db iterator in 3678ns
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472496 23729 leveldb.cpp:202] 
> Seeked to beginning of db in 673ns
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472499 23729 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 256ns
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472510 23729 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472709 23744 recover.cpp:451] 
> Starting replica recovery
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472820 23748 recover.cpp:477] 
> Replica is in EMPTY status
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473059 23748 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(177)@172.30.2.89:44578
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473146 23746 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473234 23745 recover.cpp:568] 
> Updating replica status to STARTING
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473629 23747 master.cpp:379] 
> Master 6d1b2727-f42d-446b-b2f8-a9f7e7667340 (ip-172-30-2-89.mesosphere.io) 
> started on 172.30.2.89:44578
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473644 23747 master.cpp:381] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ceLmd7/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/ceLmd7/master" --zk_session_timeout="10secs"
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473832 23747 master.cpp:431] 
> Master only allowing authenticated frameworks to register
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473844 23747 master.cpp:445] 
> Master only allowing authenticated agents to register
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473850 23747 master.cpp:458] 
> Master only allowing authenticated HTTP frameworks to register
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473856 23747 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/ceLmd7/credentials'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473975 23747 master.cpp:503] Using 
> default 'crammd5' authenticator
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474028 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474097 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474161 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474242 23747 master.cpp:583] 
> Authorization enabled
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474308 23744 hierarchical.cpp:149] 
> Initialized hierarchical 

[jira] [Commented] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.

2017-08-09 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120104#comment-16120104
 ] 

Alexander Rukletsov commented on MESOS-6743:


In case of an error, the docker daemon and the container itself might be fine 
and operating normally; it’s the communication between mesos and the daemon 
which is broken. All docker stop failures are supposed to return a [non-zero 
exit code|https://github.com/spf13/cobra/blob/9c28e4bbd74e5c3ed7aacbc552b2cab7cfdfe744/cobra/cmd/init.go#L187],
even though the docker docs [say 
nothing|https://docs.docker.com/engine/reference/commandline/stop] about it. It 
looks like we cannot reliably distinguish why the command fails; errors may 
originate [in the 
client|https://github.com/moby/moby/blob/e9cd2fef805c8182b719d489967fb4d1aa34eecd/client/request.go#L41]
or [in the 
daemon|https://github.com/moby/moby/blob/77c9728847358a3ed3581d828fb0753017e1afd3/daemon/stop.go#L44].
Hence the container might or might not have received the signal (the logs we 
have hint at the latter).

In case of hanging (or timing out) commands, the docker daemon is likely 
malfunctioning, while the container might or might not be fine and operating 
normally. Here is what we can do.

h3. Exit docker executor
+ This will trigger a terminal update for the task, allowing Marathon to start 
a new instance. 
\- The container might still be OK and running, but unknown to both Mesos and 
Marathon, which might be a problem for some apps.
\- The container becomes orphaned and consumes unallocated resources, until the 
next agent restart (if {{--docker_kill_orphans}} is set).

h3. Forcible kill task
Call {{os::killtree()}} or {{os::kill()}} on the container pid and then exit 
the executor.
+ This will trigger a terminal update for the task, allowing Marathon to start 
a new instance.
+ The container is not orphaned.
\- Might kill an unrelated process due to a pid race.
\- The task’s kill policy might be violated, since the container is not given 
enough time to terminate gracefully. This is particularly concerning if the 
daemon and the container are operating normally. It makes more sense in case of 
a timeout, see e.g. [https://reviews.apache.org/r/44571/].

We can implement our own escalation logic without using docker commands, 
similar to what we [do for the command 
executor|https://github.com/apache/mesos/blob/85af46f93d5625006d01bdcf78bba9fa547b3313/src/launcher/executor.cpp#L850-L871].
However, this does not look right to me, since the docker executor is supposed 
to rely on the docker CLI for task management.
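
For illustration, here is a minimal escalation sketch, assuming the container 
pid is known and leads its own process group. {{escalated()}} is a hypothetical 
helper, not actual Mesos code, and the real command executor uses libprocess 
timers rather than a blocking sleep:

{code:cpp}
#include <signal.h>
#include <sys/types.h>

#include <chrono>
#include <thread>

// Hypothetical helper: SIGTERM the container's process group, wait out the
// kill grace period, then SIGKILL whatever is left. A negative pid signals
// the whole group, similar in spirit to stout's os::killtree().
void escalated(pid_t pid, const std::chrono::seconds& gracePeriod)
{
  kill(-pid, SIGTERM);

  std::this_thread::sleep_for(gracePeriod);

  // kill() with signal 0 only checks whether the group still exists.
  if (kill(-pid, 0) == 0) {
    kill(-pid, SIGKILL);
  }
}
{code}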

h3. Retry
+ The container is not orphaned.
\- Until the task is killed, Marathon sees it as {{TASK_KILLING}} and hence 
will not start other instances.
\- If retrying is not successful after some time (how long?) or number of 
attempts (how many?), what shall we do? (A bounded-retry sketch follows this 
list.)
\- If docker commands are hanging, we must make sure they are terminated 
properly.
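
A purely illustrative bounded-retry sketch; the attempt count and backoff are 
exactly the open questions above, and a production version would need a timeout 
around each {{docker stop}} since {{std::system()}} blocks:

{code:cpp}
#include <chrono>
#include <cstdlib>
#include <string>
#include <thread>

// Retry `docker stop` a bounded number of times; returns whether it succeeded.
bool stopWithRetries(const std::string& containerId)
{
  const int maxAttempts = 3;              // how many? open question
  const std::chrono::seconds backoff(5);  // how long between attempts? open question

  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    // NOTE: std::system() blocks; a hanging `docker stop` would also need a
    // timeout guard, otherwise this loop never advances.
    if (std::system(("docker stop " + containerId).c_str()) == 0) {
      return true;
    }

    std::this_thread::sleep_for(backoff);
  }

  return false;  // Escalate: exit the executor? notify the framework?
}
{code}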

h3. Let the framework retry
If {{docker stop}} fails, cancel the kill, possibly restart health checking, 
and send the scheduler a {{TASK_RUNNING}} update.
+ The container is not orphaned.
+ A running healthy container will be treated as one and no actions from 
Marathon will be necessary (given health checks are restarted).
\- {{TASK_RUNNING}} after {{TASK_KILLING}} is probably confusing for framework 
authors.
\- Might require changes in Marathon and other frameworks that rely on docker 
executor.
\- What if the kill was issued not by the failing health check, but by the 
framework? Do we need a mechanism to forcefully kill the container?

Since the docker executor is supposed to delegate all commands to the docker 
daemon, this seems like the least surprising option. If docker is misbehaving 
on an agent, the docker executor does not try to work around it, but relays the 
errors to the framework.

h3. Kill the agent
+ The container is not orphaned.
+ The agent does not try to operate until it (and its executors) can 
communicate with docker properly.
\- Requires non-trivial changes to the code.
\- Might be a surprising and undesirable behaviour to operators, especially if 
there are non-docker-executor workloads on the agent.

h3. Only transition to {{TASK_KILLING}} on successful stop
Pause the (health) checks, run {{docker stop}}, and send a {{TASK_KILLING}} 
update only if the command exits with status code 0; otherwise resume (health) 
checking. Frameworks should then assume that the kill order got lost or failed 
and retry (see the sketch after this list).
+ Should require no changes to properly written frameworks.
+ The container is not orphaned.
+ A running healthy container will be treated as one and no actions from 
Marathon will be necessary.
\- Some frameworks might have to be updated if they don’t retry kills.
\- Since the {{docker stop}} command does not exit until the task finishes 
(imagine a task with a long grace period), this effectively defeats the purpose 
of {{TASK_KILLING}}.
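
A sketch of this option, with hypothetical stand-ins for the executor’s 
status-update and health-check helpers (none of these names exist in Mesos):

{code:cpp}
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-ins for the executor's real helpers.
void pauseHealthChecks()  { std::cout << "health checks paused" << std::endl; }
void resumeHealthChecks() { std::cout << "health checks resumed" << std::endl; }
void sendTaskKilling()    { std::cout << "TASK_KILLING sent" << std::endl; }

void killTask(const std::string& containerId)
{
  pauseHealthChecks();

  // With a long grace period `docker stop` exits only after the task is gone,
  // which is exactly the drawback noted above.
  const int status = std::system(("docker stop " + containerId).c_str());

  if (status == 0) {
    sendTaskKilling();
  } else {
    // Treat the kill as lost or failed; the framework is expected to retry.
    resumeHealthChecks();
  }
}
{code}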


[jira] [Assigned] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-09 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet reassigned MESOS-7586:
-

Assignee: Armand Grillet

> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of Mesos use glog before glog has been initialized. This leads to 
> messages like:
> “WARNING: Logging before InitGoogleLogging() is written to STDERR”
> Also, messages logged via glog before logging is initialized might not end up 
> in the log directory.
>  
> The solution might be:
> cout/cerr should be used before logging initialization.
> glog should be used after logging initialization.
>  
> Usually, a main function has a pattern like:
> 1. load = flags.load(argc, argv) // Load flags from the command line.
> 2. Check if flags are correct; otherwise print an error message to cerr and 
> exit.
> 3. Check if the user passed the --help flag; if so, print a help message to 
> cout and exit.
> 4. Parse and set up environment variables. If this fails, the EXIT macro is 
> used to print an error message via glog.
> 5. process::initialize()
> 6. logging::initialize()
> 7. ...
>  
> Steps 2 and 3 should use cout/cerr to eliminate any extra information 
> generated by glog, such as the current time, date, and log level.
> It is possible to move step 6 between steps 3 and 4 safely, because 
> logging::initialize() doesn’t depend on process::initialize().
> Some parts of Mesos don’t call logging::initialize(). This should also be 
> fixed.
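
A minimal sketch of the suggested ordering (illustrative only; 
{{logging::initialize()}} is stubbed here, and the flag check is a placeholder 
rather than the real stout flags API):

{code:cpp}
#include <cstdlib>
#include <iostream>
#include <string>

namespace logging {
// Stand-in for logging::initialize(); the real one configures glog.
void initialize(const std::string& /*argv0*/) { /* google::InitGoogleLogging(...) */ }
} // namespace logging

int main(int argc, char** argv)
{
  // Steps 1-2: load and validate flags. Logging is not initialized yet,
  // so errors go to cerr, not LOG(ERROR) or the EXIT macro.
  if (argc < 2) {  // placeholder for `flags.load(argc, argv)` failing
    std::cerr << "Failed to load flags" << std::endl;
    return EXIT_FAILURE;
  }

  // Step 3: --help goes to cout, again free of glog's time/date/level prefix.
  if (std::string(argv[1]) == "--help") {
    std::cout << "Usage: example [flags]" << std::endl;
    return EXIT_SUCCESS;
  }

  // Step 6, moved up: initialize logging before anything that may use glog,
  // e.g. the EXIT macro in step 4 (environment setup).
  logging::initialize(argv[0]);

  // Steps 4, 5, 7: environment setup, process::initialize(), and so on.
  return EXIT_SUCCESS;
}
{code}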



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7871) Agent fails assertion during request to '/state'

2017-08-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7871:


Assignee: Andrei Budnik  (was: Greg Mann)

https://reviews.apache.org/r/61524/

> Agent fails assertion during request to '/state'
> 
>
> Key: MESOS-7871
> URL: https://issues.apache.org/jira/browse/MESOS-7871
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>  Labels: mesosphere
>
> While processing requests to {{/state}}, the Mesos agent calls 
> {{Framework::allocatedResources()}}, which in turn calls 
> {{Slave::getExecutorInfo()}} on executors associated with the framework's 
> pending tasks.
> In the case of tasks launched as part of task groups, this leads to the 
> failure of the assertion 
> [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985].
>  This means that the check will fail if the agent processes a request to 
> {{/state}} at a time when it has pending tasks launched as part of a task 
> group.
> This assertion should be removed since this helper function is now used with 
> task groups.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5078) Document TaskStatus reasons

2017-08-09 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5078:
---
Shepherd: Alexander Rukletsov

> Document TaskStatus reasons
> ---
>
> Key: MESOS-5078
> URL: https://issues.apache.org/jira/browse/MESOS-5078
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Benno Evers
>  Labels: documentation, mesosphere, newbie++
>
> We should document the possible {{reason}} values that can be found in the 
> {{TaskStatus}} message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.

2017-08-09 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6743:
---
Sprint: Mesosphere Sprint 61

> Docker executor hangs forever if `docker stop` fails.
> -
>
> Key: MESOS-6743
> URL: https://issues.apache.org/jira/browse/MESOS-6743
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.1, 1.1.0, 1.2.1, 1.3.0
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: mesosphere, reliability
>
> If {{docker stop}} finishes with an error status, the executor should catch 
> this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry 
> and what to do if {{docker stop}} continues to fail.
> 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. 
> However, in this case it is unclear what status updates we should send: 
> {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill 
> a task? or set a specific reason in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is 
> killed or notify the framework and the operator that the container may still 
> be running.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)