[jira] [Commented] (MESOS-4969) improve overlayfs detection
[ https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121035#comment-16121035 ]

James Peach commented on MESOS-4969:
------------------------------------

If the filesystem is built into the kernel it will appear in {{/proc/filesystems}}.

> improve overlayfs detection
> ---------------------------
>
>                 Key: MESOS-4969
>                 URL: https://issues.apache.org/jira/browse/MESOS-4969
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, storage
>            Reporter: James Peach
>            Priority: Minor
>
> On my Fedora 23, overlayfs is a module that is not loaded by default
> (attempting to mount an overlayfs automatically triggers the module loading).
> However, {{mesos-slave}} won't start until I manually load the module, since
> it is not listed in {{/proc/filesystems}} until it is loaded.
> It would be nice if there was a more reliable way to determine overlayfs
> support.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
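A self-contained sketch of the {{/proc/filesystems}} check discussed above. {{listedInProcFilesystems}} is a hypothetical helper written for illustration, not actual Mesos code, and it deliberately shares the limitation the report describes: a filesystem compiled as a module is simply absent from the file until the module is loaded.

```cpp
#include <sstream>
#include <string>

// Returns true if `fsname` is listed in the given contents of
// /proc/filesystems. Each line looks like "nodev\t<name>" for
// virtual filesystems or "\t<name>" for block-device filesystems.
bool listedInProcFilesystems(
    const std::string& contents,
    const std::string& fsname)
{
  std::istringstream in(contents);
  std::string line;
  while (std::getline(in, line)) {
    // Strip the optional "nodev" column and the tab separator.
    const size_t tab = line.rfind('\t');
    const std::string name =
      (tab == std::string::npos) ? line : line.substr(tab + 1);
    if (name == fsname) {
      return true;
    }
  }
  return false;
}
```

In practice a caller would read the real file (e.g. via {{std::ifstream("/proc/filesystems")}}) and pass its contents in; keeping the parser pure makes the behavior easy to verify.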
[jira] [Commented] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120954#comment-16120954 ]

Till Toenshoff commented on MESOS-7874:
---------------------------------------

See also https://issues.apache.org/jira/browse/MESOS-7875 for a hacky example.

> Provide a consistent non-blocking preLaunch hook
> ------------------------------------------------
>
>                 Key: MESOS-7874
>                 URL: https://issues.apache.org/jira/browse/MESOS-7874
>             Project: Mesos
>          Issue Type: Improvement
>          Components: modules
>            Reporter: Zhitao Li
>            Assignee: Zhitao Li
>              Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own
> secret management system, and this hook needs to work under both
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom
> executor}} and {{command executor}}, with proper access to {{TaskInfo}}
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp |
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, but it
> has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess that's
> because people can implement an {{isolator}} instead? However, that creates
> extra work for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own
> problems:
> 1. Errors are silently swallowed, so a module cannot stop the task launch
> sequence;
> 2. It is blocking, which means we cannot wait for the result of a subprocess
> or RPC.
> I'm inclined to fix the two problems on
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but I am open to other
> suggestions.
[jira] [Comment Edited] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120921#comment-16120921 ]

Till Toenshoff edited comment on MESOS-7874 at 8/10/17 1:28 AM:
----------------------------------------------------------------

We do indeed split this work between hooks and isolators, depending on the containerizer used.

The docker containerizer will, at some hopefully not too distant point in the future, be deprecated in favor of the mesos containerizer. The same is true for the command executor - we are working towards deprecating it in favor of the default executor.

Introducing new hooks is generally something that we try to avoid if possible. We do however have a relatively low barrier for changing the signature of a hook - so changing a formerly blocking hook into a non-blocking one ({{Future<...>}}) is something we have done before.

All this said, it seems we should aim for your second option by using both an isolator and a hook.

  was: (the same comment, with a typo in the {{Future<...>}} markup)
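The blocking-to-non-blocking signature change mentioned above can be sketched in a self-contained way. Mesos hooks actually use the libprocess {{Future}} type; the sketch below substitutes {{std::future}} and an invented label-decorator shape purely to illustrate the difference, so none of these names are real Mesos APIs.

```cpp
#include <future>
#include <string>
#include <vector>

// Stand-in for the labels a decorator would inspect or extend.
using Labels = std::vector<std::string>;

// Blocking style: the agent must wait inline for the result, so the
// hook cannot await a subprocess or RPC without stalling the caller.
Labels blockingDecorator(const Labels& labels)
{
  Labels result = labels;
  result.push_back("decorated");
  return result;
}

// Non-blocking style: the hook returns a future immediately; the
// caller continues and collects the result when it is ready. Errors
// can be propagated through the future instead of being swallowed.
std::future<Labels> nonBlockingDecorator(const Labels& labels)
{
  return std::async(std::launch::async, [labels]() {
    // A real module could wait here on an RPC to a secret store.
    Labels result = labels;
    result.push_back("decorated");
    return result;
  });
}
```

The point of the signature change is exactly this: only the return type moves from a value to a future, which is why such hook changes have a "relatively low barrier".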
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff updated MESOS-7874:
----------------------------------
            Shepherd: Till Toenshoff  (was: Till)
[jira] [Commented] (MESOS-7875) Consider offering isolator & hook modules as examples.
[ https://issues.apache.org/jira/browse/MESOS-7875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120911#comment-16120911 ]

Till Toenshoff commented on MESOS-7875:
---------------------------------------

The following quick hack might be a good start: https://github.com/tillt/module_example

> Consider offering isolator & hook modules as examples.
> ------------------------------------------------------
>
>                 Key: MESOS-7875
>                 URL: https://issues.apache.org/jira/browse/MESOS-7875
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Till Toenshoff
>            Priority: Minor
>              Labels: example, modules
>
> To enhance the information flow for identical tasks on both the mesos and the
> docker containerizer, developers have to implement a hook module for the
> docker containerizer and an isolator module for the mesos containerizer.
> We should consider offering examples that do just this.
[jira] [Created] (MESOS-7875) Consider offering isolator & hook modules as examples.
Till Toenshoff created MESOS-7875:
-------------------------------------

             Summary: Consider offering isolator & hook modules as examples.
                 Key: MESOS-7875
                 URL: https://issues.apache.org/jira/browse/MESOS-7875
             Project: Mesos
          Issue Type: Improvement
            Reporter: Till Toenshoff
            Priority: Minor


To enhance the information flow for identical tasks on both the mesos and the docker containerizer, developers have to implement a hook module for the docker containerizer and an isolator module for the mesos containerizer.

We should consider offering examples that do just this.
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhitao Li updated MESOS-7874:
-----------------------------
    Description:
Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, but it has a couple of problems:
1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess that's because people can implement an {{isolator}} instead? However, that creates extra work for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so a module cannot stop the task launch sequence;
2. It is blocking, which means we cannot wait for the result of a subprocess or RPC.

I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but I am open to other suggestions.

  was: (the same text, except the second option read {{slaveLaunchTaskLabelDecorator}})
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhitao Li updated MESOS-7874:
-----------------------------
    Description:
Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, but it has a couple of problems:
1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess that's because people can implement an {{isolator}} instead? However, that creates extra work for module authors and operators.

The other option is {{slaveLaunchTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so a module cannot stop the task launch sequence;
2. It is blocking, which means we cannot wait for the result of a subprocess or RPC.

I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but I am open to other suggestions.

  was: (the same text, except the second option read {{masterLaunchTaskLabelDecorator}})
[jira] [Created] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
Zhitao Li created MESOS-7874:
--------------------------------

             Summary: Provide a consistent non-blocking preLaunch hook
                 Key: MESOS-7874
                 URL: https://issues.apache.org/jira/browse/MESOS-7874
             Project: Mesos
          Issue Type: Improvement
          Components: modules
            Reporter: Zhitao Li
            Assignee: Zhitao Li


Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, but it has a couple of problems:
1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess that's because people can implement an {{isolator}} instead? However, that creates extra work for module authors and operators.

The other option is {{masterLaunchTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so a module cannot stop the task launch sequence;
2. It is blocking, which means we cannot wait for the result of a subprocess or RPC.

I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but I am open to other suggestions.
[jira] [Commented] (MESOS-7871) Agent fails assertion during request to '/state'
[ https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120820#comment-16120820 ]

Greg Mann commented on MESOS-7871:
----------------------------------

{code}
commit db8d097c9565e9b6f60531f9eb3f993a6c60fd72
Author: Greg Mann
Date:   Wed Aug 9 10:00:46 2017 -0700

    Added a test to verify the fix for a failed agent assertion.

    This patch adds 'SlaveTest.GetStateTaskGroupPending', which confirms
    the fix for MESOS-7871. The test verifies that requests to the
    agent's '/state' endpoint are successful when there are pending
    tasks on the agent which were launched as part of a task group.

    Review: https://reviews.apache.org/r/61534
{code}
{code}
commit 4f4807394944d23d3a6f79249ce49e2494a88350
Author: Andrei Budnik
Date:   Wed Aug 9 11:06:40 2017 -0700

    Moved task validation from `getExecutorInfo` to `runTask` on agent.

    Previously, `getExecutorInfo` was called only in `runTask`, so it
    asserted the invariant that a task should have either CommandInfo
    or ExecutorInfo set, but not both. This is true for individual
    tasks, but it is not necessarily true for tasks which are part of
    a task group, since the master injects the task group's
    ExecutorInfo. Now `getExecutorInfo` is also called to calculate
    the allocated resources of tasks which might be part of a task
    group, which could violate this invariant, so the assertion has
    been moved.

    Review: https://reviews.apache.org/r/61524/
{code}

> Agent fails assertion during request to '/state'
> ------------------------------------------------
>
>                 Key: MESOS-7871
>                 URL: https://issues.apache.org/jira/browse/MESOS-7871
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Greg Mann
>            Assignee: Andrei Budnik
>              Labels: mesosphere
>             Fix For: 1.4.0
>
> While processing requests to {{/state}}, the Mesos agent calls
> {{Framework::allocatedResources()}}, which in turn calls
> {{Slave::getExecutorInfo()}} on executors associated with the framework's
> pending tasks.
> In the case of tasks launched as part of task groups, this leads to the
> failure of the assertion
> [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985].
> This means that the check will fail if the agent processes a request to
> {{/state}} at a time when it has pending tasks launched as part of a task
> group.
> This assertion should be removed since this helper function is now used with
> task groups.
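The invariant described in the second commit can be sketched standalone. The struct below is a simplified stand-in for the Mesos protobufs (which expose {{has_command()}}/{{has_executor()}} accessors); the function name is hypothetical.

```cpp
// Simplified stand-in for a TaskInfo: in Mesos, CommandInfo and
// ExecutorInfo are optional protobuf sub-messages.
struct Task
{
  bool hasCommand;
  bool hasExecutor;
};

// The invariant asserted for individually launched tasks: exactly one
// of CommandInfo or ExecutorInfo must be set. Tasks in a task group
// may violate this locally, because the master injects the group's
// ExecutorInfo -- which is why the assertion was moved out of the
// shared `getExecutorInfo` helper and into the `runTask` path.
bool validIndividualTaskLaunch(const Task& task)
{
  return task.hasCommand != task.hasExecutor; // exactly one set
}
```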
[jira] [Updated] (MESOS-7815) Add gauge for master event processing time
[ https://issues.apache.org/jira/browse/MESOS-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-7815:
-----------------------------------
    Labels: mesosphere metrics observability  (was: mesosphere metrics reliability)

> Add gauge for master event processing time
> ------------------------------------------
>
>                 Key: MESOS-7815
>                 URL: https://issues.apache.org/jira/browse/MESOS-7815
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Benjamin Bannier
>              Labels: mesosphere, metrics, observability
>
> To diagnose cases where, e.g., the master is backlogged, looking at just
> {{event_queue_messages}} only tells us the size of the queue; determining
> whether growth is due to a higher message arrival rate or slower processing
> requires complicated inference from other metrics.
> We should provide metrics that characterize the time it takes to process
> messages in the queue, ideally with statistics over some window. This would
> allow better identification of slow requests.
> We should also consider ways of characterizing the arrival rate via some
> metric with statistics.
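For the windowed processing-time statistics the ticket proposes, one minimal approach is to keep the most recent per-message durations and expose a summary over them. This is a hedged sketch under assumed names; the real implementation would use the libprocess metrics primitives, whose {{Timer}}/{{Gauge}} types differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Keeps the most recent `capacity` processing durations (in
// microseconds) and reports a simple windowed statistic.
class WindowedTimer
{
public:
  explicit WindowedTimer(size_t capacity) : capacity_(capacity) {}

  void record(long micros)
  {
    if (samples_.size() == capacity_) {
      samples_.pop_front(); // evict the oldest sample
    }
    samples_.push_back(micros);
  }

  // Maximum over the window: a quick indicator of slow requests.
  long max() const
  {
    return samples_.empty()
      ? 0
      : *std::max_element(samples_.begin(), samples_.end());
  }

  size_t count() const { return samples_.size(); }

private:
  const size_t capacity_;
  std::deque<long> samples_;
};
```

Percentiles over the same deque would follow the same pattern; the key design point is that the window bounds memory while still distinguishing "slow processing" from "fast processing of a long queue".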
[jira] [Updated] (MESOS-1719) Master should persist active frameworks information
[ https://issues.apache.org/jira/browse/MESOS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1719:
-----------------------------------
    Labels: mesosphere reliability  (was: mesosphere)

> Master should persist active frameworks information
> ---------------------------------------------------
>
>                 Key: MESOS-1719
>                 URL: https://issues.apache.org/jira/browse/MESOS-1719
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Vinod Kone
>            Assignee: Yongqiao Wang
>              Labels: mesosphere, reliability
>
> https://issues.apache.org/jira/browse/MESOS-1219 disallows completed
> frameworks from re-registering with the same framework id, as long as the
> master doesn't fail over.
> This ticket tracks the work to make that hold across a master failover, using
> the registrar.
> There are some open questions that need to be addressed:
> --> Should the registry contain framework ids only, or framework infos?
> For disallowing completed frameworks from re-registering, persisting
> framework ids is enough. But if, in the future, we want to disallow
> frameworks from re-registering when some part of the framework info has
> changed, then we need to persist the info too.
> --> How should the framework info be updated?
> Currently frameworks are allowed to update the framework info while
> re-registering, but it only takes effect on the master when the master fails
> over, and on the slave when the slave fails over. How should this change when
> we persist the framework info?
[jira] [Updated] (MESOS-7747) Improve metrics around active subscribers.
[ https://issues.apache.org/jira/browse/MESOS-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-7747:
-----------------------------------
    Labels: mesosphere metrics observability  (was: mesosphere metrics reliability)

> Improve metrics around active subscribers.
> ------------------------------------------
>
>                 Key: MESOS-7747
>                 URL: https://issues.apache.org/jira/browse/MESOS-7747
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>              Labels: mesosphere, metrics, observability
>
> Active subscribers to, e.g., the Mesos streaming API may influence Mesos
> master performance. To improve triaging and gain a better understanding of
> master workload, we should add metrics that track active subscribers, send
> queue sizes, and so on.
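The metrics this ticket asks for are essentially gauges over per-connection bookkeeping. A minimal sketch of that bookkeeping follows; the class and method names are assumptions for illustration, not the Mesos master's actual code.

```cpp
#include <cstddef>
#include <map>

// Tracks active streaming-API subscribers and their send-queue sizes,
// the two signals this ticket proposes surfacing as metrics.
class SubscriberMetrics
{
public:
  void subscribed(int id) { queueSizes_[id] = 0; }
  void unsubscribed(int id) { queueSizes_.erase(id); }
  void enqueued(int id) { ++queueSizes_[id]; }

  void sent(int id)
  {
    auto it = queueSizes_.find(id);
    if (it != queueSizes_.end() && it->second > 0) {
      --it->second;
    }
  }

  // Gauge: how many subscribers are currently connected.
  size_t activeSubscribers() const { return queueSizes_.size(); }

  // Gauge: total events buffered across all send queues.
  size_t totalQueued() const
  {
    size_t total = 0;
    for (const auto& entry : queueSizes_) {
      total += entry.second;
    }
    return total;
  }

private:
  std::map<int, size_t> queueSizes_;
};
```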
[jira] [Updated] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
[ https://issues.apache.org/jira/browse/MESOS-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Goel updated MESOS-7873:
-------------------------------
    Affects Version/s: 1.4.0

> Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
> -------------------------------------------------------------------------
>
>                 Key: MESOS-7873
>                 URL: https://issues.apache.org/jira/browse/MESOS-7873
>             Project: Mesos
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 1.4.0
>            Reporter: Deepak Goel
>            Assignee: Deepak Goel
>
> The Mesos "state" endpoint doesn't expose
> "ExecutorInfo.ContainerInfo.NetworkInfo", which prevents any service running
> on Mesos from making use of the port mapping information in the NetworkInfo.
[jira] [Updated] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
[ https://issues.apache.org/jira/browse/MESOS-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-7873:
------------------------------
    Component/s: HTTP API
[jira] [Updated] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
[ https://issues.apache.org/jira/browse/MESOS-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-7873:
------------------------------
    Component/s:     (was: HTTP API)
[jira] [Created] (MESOS-7873) Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
Deepak Goel created MESOS-7873:
----------------------------------

             Summary: Expose `ExecutorInfo.ContainerInfo.NetworkInfo` in Mesos `state` endpoint
                 Key: MESOS-7873
                 URL: https://issues.apache.org/jira/browse/MESOS-7873
             Project: Mesos
          Issue Type: Bug
          Components: network
            Reporter: Deepak Goel
            Assignee: Deepak Goel


The Mesos "state" endpoint doesn't expose "ExecutorInfo.ContainerInfo.NetworkInfo", which prevents any service running on Mesos from making use of the port mapping information in the NetworkInfo.
[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-7872: -- Affects Version/s: 1.4.0 > Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Till Toenshoff > Labels: framework, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
Till Toenshoff created MESOS-7872: - Summary: Scheduler hang when registration fails (due to bad role) Key: MESOS-7872 URL: https://issues.apache.org/jira/browse/MESOS-7872 Project: Mesos Issue Type: Bug Reporter: Till Toenshoff I'm finding that if framework registration fails, the mesos driver client will hang indefinitely with the following output: {noformat} I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' is not a valid role: Role '/test/role/slashes' cannot start with a slash' I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework {noformat} I'd have expected one or both of the following: - SchedulerDriver.run() should have exited with a failed Proto.Status of some form - Scheduler.error() should have been invoked when the "Got error" occurred Steps to reproduce: - Launch a scheduler instance, have it register with a known-bad framework info. In this case a role containing slashes was used - Observe that the scheduler continues in a TASK_RUNNING state despite the failed registration. From all appearances it looks like the Scheduler implementation isn't invoked at all I'd guess that because this failure happens before framework registration, there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-7675) Isolate network ports.
[ https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072946#comment-16072946 ] James Peach edited comment on MESOS-7675 at 8/9/17 8:47 PM: Updated review chain: | [r/61536|https://reviews.apache.org/r/61536] | Added network ports isolator socket utilities tests. | | [r/60593|https://reviews.apache.org/r/60593] | Test the `network/ports` isolator recovery. | | [r/60765|https://reviews.apache.org/r/60765] | Added basic `network/ports` isolator tests. | | [r/60903|https://reviews.apache.org/r/60903] | Added the `network/ports` isolator to the Mesos containerizer. | | [r/60766|https://reviews.apache.org/r/60766] | Ignored containers that join CNI networks. | | [r/60591|https://reviews.apache.org/r/60591] | Optionally isolate only the agent network ports. | | [r/60592|https://reviews.apache.org/r/60592] | Configure the `network/ports` isolator watch interval. | | [r/60496|https://reviews.apache.org/r/60496] | Added socket checking to the network ports isolator. | | [r/60495|https://reviews.apache.org/r/60495] | Added network ports isolator listen socket utilities. | | [r/61538|https://reviews.apache.org/r/61538] | Used common port range interval code in the port_mapping isolator. | | [r/60492|https://reviews.apache.org/r/60492] | Added a `network/ports` isolator skeleton. | | [r/60902|https://reviews.apache.org/r/60902] | Moved the libnl3 configure checks into a macro. | | [r/60836|https://reviews.apache.org/r/60836] | Added IntervalSet to Ranges conversion helper declarations. | | [r/60901|https://reviews.apache.org/r/60901] | Use a consistent preprocessor check for ENABLE_PORT_MAPPING_ISOLATOR. | | [r/60764|https://reviews.apache.org/r/60764] | Refactored isolator dependency checking. | | [r/60494|https://reviews.apache.org/r/60494] | Exposed LinuxLauncher cgroups helper. | | [r/60493|https://reviews.apache.org/r/60493] | Removed diagnostic socket IPv4 assumptions. 
| | [r/60491|https://reviews.apache.org/r/60491] | Captured the inode when scanning for sockets. | | [r/60594|https://reviews.apache.org/r/60594] | Added a `network/ports` isolator nested container test. | was (Author: jamespeach): Updated review chain: | [r/60765|https://reviews.apache.org/r/60765] | Add basic `network/ports` isolator tests. | | [r/60766|https://reviews.apache.org/r/60766] | Ignore containers that join CNI networks. | | [r/60594|https://reviews.apache.org/r/60594] | Add a `network/ports` isolator nested container test. | | [r/60593|https://reviews.apache.org/r/60593] | Test the `network/ports` isolator recovery. | | [r/60592|https://reviews.apache.org/r/60592] | Configure the `network/ports` isolator watch interval. | | [r/60591|https://reviews.apache.org/r/60591] | Optionally isolate only the agent network ports. | | [r/60496|https://reviews.apache.org/r/60496] | Add socket checking to the network ports isolator. | | [r/60495|https://reviews.apache.org/r/60495] | Network ports isolator listen socket utilities. | | [r/60767|https://reviews.apache.org/r/60767] | Allow `network/ports` to co-exist with other network isolators. | | [r/60764|https://reviews.apache.org/r/60764] | Refactor isolator dependency checking. | | [r/60492|https://reviews.apache.org/r/60492] | Add network/ports isolator skeleton. | | [r/60494|https://reviews.apache.org/r/60494] | Expose LinuxLauncher cgroups helper. | | [r/60493|https://reviews.apache.org/r/60493] | Remove diagnostic socket IPv4 assumptions. | | [r/60491|https://reviews.apache.org/r/60491] | Capture the inode when scanning for sockets. | > Isolate network ports. > -- > > Key: MESOS-7675 > URL: https://issues.apache.org/jira/browse/MESOS-7675 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > If a task uses network ports, there is no isolator that can enforce that it > only listens on the ports that it has resources for. 
Implement a ports > isolator that can limit tasks to listen only on allocated TCP ports. > Roughly, the algorithm for this follows what standard tools like {{lsof}} and > {{ss}} do. > * Find all the listening TCP sockets (using netlink) > * Index the sockets by their inode (from the netlink information) > * Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} > links) > * For each open socket, check whether its inode (given in the link target) is in > the set of listen sockets that we scanned > * If the socket is a listening socket and the corresponding PID is in the > task, send a resource limitation for the task > Matching pids to tasks depends on using cgroup isolation, otherwise we would > have to build a full process tree, which would be nice to avoid. > Scanning all th
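The algorithm above can be sketched in Python as a simplified stand-in. The real isolator uses netlink; this sketch reads {{/proc/net/tcp}} instead, assuming its standard layout (socket state {{0A}} is {{TCP_LISTEN}}, the inode is the tenth whitespace-separated field):

```python
import os
import re

def listening_inodes(tcp_table):
    """Parse /proc/net/tcp-style text and return inodes of LISTEN sockets."""
    inodes = set()
    for line in tcp_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 9 and fields[3] == "0A":  # 0A == TCP_LISTEN
            inodes.add(int(fields[9]))
    return inodes

SOCKET_LINK = re.compile(r"socket:\[(\d+)\]")

def socket_inode(link_target):
    """Extract the inode from an fd symlink target like 'socket:[12345]'."""
    m = SOCKET_LINK.match(link_target)
    return int(m.group(1)) if m else None

def pids_with_listeners(listen_inodes, proc="/proc"):
    """Scan /proc/<pid>/fd/* links and report pids holding a listening socket."""
    offenders = set()
    for pid in filter(str.isdigit, os.listdir(proc)):
        fd_dir = os.path.join(proc, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or no permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if socket_inode(target) in listen_inodes:
                offenders.add(int(pid))
    return offenders
```

Mapping an offending pid back to a container would then go through the cgroup membership, as the comment notes; that step is omitted here.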
[jira] [Commented] (MESOS-4969) improve overlayfs detection
[ https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120586#comment-16120586 ] Aaron Wood commented on MESOS-4969: --- Another thought: we are building/using our own kernel for our nodes, which will have this module built in. Doesn't that mean this /proc check will fail? > improve overlayfs detection > --- > > Key: MESOS-4969 > URL: https://issues.apache.org/jira/browse/MESOS-4969 > Project: Mesos > Issue Type: Bug > Components: containerization, storage >Reporter: James Peach >Priority: Minor > > On my Fedora 23, overlayfs is a module that is not loaded by default > (attempting to mount an overlayfs automatically triggers the module loading). > However {{mesos-slave}} won't start until I manually load the module since it > is not listed in {{/proc/filesystems}} until it is loaded. > It would be nice if there was a more reliable way to determine overlayfs > support. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-4969) improve overlayfs detection
[ https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120582#comment-16120582 ] Aaron Wood commented on MESOS-4969: --- Why not let the OS load it on demand instead of prematurely checking and failing right away? That way no one needs to explicitly enable the module and add it to /etc/modules. > improve overlayfs detection > --- > > Key: MESOS-4969 > URL: https://issues.apache.org/jira/browse/MESOS-4969 > Project: Mesos > Issue Type: Bug > Components: containerization, storage >Reporter: James Peach >Priority: Minor > > On my Fedora 23, overlayfs is a module that is not loaded by default > (attempting to mount an overlayfs automatically triggers the module loading). > However {{mesos-slave}} won't start until I manually load the module since it > is not listed in {{/proc/filesystems}} until it is loaded. > It would be nice if there was a more reliable way to determine overlayfs > support. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
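For reference, the {{/proc/filesystems}} check under discussion can be sketched as follows. This is an illustrative sketch, not Mesos code; the {{overlay}}/{{overlayfs}} naming split is an assumption about different kernel versions. As the ticket observes, a built-in or already-loaded module appears in this file, while an unloaded module does not:

```python
def supported_filesystems(proc_filesystems_text):
    """Parse /proc/filesystems content.  Lines look like 'nodev\tsysfs'
    or '\text4'; the filesystem name is the last field."""
    return {line.split()[-1]
            for line in proc_filesystems_text.splitlines()
            if line.strip()}

def overlayfs_listed(proc_filesystems_text):
    # In-tree builds register the filesystem as 'overlay'; some older
    # out-of-tree builds used the name 'overlayfs'.
    return bool({"overlay", "overlayfs"}
                & supported_filesystems(proc_filesystems_text))
```

A more forgiving probe might fall back to running {{modprobe overlay}} or simply attempting a mount before concluding that the filesystem is unsupported, which addresses the module-not-yet-loaded case described in this ticket.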
[jira] [Updated] (MESOS-7814) Improve the test frameworks.
[ https://issues.apache.org/jira/browse/MESOS-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armand Grillet updated MESOS-7814: -- Sprint: Mesosphere Sprint 61 > Improve the test frameworks. > > > Key: MESOS-7814 > URL: https://issues.apache.org/jira/browse/MESOS-7814 > Project: Mesos > Issue Type: Improvement > Components: framework >Reporter: Armand Grillet >Assignee: Armand Grillet >Priority: Minor > Labels: mesosphere, newbie > > These improvements include three main points: > * Adding a {{name}} flag to certain frameworks to distinguish between > instances. > * Cleaning up the code style of the frameworks. > * For frameworks with custom executors, such as balloon framework, adding a > {{executor_extra_uris}} flag containing URIs that will be passed to the > {{command_info}} of the executor. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement
[ https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120473#comment-16120473 ] Yan Xu commented on MESOS-7714: --- [~mcypark] did you get a chance to work on this? > Fix agent downgrade for reservation refinement > -- > > Key: MESOS-7714 > URL: https://issues.apache.org/jira/browse/MESOS-7714 > Project: Mesos > Issue Type: Bug >Reporter: Michael Park >Assignee: Michael Park >Priority: Blocker > > The agent code only partially supports downgrading of an agent correctly. > The checkpointed resources are done correctly, but the resources within > the {{SlaveInfo}} message as well as tasks and executors also need to be > downgraded > correctly and converted back on recovery. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-5078) Document TaskStatus reasons
[ https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-5078: --- Sprint: Mesosphere Sprint 61 > Document TaskStatus reasons > --- > > Key: MESOS-5078 > URL: https://issues.apache.org/jira/browse/MESOS-5078 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Benno Evers > Labels: documentation, mesosphere, newbie++ > > We should document the possible {{reason}} values that can be found in the > {{TaskStatus}} message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7871) Agent fails assertion during request to '/state'
[ https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120422#comment-16120422 ] Greg Mann commented on MESOS-7871: -- Test and comment updates: https://reviews.apache.org/r/61534/ https://reviews.apache.org/r/61535/ > Agent fails assertion during request to '/state' > > > Key: MESOS-7871 > URL: https://issues.apache.org/jira/browse/MESOS-7871 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Greg Mann >Assignee: Andrei Budnik > Labels: mesosphere > > While processing requests to {{/state}}, the Mesos agent calls > {{Framework::allocatedResources()}}, which in turn calls > {{Slave::getExecutorInfo()}} on executors associated with the framework's > pending tasks. > In the case of tasks launched as part of task groups, this leads to the > failure of the assertion > [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985]. > This means that the check will fail if the agent processes a request to > {{/state}} at a time when it has pending tasks launched as part of a task > group. > This assertion should be removed since this helper function is now used with > task groups. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7869) Build fails with `--disable-zlib` or `--with-zlib=DIR`
[ https://issues.apache.org/jira/browse/MESOS-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7869: --- Story Points: 1 (was: 2) > Build fails with `--disable-zlib` or `--with-zlib=DIR` > -- > > Key: MESOS-7869 > URL: https://issues.apache.org/jira/browse/MESOS-7869 > Project: Mesos > Issue Type: Bug >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Fix For: 1.4.0 > > > ZLib has been a required library for Mesos and libprocess, so > {{--disable-zlib}} no longer works and should be removed. > Also, when {{--with-zlib=DIR}} is specified, the protobuf build will fail > because it does not support specifying a customized zlib path through > {{--with-zlib}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7023) IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky
[ https://issues.apache.org/jira/browse/MESOS-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7023: -- Affects Version/s: 1.2.2 > IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky > --- > > Key: MESOS-7023 > URL: https://issues.apache.org/jira/browse/MESOS-7023 > Project: Mesos > Issue Type: Bug > Components: agent, test >Affects Versions: 1.2.2 > Environment: ASF CI, cmake, gcc, Ubuntu 14.04, without libevent/SSL >Reporter: Greg Mann >Assignee: Kevin Klues > Labels: debugging, flaky > Attachments: IOSwitchboardTest. > RecoverThenKillSwitchboardContainerDestroyed.txt > > > This was observed on ASF CI: > {code} > /mesos/src/tests/containerizer/io_switchboard_tests.cpp:1052: Failure > Value of: statusFailed->reason() > Actual: 1 > Expected: TaskStatus::REASON_IO_SWITCHBOARD_EXITED > Which is: 27 > {code} > Find full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky
[ https://issues.apache.org/jira/browse/MESOS-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6135: -- Affects Version/s: 1.2.2 > ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky > -- > > Key: MESOS-6135 > URL: https://issues.apache.org/jira/browse/MESOS-6135 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1, 1.2.2 > Environment: Ubuntu 14, libev, non-SSL >Reporter: Greg Mann > Labels: logging, mesosphere > > Observed in our internal CI: > {code} > [19:53:51] : [Step 10/10] [ RUN ] > ContainerLoggerTest.LOGROTATE_RotateInSandbox > [19:53:51]W: [Step 10/10] I0906 19:53:51.460055 23729 cluster.cpp:157] > Creating default 'local' authorizer > [19:53:51]W: [Step 10/10] I0906 19:53:51.468907 23729 leveldb.cpp:174] > Opened db in 8.730166ms > [19:53:51]W: [Step 10/10] I0906 19:53:51.472470 23729 leveldb.cpp:181] > Compacted db in 3.544028ms > [19:53:51]W: [Step 10/10] I0906 19:53:51.472491 23729 leveldb.cpp:196] > Created db iterator in 3678ns > [19:53:51]W: [Step 10/10] I0906 19:53:51.472496 23729 leveldb.cpp:202] > Seeked to beginning of db in 673ns > [19:53:51]W: [Step 10/10] I0906 19:53:51.472499 23729 leveldb.cpp:271] > Iterated through 0 keys in the db in 256ns > [19:53:51]W: [Step 10/10] I0906 19:53:51.472510 23729 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [19:53:51]W: [Step 10/10] I0906 19:53:51.472709 23744 recover.cpp:451] > Starting replica recovery > [19:53:51]W: [Step 10/10] I0906 19:53:51.472820 23748 recover.cpp:477] > Replica is in EMPTY status > [19:53:51]W: [Step 10/10] I0906 19:53:51.473059 23748 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > __req_res__(177)@172.30.2.89:44578 > [19:53:51]W: [Step 10/10] I0906 19:53:51.473146 23746 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [19:53:51]W: [Step 10/10] I0906 19:53:51.473234 23745 recover.cpp:568] > Updating 
replica status to STARTING > [19:53:51]W: [Step 10/10] I0906 19:53:51.473629 23747 master.cpp:379] > Master 6d1b2727-f42d-446b-b2f8-a9f7e7667340 (ip-172-30-2-89.mesosphere.io) > started on 172.30.2.89:44578 > [19:53:51]W: [Step 10/10] I0906 19:53:51.473644 23747 master.cpp:381] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ceLmd7/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" > --registry_strict="true" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/ceLmd7/master" --zk_session_timeout="10secs" > [19:53:51]W: [Step 10/10] I0906 19:53:51.473832 23747 master.cpp:431] > Master only allowing authenticated frameworks to register > [19:53:51]W: [Step 10/10] I0906 19:53:51.473844 23747 master.cpp:445] > Master only allowing authenticated agents to register > [19:53:51]W: [Step 10/10] I0906 19:53:51.473850 23747 master.cpp:458] > Master only allowing authenticated HTTP frameworks to register > [19:53:51]W: [Step 10/10] I0906 19:53:51.473856 23747 credentials.hpp:37] > Loading credentials for authentication from '/tmp/ceLmd7/credentials' > [19:53:51]W: [Step 10/10] I0906 
19:53:51.473975 23747 master.cpp:503] Using > default 'crammd5' authenticator > [19:53:51]W: [Step 10/10] I0906 19:53:51.474028 23747 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [19:53:51]W: [Step 10/10] I0906 19:53:51.474097 23747 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [19:53:51]W: [Step 10/10] I0906 19:53:51.474161 23747 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [19:53:51]W: [Step 10/10] I0906 19:53:51.474242 23747 master.cpp:583] > Authorization enabled > [19:53:51]W: [Step 10/10] I0906 19:53:51.474308 23744 hierarchical.cpp:149] > Initialized hierarchical allo
[jira] [Commented] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.
[ https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120104#comment-16120104 ] Alexander Rukletsov commented on MESOS-6743: In case of an error, the docker daemon and the container itself might be fine and operating normally; it is the communication between Mesos and the daemon that is broken. All docker stop failures are supposed to return a [non-zero exit code|https://github.com/spf13/cobra/blob/9c28e4bbd74e5c3ed7aacbc552b2cab7cfdfe744/cobra/cmd/init.go#L187], even though the docker docs [say nothing|https://docs.docker.com/engine/reference/commandline/stop] about it. It looks like we cannot reliably distinguish why the command fails; errors may originate [in the client|https://github.com/moby/moby/blob/e9cd2fef805c8182b719d489967fb4d1aa34eecd/client/request.go#L41] or [in the daemon|https://github.com/moby/moby/blob/77c9728847358a3ed3581d828fb0753017e1afd3/daemon/stop.go#L44]. Hence the container might or might not have received the signal (the logs we have hint at the latter). In case of hanging (or timing out) commands, the docker daemon is likely malfunctioning, while the container might or might not be fine and operating normally. Here is what we can do. h3. Exit the docker executor + This will trigger a terminal update for the task, allowing Marathon to start a new instance. \- The container might still be OK and running, but unknown to both Mesos and Marathon, which might be a problem for some apps. \- The container becomes orphaned and consumes unallocated resources, until the next agent restart (if {{--docker_kill_orphans}} is set). h3. Forcibly kill the task Call {{os::killtree()}} or {{os::kill()}} on the container pid and then exit the executor. + This will trigger a terminal update for the task, allowing Marathon to start a new instance. + The container is not orphaned. \- Might kill an irrelevant process due to a pid race. 
\- The task's kill policy might be violated since the container is not given enough time to terminate gracefully. This is particularly concerning if the daemon and the container are operating normally. It makes more sense in the timeout case, see e.g. [https://reviews.apache.org/r/44571/]. We can implement our own escalation logic without using docker commands, similar to what we [do for the command executor|https://github.com/apache/mesos/blob/85af46f93d5625006d01bdcf78bba9fa547b3313/src/launcher/executor.cpp#L850-L871]. However, this does not look right to me, since the docker executor is supposed to rely on the docker CLI for task management. h3. Retry + The container is not orphaned. \- Until the task is killed, Marathon sees it as {{TASK_KILLING}} and hence will not start other instances. \- If retrying is not successful after some time (how long?) or some number of attempts (how many?), what shall we do? \- If docker commands are hanging, make sure they are terminated properly. h3. Let the framework retry If {{docker stop}} fails, cancel the kill, maybe restart health checking, and send the scheduler a {{TASK_RUNNING}} update. + The container is not orphaned. + A running healthy container will be treated as one and no actions from Marathon will be necessary (given health checks are restarted). \- {{TASK_RUNNING}} after {{TASK_KILLING}} is probably confusing for framework authors. \- Might require changes in Marathon and other frameworks that rely on the docker executor. \- What if the kill was issued not by the failing health check, but by the framework? Do we need a mechanism to forcefully kill the container? Since the docker executor is supposed to delegate all the commands to the docker daemon, this seems like the least surprising option. If docker is misbehaving on an agent, the docker executor does not try to work around it, but relays the errors further to the framework. h3. Kill the agent + The container is not orphaned. 
+ The agent does not try to operate until after it (and its executors) can communicate with docker properly. \- Requires non-trivial changes to the code. \- Might be a surprising and undesirable behaviour to operators, especially if there are non-docker-executor workloads on the agent. h3. Only transition to {{TASK_KILLING}} on successful stop Pause the (health) checks, run {{docker stop}}, and send a {{TASK_KILLING}} update only if the command exited with status code 0, otherwise resume (health) checking. Frameworks should then assume that the kill order got lost/failed and retry. + Should require no changes to properly written frameworks. + The container is not orphaned. + A running healthy container will be treated as one and no actions from Marathon will be necessary. \- Some frameworks might have to be updated if they don’t retry kills. \- Since {{docker stop}} command does not exit until after the task finishes (imagine a task with a long grace period), this will effectively defeat the purpose of {{TASK_K
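Of the options above, the "Retry" branch with a bounded number of attempts can be sketched as follows. This is an illustrative policy, not Mesos code; the injectable {{runner}} parameter is a hypothetical hook that lets the retry logic be exercised without a docker daemon:

```python
import subprocess
import time

def docker_stop_with_retries(container_id, retries=3, backoff=1.0,
                             runner=subprocess.run):
    """Retry 'docker stop' a bounded number of times.

    Returns True as soon as an attempt exits 0, False once the retry
    budget is exhausted (at which point a caller would have to pick one
    of the other options: escalate, notify the framework, or give up).
    """
    for attempt in range(retries):
        result = runner(["docker", "stop", container_id])
        if result.returncode == 0:
            return True
        # Exponential backoff between attempts so a briefly overloaded
        # daemon gets a chance to recover.
        time.sleep(backoff * (2 ** attempt))
    return False
```

Note this does not address the hanging-command case from the ticket; each attempt would also need its own timeout (e.g. {{subprocess.run(..., timeout=...)}}) so a stuck {{docker stop}} cannot block the loop forever.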
[jira] [Assigned] (MESOS-7586) Make use of cout/cerr and glog consistent.
[ https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armand Grillet reassigned MESOS-7586: - Assignee: Armand Grillet > Make use of cout/cerr and glog consistent. > -- > > Key: MESOS-7586 > URL: https://issues.apache.org/jira/browse/MESOS-7586 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Budnik >Assignee: Armand Grillet >Priority: Minor > Labels: debugging, log, newbie > > Some parts of Mesos use glog before glog is initialized. This leads to > messages like: > “WARNING: Logging before InitGoogleLogging() is written to STDERR” > Also, messages logged via glog before logging is initialized might not end up in the > log dir. > > The solution might be: > cout/cerr should be used before logging initialization. > glog should be used after logging initialization. > > Usually, a main function follows a pattern like: > 1. load = flags.load(argc, argv) // Load flags from command line. > 2. Check if flags are correct, otherwise print an error message to cerr and then > exit. > 3. Check if the user passed the --help flag to print the help message to cout and then > exit. > 4. Parsing and setup of environment variables. If this fails, the EXIT macro is > used to print an error message via glog. > 5. process::initialize() > 6. logging::initialize() > 7. ... > > Steps 2 and 3 should use cout/cerr to eliminate any extra information > generated by glog like the current time, date and log level. > It is possible to move step 6 between steps 3 and 4 safely, because > logging::initialize() doesn’t depend on process::initialize(). > Some parts of Mesos don’t call logging::initialize(). This should also be > fixed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
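The ordering described in steps 1-7 can be illustrated with a small sketch (a Python stand-in for the C++ {{main()}}; the flag names and the injected helpers are hypothetical, standing in for {{logging::initialize()}} and glog):

```python
import sys

def startup(flags, initialize_logging, log):
    """Illustrate the ordering constraint: steps 2-3 use plain
    stdout/stderr because logging is not initialized yet; the logging
    framework may only be used after initialize_logging() has run."""
    if flags.get("help"):
        print("usage: example [--work_dir=DIR] [--port=N]")  # cout, pre-logging
        return 0
    unknown = sorted(set(flags) - {"help", "work_dir", "port"})
    if unknown:
        # cerr, pre-logging: no glog timestamps or log levels in the output.
        print("unknown flags: %s" % ", ".join(unknown), file=sys.stderr)
        return 1
    initialize_logging()  # step 6, safely moved before environment setup
    log("flags validated, starting up")  # glog-equivalent, post-initialization
    return 0
```

The point of the sketch is that the validation and help branches return before {{initialize_logging()}} is ever called, so their output carries none of the extra glog decoration mentioned in the ticket.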
[jira] [Assigned] (MESOS-7871) Agent fails assertion during request to '/state'
[ https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-7871: Assignee: Andrei Budnik (was: Greg Mann) https://reviews.apache.org/r/61524/ > Agent fails assertion during request to '/state' > > > Key: MESOS-7871 > URL: https://issues.apache.org/jira/browse/MESOS-7871 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Greg Mann >Assignee: Andrei Budnik > Labels: mesosphere > > While processing requests to {{/state}}, the Mesos agent calls > {{Framework::allocatedResources()}}, which in turn calls > {{Slave::getExecutorInfo()}} on executors associated with the framework's > pending tasks. > In the case of tasks launched as part of task groups, this leads to the > failure of the assertion > [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985]. > This means that the check will fail if the agent processes a request to > {{/state}} at a time when it has pending tasks launched as part of a task > group. > This assertion should be removed since this helper function is now used with > task groups. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-5078) Document TaskStatus reasons
[ https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-5078: --- Shepherd: Alexander Rukletsov > Document TaskStatus reasons > --- > > Key: MESOS-5078 > URL: https://issues.apache.org/jira/browse/MESOS-5078 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Benno Evers > Labels: documentation, mesosphere, newbie++ > > We should document the possible {{reason}} values that can be found in the > {{TaskStatus}} message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.
[ https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6743: --- Sprint: Mesosphere Sprint 61 > Docker executor hangs forever if `docker stop` fails. > - > > Key: MESOS-6743 > URL: https://issues.apache.org/jira/browse/MESOS-6743 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.0.1, 1.1.0, 1.2.1, 1.3.0 >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik >Priority: Critical > Labels: mesosphere, reliability > > If {{docker stop}} finishes with an error status, the executor should catch > this and react instead of indefinitely waiting for {{reaped}} to return. > An interesting question is _how_ to react. Here are possible solutions. > 1. Retry {{docker stop}}. In this case it is unclear how many times to retry > and what to do if {{docker stop}} continues to fail. > 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. > However, in this case it is unclear what status updates we should send: > {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill > a task? or set a specific reason in {{TASK_KILLING}}? > 3. Clean up and exit. In this case we should make sure the task container is > killed or notify the framework and the operator that the container may still > be running. -- This message was sent by Atlassian JIRA (v6.4.14#64029)