[jira] [Updated] (MESOS-7721) Master's agent removal rate limit also applies to agent unreachability.

2017-06-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7721:
---
Priority: Critical  (was: Major)

> Master's agent removal rate limit also applies to agent unreachability.
> ---
>
> Key: MESOS-7721
> URL: https://issues.apache.org/jira/browse/MESOS-7721
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Benjamin Mahler
>Priority: Critical
>
> Currently, the implementation of partition awareness re-uses the 
> {{--agent_removal_rate_limit}} when marking agents as unreachable. This means 
> that partition-aware frameworks are exposed to the agent removal rate limit, 
> when they would rather see the information immediately and impose their own 
> rate limiting.
> Rather than waiting for non-partition-aware support to be removed (which may 
> not occur for a long time) per MESOS-5948, we should instead fix the 
> implementation so that unreachability is not gated behind the agent removal 
> rate limiting.
> Marking this as a bug since, from the user's perspective, it doesn't behave 
> as expected. There should be a separate flag for rate limiting unreachability 
> marking, but unreachability marking likely does not need rate limiting, since 
> the intention was for frameworks to impose their own rate limiting for 
> replacing tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7721) Master's agent removal rate limit also applies to agent unreachability.

2017-06-25 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7721:
--

 Summary: Master's agent removal rate limit also applies to agent 
unreachability.
 Key: MESOS-7721
 URL: https://issues.apache.org/jira/browse/MESOS-7721
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Benjamin Mahler


Currently, the implementation of partition awareness re-uses the 
{{--agent_removal_rate_limit}} when marking agents as unreachable. This means 
that partition-aware frameworks are exposed to the agent removal rate limit, 
when they would rather see the information immediately and impose their own 
rate limiting.

Rather than waiting for non-partition-aware support to be removed (which may 
not occur for a long time) per MESOS-5948, we should instead fix the 
implementation so that unreachability is not gated behind the agent removal 
rate limiting.

Marking this as a bug since, from the user's perspective, it doesn't behave as 
expected. There should be a separate flag for rate limiting unreachability 
marking, but unreachability marking likely does not need rate limiting, since 
the intention was for frameworks to impose their own rate limiting for 
replacing tasks.
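
To illustrate what "frameworks impose their own rate limiting" could look like, here is a minimal token-bucket sketch. It is not Mesos or scheduler API code: the class name, parameters, and the idea of passing the clock in explicitly are all assumptions made for the example. A framework would consult such a limiter before relaunching a task from an unreachable agent.

```python
class TokenBucket:
    """Hypothetical framework-side limiter for task replacements."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.capacity = float(burst)      # maximum tokens held at once
        self.tokens = float(burst)        # start with a full bucket
        self.last = float(now)            # time of the last refill

    def try_acquire(self, now):
        """Return True if one replacement may proceed at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a limiter like this on the framework side, the master could report unreachability immediately and each framework would decide how quickly to replace the affected tasks.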





[jira] [Comment Edited] (MESOS-7160) Parsing of perf version segfaults

2017-06-25 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062533#comment-16062533
 ] 

James Peach edited comment on MESOS-7160 at 6/26/17 4:53 AM:
-

Ping [~abudnik]. This is probably what is triggering the abort in Mesosphere CI.


was (Author: jamespeach):
Ping [~abudnik] ^^^

> Parsing of perf version segfaults
> -
>
> Key: MESOS-7160
> URL: https://issues.apache.org/jira/browse/MESOS-7160
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>
> Parsing the perf version [fails with a segfault in ASF 
> CI|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/3294/],
> {noformat}
> E0222 20:54:03.033464   805 perf.cpp:237] Failed to get perf version: Failed 
> to execute perf: terminated with signal Aborted (core dumped)
> {noformat}





[jira] [Comment Edited] (MESOS-7160) Parsing of perf version segfaults

2017-06-25 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062532#comment-16062532
 ] 

James Peach edited comment on MESOS-7160 at 6/26/17 4:45 AM:
-

I ran into this in a VM dev environment, and one way this can happen is due to 
a race with the use of {{Subprocess::ChildHook::SUPERVISOR}}. If {{perf}} is 
not installed, then the {{execv}} will fail, and this can happen before the 
supervisor parent process calls {{waitpid}}. Then {{waitpid}} returns -1 (with 
{{ESRCH}}) and the supervisor calls {{abort()}}.
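
As a rough illustration of the alternative to aborting, the sketch below shows a wait helper that treats "no such child" as a non-fatal outcome. It is not the Mesos supervisor code and does not reproduce the race itself; the function names and the 127 exit code are assumptions for the example. In Python, a failed wait of this kind surfaces as {{ChildProcessError}} (errno {{ECHILD}}).

```python
import os


def spawn_exec(path):
    """Fork a child that tries to exec `path`; exit 127 if exec fails."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(path, [path, "--version"])
        except OSError:
            # exec failed (for example, the binary is not installed).
            os._exit(127)
    return pid


def wait_tolerant(pid):
    """Wait for `pid`; return its exit code, or None if it cannot be waited on.

    Returning None instead of aborting lets the caller report a normal
    error when the child is already gone.
    """
    try:
        _, status = os.waitpid(pid, 0)
    except ChildProcessError:
        return None
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None
```

The point of the design is only that a failed wait becomes an ordinary return value that the caller can turn into an error message, rather than a crash of the supervising process.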


was (Author: jamespeach):
I looked






[jira] [Commented] (MESOS-7160) Parsing of perf version segfaults

2017-06-25 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062533#comment-16062533
 ] 

James Peach commented on MESOS-7160:


Ping [~abudnik] ^^^






[jira] [Commented] (MESOS-7160) Parsing of perf version segfaults

2017-06-25 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062532#comment-16062532
 ] 

James Peach commented on MESOS-7160:


I looked






[jira] [Commented] (MESOS-7675) Isolate network ports.

2017-06-25 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062527#comment-16062527
 ] 

James Peach commented on MESOS-7675:


{quote}
Also, this seems like we need to perform the algorithm for the lifetime of 
every task running on the agent?
{quote}

Yes, this would behave similarly to the {{posix/disk}} isolator where we 
periodically scan to check the resource usage. I couldn't find any way to get 
netlink notifications on listening sockets.

{quote}
I am assuming this would work only for tasks on the host network.
{quote}

You could do the same algorithm with a network namespace, though it would be a 
bit more involved and in most cases it wouldn't be especially helpful. For now 
I'm only proposing to do this for the host network.

> Isolate network ports.
> --
>
> Key: MESOS-7675
> URL: https://issues.apache.org/jira/browse/MESOS-7675
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> If a task uses network ports, there is no isolator that can enforce that it 
> only listens on the ports that it has resources for. Implement a ports 
> isolator that can limit tasks to listen only on allocated TCP ports.
> Roughly, the algorithm for this follows what standard tools like {{lsof}} and 
> {{ss}} do.
> * Find all the listening TCP sockets (using netlink)
> * Index the sockets by their inode (from the netlink information)
> * Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} 
> links)
> * For each open socket, check whether its inode (given in the link target) is 
> in the set of listening sockets that we scanned
> * If the socket is a listening socket and the corresponding PID is in the 
> task, send a resource limitation for the task
> Matching pids to tasks depends on using cgroup isolation, otherwise we would 
> have to build a full process tree, which would be nice to avoid.
> Scanning all the open sockets can be avoided by using the {{net_cls}} 
> isolator with kernel + libnl3 patches to publish the socket classid when we 
> find the listening socket.





[jira] [Updated] (MESOS-7675) Isolate network ports.

2017-06-25 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7675:
---
Description: 
If a task uses network ports, there is no isolator that can enforce that it 
only listens on the ports that it has resources for. Implement a ports isolator 
that can limit tasks to listen only on allocated TCP ports.

Roughly, the algorithm for this follows what standard tools like {{lsof}} and 
{{ss}} do.

* Find all the listening TCP sockets (using netlink)
* Index the sockets by their inode (from the netlink information)
* Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} links)
* For each open socket, check whether its inode (given in the link target) is 
in the set of listening sockets that we scanned
* If the socket is a listening socket and the corresponding PID is in the task, 
send a resource limitation for the task

Matching pids to tasks depends on using cgroup isolation, otherwise we would 
have to build a full process tree, which would be nice to avoid.

Scanning all the open sockets can be avoided by using the {{net_cls}} isolator 
with kernel + libnl3 patches to publish the socket classid when we find the 
listening socket.
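
The steps above can be sketched roughly as follows. This is an illustration only, not the proposed isolator: it reads text in the {{/proc/net/tcp}} format instead of using netlink, takes the task's PIDs as an argument rather than deriving them from a cgroup, and all function names are hypothetical.

```python
import os
import re

TCP_LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp
SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def parse_listening(proc_net_tcp_text):
    """Return {inode: port} for every socket in the LISTEN state."""
    listeners = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10 or fields[3] != TCP_LISTEN:
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)  # local port is hex
        listeners[int(fields[9])] = port             # field 9 is the inode
    return listeners


def socket_inodes(pid, proc="/proc"):
    """Return the socket inodes open in `pid`, read from its fd links."""
    inodes = set()
    fd_dir = os.path.join(proc, str(pid), "fd")
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return inodes  # process exited or is not readable
    for name in names:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue
        match = SOCKET_LINK.match(target)
        if match:
            inodes.add(int(match.group(1)))
    return inodes


def unallocated_ports(listeners, pids, allowed_ports, proc="/proc"):
    """Return ports the given PIDs listen on that are not in allowed_ports."""
    bad = set()
    for pid in pids:
        for inode in socket_inodes(pid, proc):
            port = listeners.get(inode)
            if port is not None and port not in allowed_ports:
                bad.add(port)
    return bad
```

A non-empty result from {{unallocated_ports}} is the condition under which the isolator would send a resource limitation for the task.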

  was:
If a task uses network ports, there is no isolator that can enforce that it 
only listens on the ports that it has resources for. Implement a ports isolator 
that can limit tasks to listen only on allocated TCP ports.

Roughly, the algorithm for this follows what standard tools like {{lsof}} and 
{{ss}} do.

* Find all the listening TCP sockets (using netlink)
* Index the sockets by their node (from the netlink information)
* Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} links)
* For each open socket, check whether its node (given in the link target) in 
the set of listen sockets that we scanned
* If the socket is a listening socket and the corresponding PID is in the task, 
send a resource limitation for the task

Matching pids to tasks depends on using group isolation, otherwise we would 
have to build a full process tree, which would be nice to avoid.

Scanning all the open sockets can be avoided by using the {{net_cls}} isolator 
with kernel + libnl3 patches to publish the socket classid when we find the 
listening socket.







[jira] [Commented] (MESOS-7675) Isolate network ports.

2017-06-25 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062520#comment-16062520
 ] 

Avinash Sridharan commented on MESOS-7675:
--

[~jpe...@apache.org] I am assuming this would work only for tasks on the host 
network. Also, this seems like we need to perform the algorithm for the 
lifetime of every task running on the agent? How do you propose we do this? By 
doing a periodic scan?

PS: By group isolation, did you mean cgroup isolation?

> Isolate network ports.
> --
>
> Key: MESOS-7675
> URL: https://issues.apache.org/jira/browse/MESOS-7675
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> If a task uses network ports, there is no isolator that can enforce that it 
> only listens on the ports that it has resources for. Implement a ports 
> isolator that can limit tasks to listen only on allocated TCP ports.
> Roughly, the algorithm for this follows what standard tools like {{lsof}} and 
> {{ss}} do.
> * Find all the listening TCP sockets (using netlink)
> * Index the sockets by their inode (from the netlink information)
> * Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} 
> links)
> * For each open socket, check whether its inode (given in the link target) is 
> in the set of listening sockets that we scanned
> * If the socket is a listening socket and the corresponding PID is in the 
> task, send a resource limitation for the task
> Matching pids to tasks depends on using group isolation, otherwise we would 
> have to build a full process tree, which would be nice to avoid.
> Scanning all the open sockets can be avoided by using the {{net_cls}} 
> isolator with kernel + libnl3 patches to publish the socket classid when we 
> find the listening socket.





[jira] [Updated] (MESOS-7709) Add --dns flag to the agent.

2017-06-25 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-7709:
--
Description: 
Mesos supports both the CNI (through the `network/cni` isolator) and CNM 
(through docker) specifications. Both of these specifications allow DNS entries 
for containers to be set on a per-container and per-network basis. 

Currently, the behavior of the agent is to use the DNS nameservers set in 
/etc/resolv.conf when the CNI or CNM plugin that is used to attach the 
container to the CNI/CNM network doesn't explicitly set the DNS for the 
container. This is a bit inflexible, especially when we have a mix of v4 and v6 
networks. 

The operator should be able to specify DNS nameservers for the networks they 
install, either to override the ones provided by the plugin or as defaults 
when the plugins are not going to specify DNS nameservers.

In order to achieve the above goal we need to introduce a `--dns` flag to the 
agent. The `--dns` flag should support a JSON (or a JSON file) with the 
following schema:
{code}
{
  "mesos": {
    [
      {
        "network" : ,
        "nameservers": []
      }
    ]
  },
  "docker": {
    [
      {
        "network" : ,
        "nameservers": []
      }
    ]
  }
}
{code}
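
A minimal sketch of how an agent might resolve nameservers from such a flag is shown below. It rests on several assumptions that the description does not settle: that each of "mesos" and "docker" maps to a list of entries, that "network" is a name string and "nameservers" a list of address strings, and that an `override` switch chooses between the "override" and "default" behaviors. The function name and the example values are hypothetical.

```python
import json


def nameservers_for(config, runtime, network, plugin_dns=None, override=False):
    """Pick nameservers for `network` under `runtime` ("mesos" or "docker").

    Operator-configured servers are used as defaults when the plugin sets
    none, or always when `override` is True. An empty result means the
    caller falls back to its existing behavior (such as /etc/resolv.conf).
    """
    entries = config.get(runtime, [])
    configured = next(
        (e["nameservers"] for e in entries if e["network"] == network), None
    )
    if configured and (override or not plugin_dns):
        return configured
    return plugin_dns or []


# Example input using the assumed shape; names and addresses are made up.
example = json.loads(
    '{"mesos": [{"network": "net1", "nameservers": ["10.0.0.2"]}], "docker": []}'
)
```

For instance, `nameservers_for(example, "mesos", "net1")` yields the operator's servers, while passing `plugin_dns` without `override=True` keeps whatever the plugin supplied.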

  was:
Mesos support both CNI (through `network/cni` isolator) and CNM (through 
docker) specification. Both these specifications allow for DNS entries for 
containers to be set on a per-container, and per-network basis. 

Currently, the behavior of the agent is to use the DNS nameservers set in 
/etc/resolv.conf when the CNI or CNM plugin that is used to attached the 
container to the CNI/CNM network doesnt' explicitly set the DNS for the 
container. This is a bit inflexible especially when we have a mix of v4 and v6 
networks. 

The operator should be able to specify DNS nameservers for the networks he 
installs either the override the ones provided by the plugin or as defaults 
when the plugins are not going to specify DNS name servers.

In order to achieve the above goal we need to introduce a `--dns` flag to the 
agent. The `--dns` flag should support a JSON (or a JSON file) with the 
following schema:
{
  "mesos": {
 [ 
   { "network" : ,
 "nameservers": []
   }
 ]
  },
  "docker": {
[ 
   { "network" : ,
 "nameservers": []
   }
 ]
  }
}


> Add --dns flag to the agent.
> 
>
> Key: MESOS-7709
> URL: https://issues.apache.org/jira/browse/MESOS-7709
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>


