[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Fix Version/s: 1.1.3

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Fix Version/s: 1.1.3

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack TCP timeout). At this point, the connection is timed out 
> and no longer tracked by conntrack. From what we've seen, if the agent stays 
> up, the packets still flow between the executor and agent. However, once the 
> agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).
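
The retry idea in the MESOS-7569 description can be illustrated with a minimal sketch. It is written in Python rather than the agent's own C++, and every name in it (`OldExecutor`, `handle_reconnect`, `reconnect_with_retries`) is hypothetical and not a Mesos API. The toy executor only models the behavior reported above: its first reply is lost when the stale half-open socket closes, and a second reconnect message is answered over a fresh link.

```python
from typing import Callable


def reconnect_with_retries(send_reconnect: Callable[[], bool],
                           max_attempts: int = 2) -> int:
    """Send the reconnect message until the executor re-registers.

    Returns the 1-based attempt number that succeeded, or 0 if none did.
    A real agent would also wait between attempts and stop at a timeout;
    both are omitted here to keep the sketch short.
    """
    for attempt in range(1, max_attempts + 1):
        if send_reconnect():
            return attempt
    return 0


class OldExecutor:
    """Toy model (an assumption) of an executor with a half-open link."""

    def __init__(self) -> None:
        self.link_is_stale = True

    def handle_reconnect(self) -> bool:
        if self.link_is_stale:
            # The reply goes out on the dead socket, which then closes.
            # The reply is lost, but the stale link is gone afterwards.
            self.link_is_stale = False
            return False
        # A fresh link is established, so the reply reaches the agent.
        return True
```

With a single attempt the toy executor never re-registers, which corresponds to the case where the agent destroys it; with retrying enabled, the second attempt succeeds.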





[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Fix Version/s: 1.2.2

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack TCP timeout). At this point, the connection is timed out 
> and no longer tracked by conntrack. From what we've seen, if the agent stays 
> up, the packets still flow between the executor and agent. However, once the 
> agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).





[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Fix Version/s: 1.2.2

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.





[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Fix Version/s: 1.3.1

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.3.1, 1.4.0
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.





[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Fix Version/s: 1.3.1

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 1.3.1, 1.4.0
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack TCP timeout). At this point, the connection is timed out 
> and no longer tracked by conntrack. From what we've seen, if the agent stays 
> up, the packets still flow between the executor and agent. However, once the 
> agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).





[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Summary: Add an agent flag for executor re-registration timeout.  (was: Add 
an agent flag for executor re-register timeout)

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.4.0
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.





[jira] [Created] (MESOS-7579) Deprecate GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-7579:
--

 Summary: Deprecate GPU_RESOURCES capability and master flag 
`--filter-gpu-resources={true|false}`
 Key: MESOS-7579
 URL: https://issues.apache.org/jira/browse/MESOS-7579
 Project: Mesos
  Issue Type: Task
  Components: allocation, gpu
Reporter: Kevin Klues


Once we reach Mesos 2.0, we should completely remove the GPU_RESOURCES 
capability and the corresponding {{--filter-gpu-resources}} flag that controls 
whether the allocator honors this capability or not.
It will have been deprecated once support for {{dynamic reservations}}, 
{{hierarchical roles}}, and {{support for reservations to multiple roles}} has 
landed. The JIRA tickets tracking these features as blockers to this ticket are 
linked below.





[jira] [Updated] (MESOS-7577) Remove GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-7577:
---
Target Version/s: 2.1.0
 Description: 
Once we reach Mesos 2.0, we should completely remove the GPU_RESOURCES 
capability and the corresponding {{--filter-gpu-resources}} flag that controls 
whether the allocator honors this capability or not.
It will have been deprecated once support for {{dynamic reservations}}, 
{{hierarchical roles}}, and {{support for reservations to multiple roles}} has 
landed. The JIRA tickets tracking these features as blockers to this ticket are 
linked below.

  was:
This flag was added as a temporary way to enable / disable honoring the 
GPU_RESOURCES framework capability. We should remove it once we have better 
support for achieving the same functionality that the GPU_RESOURCES capability 
gives you.

This support relies on dynamic reservations, hierarchical roles, and support 
for reservations to multiple roles (a not-yet-implemented feature). The JIRA 
tickets tracking these features as blockers to this ticket are linked below.


> Remove GPU_RESOURCES capability and master flag 
> `--filter-gpu-resources={true|false}`
> -
>
> Key: MESOS-7577
> URL: https://issues.apache.org/jira/browse/MESOS-7577
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, gpu
>Reporter: Kevin Klues
>
> Once we reach Mesos 2.0, we should completely remove the GPU_RESOURCES 
> capability and the corresponding {{--filter-gpu-resources}} flag that controls 
> whether the allocator honors this capability or not.
> It will have been deprecated once support for {{dynamic reservations}}, 
> {{hierarchical roles}}, and {{support for reservations to multiple roles}} 
> has landed. The JIRA tickets tracking these features as blockers to this 
> ticket are linked below.





[jira] [Updated] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-05-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-7500:
--
Sprint: Mesosphere Sprint 56  (was: Mesosphere Sprint 56, Mesosphere Sprint 
57)

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu





[jira] [Updated] (MESOS-7577) Remove GPU_RESOURCES capability and remove master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-7577:
---
Summary: Remove GPU_RESOURCES capability and remove master flag 
`--filter-gpu-resources={true|false}`  (was: Remove master flag 
`--filter-gpu-resources={true|false}`)

> Remove GPU_RESOURCES capability and remove master flag 
> `--filter-gpu-resources={true|false}`
> 
>
> Key: MESOS-7577
> URL: https://issues.apache.org/jira/browse/MESOS-7577
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, gpu
>Reporter: Kevin Klues
>
> This flag was added as a temporary way to enable / disable honoring the 
> GPU_RESOURCES framework capability. We should remove it once we have better 
> support for achieving the same functionality that the GPU_RESOURCES 
> capability gives you.
> This support relies on dynamic reservations, hierarchical roles, and support 
> for reservations to multiple roles (a not-yet-implemented feature). The JIRA 
> tickets tracking these features as blockers to this ticket are linked below.





[jira] [Updated] (MESOS-7578) Write a proposal to make the I/O Switchboards optional

2017-05-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-7578:
--
Labels: check containerizer health-check mesosphere  (was: mesosphere)

> Write a proposal to make the I/O Switchboards optional
> --
>
> Key: MESOS-7578
> URL: https://issues.apache.org/jira/browse/MESOS-7578
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: check, containerizer, health-check, mesosphere
>
> Right now DEBUG containers can only be started using the 
> LaunchNestedContainerSession API call. They will enter their parent’s 
> namespaces, inherit environment variables, stream their I/O, and Mesos will tie 
> their life-cycle to the lifetime of the HTTP connection.
> Streaming the I/O of a container requires an I/O Switchboard and adds some 
> overhead and complexity:
> - Mesos will launch an extra process, called an I/O Switchboard, for each 
> nested container. These processes aren’t free: they take some time to 
> create/destroy and consume resources.
> - I/O Switchboards are managed by a complex isolator.
> - I/O Switchboards introduce new race conditions, and have been a source of 
> deadlocks in the past. 
> Some use cases require some of the features provided by DEBUG containers, but 
> don’t need the functionality provided by the I/O Switchboard. For instance, 
> the Default Executor uses DEBUG containers to perform (health)checks, but it 
> doesn’t need to stream anything to/from the container. 





[jira] [Updated] (MESOS-7577) Remove GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-7577:
---
Summary: Remove GPU_RESOURCES capability and master flag 
`--filter-gpu-resources={true|false}`  (was: Remove GPU_RESOURCES capability 
and remove master flag `--filter-gpu-resources={true|false}`)

> Remove GPU_RESOURCES capability and master flag 
> `--filter-gpu-resources={true|false}`
> -
>
> Key: MESOS-7577
> URL: https://issues.apache.org/jira/browse/MESOS-7577
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, gpu
>Reporter: Kevin Klues
>
> This flag was added as a temporary way to enable / disable honoring the 
> GPU_RESOURCES framework capability. We should remove it once we have better 
> support for achieving the same functionality that the GPU_RESOURCES 
> capability gives you.
> This support relies on dynamic reservations, hierarchical roles, and support 
> for reservations to multiple roles (a not-yet-implemented feature). The JIRA 
> tickets tracking these features as blockers to this ticket are linked below.





[jira] [Created] (MESOS-7578) Write a proposal to make the I/O Switchboards optional

2017-05-26 Thread JIRA
Gastón Kleiman created MESOS-7578:
-

 Summary: Write a proposal to make the I/O Switchboards optional
 Key: MESOS-7578
 URL: https://issues.apache.org/jira/browse/MESOS-7578
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gastón Kleiman
Assignee: Gastón Kleiman


Right now DEBUG containers can only be started using the 
LaunchNestedContainerSession API call. They will enter their parent’s namespaces, 
inherit environment variables, stream their I/O, and Mesos will tie their 
life-cycle to the lifetime of the HTTP connection.

Streaming the I/O of a container requires an I/O Switchboard and adds some 
overhead and complexity:

- Mesos will launch an extra process, called an I/O Switchboard, for each nested 
container. These processes aren’t free: they take some time to create/destroy and 
consume resources.
- I/O Switchboards are managed by a complex isolator.
- I/O Switchboards introduce new race conditions, and have been a source of 
deadlocks in the past. 

Some use cases require some of the features provided by DEBUG containers, but 
don’t need the functionality provided by the I/O Switchboard. For instance, the 
Default Executor uses DEBUG containers to perform (health)checks, but it 
doesn’t need to stream anything to/from the container. 





[jira] [Created] (MESOS-7577) Remove master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-7577:
--

 Summary: Remove master flag `--filter-gpu-resources={true|false}`
 Key: MESOS-7577
 URL: https://issues.apache.org/jira/browse/MESOS-7577
 Project: Mesos
  Issue Type: Task
  Components: allocation, gpu
Reporter: Kevin Klues


This flag was added as a temporary way to enable / disable honoring the 
GPU_RESOURCES framework capability. We should remove it once we have better 
support for achieving the same functionality that the GPU_RESOURCES capability 
gives you.

This support relies on dynamic reservations, hierarchical roles, and support 
for reservations to multiple roles (a not-yet-implemented feature). The JIRA 
tickets tracking these features as blockers to this ticket are linked below.





[jira] [Updated] (MESOS-7576) Add master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-7576:
---
Description: 
Per the email thread below, we are adding a new flag on the master called 
{{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
framework capability.

https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html

When set to {{true}}, this flag will cause the mesos master to continue to
function as it does today. That is, it will filter offers containing GPU
resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} 
framework capability. When set to {{false}}, this flag will cause the master to 
*not* filter offers containing GPU resources, and indiscriminately send them to 
all frameworks whether they set the {{GPU_RESOURCES}} capability or not.

This is a temporary flag that will eventually be removed. We will remove it 
once we have better support for achieving the same functionality that the 
{{GPU_RESOURCES}} capability gives you.

As described in the email, this support relies on {{dynamic reservations}}, 
{{hierarchical roles}}, and support for {{reservations to multiple roles}} (a 
not-yet-implemented feature).  The JIRA tickets tracking these features are linked below.

  was:
Per the email thread below, we are adding a new flag on the master called 
{{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
framework capability.

https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html

When set to {{true}}, this flag will cause the mesos master to continue to
function as it does today. That is, it will filter offers containing GPU
resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} 
framework capability. When set to {{false}}, this flag will cause the master to 
*not* filter offers containing GPU resources, and indiscriminately send them to 
all frameworks whether they set the {{GPU_RESOURCES}} capability or not.

This is a temporary flag that will eventually be removed. We will remove it 
once we have better support for achieving the same functionality that the 
{{GPU_RESOURCES}} capability gives you.

As described in the email, this support relies {{dynamic reservations}}, 
{{hierarchical roles}}, and support for {{reservations to multiple roles}} (an 
unyet implemented feature).  The JIRA tracking these features are linked below.


> Add master flag `--filter-gpu-resources={true|false}`
> -
>
> Key: MESOS-7576
> URL: https://issues.apache.org/jira/browse/MESOS-7576
> Project: Mesos
>  Issue Type: Task
>  Components: gpu
>Affects Versions: 1.2.0
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>
> Per the email thread below, we are adding a new flag on the master called 
> {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
> framework capability.
> https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html
> When set to {{true}}, this flag will cause the mesos master to continue to
> function as it does today. That is, it will filter offers containing GPU
> resources and only send them to frameworks that opt into the 
> {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will 
> cause the master to *not* filter offers containing GPU resources, and 
> indiscriminately send them to all frameworks whether they set the 
> {{GPU_RESOURCES}} capability or not.
> This is a temporary flag that will eventually be removed. We will remove it 
> once we have better support for achieving the same functionality that the 
> {{GPU_RESOURCES}} capability gives you.
> As described in the email, this support relies on {{dynamic reservations}}, 
> {{hierarchical roles}}, and support for {{reservations to multiple roles}} 
> (a not-yet-implemented feature).  The JIRA tickets tracking these features are 
> linked below.
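
The filtering rule in the MESOS-7576 description can be written down as a small predicate. This is a sketch in Python, not the allocator's actual code: `Framework`, `may_offer`, and the `filter_gpu_resources` parameter are hypothetical names chosen for illustration, and only the `GPU_RESOURCES` capability name comes from the ticket.

```python
from dataclasses import dataclass, field
from typing import Set

GPU_RESOURCES = "GPU_RESOURCES"  # capability name taken from the ticket


@dataclass
class Framework:
    """Hypothetical stand-in for a registered framework."""
    name: str
    capabilities: Set[str] = field(default_factory=set)


def may_offer(agent_has_gpus: bool,
              framework: Framework,
              filter_gpu_resources: bool = True) -> bool:
    """Decide whether an agent's resources may be offered to a framework.

    With filtering on, agents that have GPUs are offered only to
    frameworks that opted into GPU_RESOURCES. With filtering off,
    every framework may receive the offer.
    """
    if not agent_has_gpus:
        return True  # nothing to filter on agents without GPUs
    if not filter_gpu_resources:
        return True  # flag set to false: send offers indiscriminately
    return GPU_RESOURCES in framework.capabilities
```

For example, `may_offer(True, Framework("web"))` is false with the default setting, while passing `filter_gpu_resources=False` makes the same call true.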





[jira] [Updated] (MESOS-7576) Add master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-7576:
---
Description: 
Per the email thread below, we are adding a new flag on the master called 
{{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
framework capability.

https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html

When set to {{true}}, this flag will cause the mesos master to continue to
function as it does today. That is, it will filter offers containing GPU
resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} 
framework capability. When set to {{false}}, this flag will cause the master to 
*not* filter offers containing GPU resources, and indiscriminately send them to 
all frameworks whether they set the {{GPU_RESOURCES}} capability or not.

This is a temporary flag that will eventually be removed. We will remove it 
once we have better support for achieving the same functionality that the 
{{GPU_RESOURCES}} capability gives you.

As described in the email, this support relies {{dynamic reservations}}, 
{{hierarchical roles}}, and support for {{reservations to multiple roles}} (an 
unyet implemented feature).  The JIRA tracking these features are linked below.

  was:
Per the email thread below, we are adding a new flag on the master called 
{{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
framework capability.

https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html

When set to {{true}}, this flag will cause the mesos master to continue to
function as it does today. That is, it will filter offers containing GPU
resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} 
framework capability. When set to {{false}}, this flag will cause the master to 
*not* filter offers containing GPU resources, and indiscriminately send them to 
all frameworks whether they set the {{GPU_RESOURCES}} capability or not.

This is a temporary flag that will eventually be removed. We will remove it 
once we have better support for achieving the same functionality that the 
{{GPU_RESOURCES}} capability gives you.

As described in the email, this support relies {{reservations}}, {{hierarchical 
roles}}, and support for {{reservations to multiple roles}} (an unyet 
implemented feature).  The JIRA tracking these features are linked below.


> Add master flag `--filter-gpu-resources={true|false}`
> -
>
> Key: MESOS-7576
> URL: https://issues.apache.org/jira/browse/MESOS-7576
> Project: Mesos
>  Issue Type: Task
>  Components: gpu
>Affects Versions: 1.2.0
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>
> Per the email thread below, we are adding a new flag on the master called 
> {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
> framework capability.
> https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html
> When set to {{true}}, this flag will cause the mesos master to continue to
> function as it does today. That is, it will filter offers containing GPU
> resources and only send them to frameworks that opt into the 
> {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will 
> cause the master to *not* filter offers containing GPU resources, and 
> indiscriminately send them to all frameworks whether they set the 
> {{GPU_RESOURCES}} capability or not.
> This is a temporary flag that will eventually be removed. We will remove it 
> once we have better support for achieving the same functionality that the 
> {{GPU_RESOURCES}} capability gives you.
> As described in the email, this support relies {{dynamic reservations}}, 
> {{hierarchical roles}}, and support for {{reservations to multiple roles}} 
> (an unyet implemented feature).  The JIRA tracking these features are linked 
> below.





[jira] [Commented] (MESOS-7575) Support hierarchical reservations

2017-05-26 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026861#comment-16026861
 ] 

Michael Park commented on MESOS-7575:
-

Design doc: 
https://docs.google.com/document/d/1Di6drHrBs3FWYJXKQjCTqQMi2PfdtCrf4OcuP3RZqmk/

> Support hierarchical reservations
> -
>
> Key: MESOS-7575
> URL: https://issues.apache.org/jira/browse/MESOS-7575
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>
> With the introduction of hierarchical roles, Mesos provides a mechanism to 
> delegate resources down a hierarchy. To complement this, we need to introduce 
> a notion of hierarchical reservations so that we can *refine* the 
> reservations down the hierarchy.





[jira] [Created] (MESOS-7575) Support hierarchical reservations

2017-05-26 Thread Michael Park (JIRA)
Michael Park created MESOS-7575:
---

 Summary: Support hierarchical reservations
 Key: MESOS-7575
 URL: https://issues.apache.org/jira/browse/MESOS-7575
 Project: Mesos
  Issue Type: Bug
Reporter: Michael Park


With the introduction of hierarchical roles, Mesos provides a mechanism to 
delegate resources down a hierarchy. To complement this, we need to introduce a 
notion of hierarchical reservations so that we can *refine* the reservations 
down the hierarchy.





[jira] [Created] (MESOS-7576) Add master flag `--filter-gpu-resources={true|false}`

2017-05-26 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-7576:
--

 Summary: Add master flag `--filter-gpu-resources={true|false}`
 Key: MESOS-7576
 URL: https://issues.apache.org/jira/browse/MESOS-7576
 Project: Mesos
  Issue Type: Task
  Components: gpu
Affects Versions: 1.2.0
Reporter: Kevin Klues
Assignee: Kevin Klues


Per the email thread below, we are adding a new flag on the master called 
{{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} 
framework capability.

https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html

When set to {{true}}, this flag will cause the mesos master to continue to
function as it does today. That is, it will filter offers containing GPU
resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} 
framework capability. When set to {{false}}, this flag will cause the master to 
*not* filter offers containing GPU resources, and indiscriminately send them to 
all frameworks whether they set the {{GPU_RESOURCES}} capability or not.

This is a temporary flag that will eventually be removed. We will remove it 
once we have better support for achieving the same functionality that the 
{{GPU_RESOURCES}} capability gives you.

As described in the email, this support relies {{reservations}}, {{hierarchical 
roles}}, and support for {{reservations to multiple roles}} (an unyet 
implemented feature).  The JIRA tracking these features are linked below.





[jira] [Updated] (MESOS-7575) Support hierarchical reservations

2017-05-26 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7575:

Sprint: Mesosphere Sprint 57
Issue Type: Task  (was: Bug)

> Support hierarchical reservations
> -
>
> Key: MESOS-7575
> URL: https://issues.apache.org/jira/browse/MESOS-7575
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>
> With the introduction of hierarchical roles, Mesos provides a mechanism to 
> delegate resources down a hierarchy. To complement this, we need to introduce 
> a notion of hierarchical reservations so that we can *refine* the 
> reservations down the hierarchy.





[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-05-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026856#comment-16026856
 ] 

Gastón Kleiman commented on MESOS-7500:
---

The failures seem to be related to the agent not being able to attach to the 
DEBUG container launched by the health checker.

This is however not really necessary for checks, so I created a [design 
document|https://docs.google.com/document/d/1YCMtH8i2-ovTVtKDsCTrXdygS7ieaSrJLVnFbR66qfA/]
 with two proposals that'd make it possible to start DEBUG containers without 
an I/O switchboard.

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu





[jira] [Created] (MESOS-7574) Allow reservations to multiple roles.

2017-05-26 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7574:
--

 Summary: Allow reservations to multiple roles.
 Key: MESOS-7574
 URL: https://issues.apache.org/jira/browse/MESOS-7574
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


There have been some discussions for allowing reservations to multiple roles 
(or more generally, role expressions).

E.g. All resources on GPU agents are reserved for "eng/machine-learning" or 
"finance/forecasting" or "data-science/modeling" to use, because these are the 
roles in my organization that make use of GPUs, and I want to guarantee that 
none of the non-GPU workloads tie up the GPU machines' cpus/mem/disk.

This GPU-related example would allow us to deprecate and remove the 
GPU_RESOURCES capability, which is a hack implementation of reservations to 
multiple roles: Mesos will only offer GPU machine resources to GPU-capable 
schedulers. Having the ability to make reservations to multiple roles obviates 
this hack.

With hierarchical roles, we have a restricted version of reservations to 
multiple roles, where the roles are restricted to the descendant roles. For 
example, a reservation for "gpu-workloads" can be allocated to 
"gpu-workloads/eng/image-processing", "gpu-workloads/data-science/modeling", 
"gpu-workloads/finance/forecasting", etc. What isn't achievable is a reservation 
to multiple roles across the tree, e.g. "eng/image-processing" OR 
"finance/forecasting" OR "data-science/modeling". This can get clumsy because 
if "eng/ML" wants to get in on the reserved gpus, the user would have to place 
a related role underneath the "gpu-workloads" role, e.g. "gpu-workloads/eng/ML".
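
The difference between the two cases can be sketched as follows. This is only an illustration of the matching rules described in this ticket, not Mesos code; both function names are hypothetical, and the multi-role variant models the proposed (unimplemented) feature.

```python
def can_allocate(reservation_role, framework_role):
    """True if framework_role is reservation_role itself or one of its descendants."""
    return (framework_role == reservation_role
            or framework_role.startswith(reservation_role + "/"))


def can_allocate_any(reservation_roles, framework_role):
    """Hypothetical multi-role reservation: any listed role (or a descendant) matches."""
    return any(can_allocate(role, framework_role) for role in reservation_roles)


# Hierarchical roles: only descendants of "gpu-workloads" can use the reservation.
print(can_allocate("gpu-workloads", "gpu-workloads/eng/image-processing"))  # True
print(can_allocate("gpu-workloads", "eng/ML"))                              # False

# Reservation to multiple roles across the tree (the feature proposed here).
gpu_roles = ["eng/image-processing", "finance/forecasting", "data-science/modeling"]
print(can_allocate_any(gpu_roles, "finance/forecasting"))  # True
print(can_allocate_any(gpu_roles, "eng/ML"))               # False
```

With a multi-role reservation, letting "eng/ML" in would only require adding it to the list rather than creating a parallel role under "gpu-workloads".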

A similar use case has been that some agents are "public" and there are 
disparate roles in the organization that need access to these hosts, so we want 
to ensure that only these roles get access and no other roles can tie up the 
resources on these hosts.





[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-05-26 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026712#comment-16026712
 ] 

Zhitao Li commented on MESOS-7566:
--

I suspect this is another manifestation of the root cause in MESOS-4553.

A couple of observations:


1. There is always a revocable resource decrease as well as an UNRESERVE 
operation before the crash;
2. DRFSorter somehow gets updated with the newer (and smaller) value in its 
total_ but is somehow still asked to remove an older value, so the code crashes;
3. The reason for 2 is possibly a race condition between the master and the 
hierarchical process queue (unfortunately, without a coredump or verbose 
logging, this is still pretty hard to diagnose further based on my knowledge of 
the codebase, as there are still multiple code paths leading to the crash).
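
To make observation 2 concrete, here is a toy model of the failure mode. It is an assumption-laden sketch and not the DRFSorter code: `ToySorter`, its methods, and the `cpus_revocable` key are hypothetical, and the containment check merely stands in for the check that fails in the real sorter.

```python
class ToySorter:
    """Toy stand-in for a sorter that tracks total resources per agent."""

    def __init__(self):
        self.total = {}  # agent id -> {resource name: amount}

    def add(self, agent, resources):
        current = self.total.setdefault(agent, {})
        for name, amount in resources.items():
            current[name] = current.get(name, 0) + amount

    def update(self, agent, resources):
        # Replace the tracked total with a newer value.
        self.total[agent] = dict(resources)

    def remove(self, agent, resources):
        current = self.total.get(agent, {})
        # Containment check: the total must cover what is being removed.
        for name, amount in resources.items():
            if current.get(name, 0) < amount:
                raise AssertionError(
                    f"total {current.get(name, 0)} of {name} does not contain {amount}")
        for name, amount in resources.items():
            current[name] -= amount


sorter = ToySorter()
sorter.add("agent1", {"cpus_revocable": 26})
sorter.update("agent1", {"cpus_revocable": 25})  # Total shrinks to the newer value.
try:
    sorter.remove("agent1", {"cpus_revocable": 26})  # Caller still holds the stale value.
except AssertionError as error:
    print("check failed:", error)
```

If the remover and the updater disagree about which value is current, the check fails regardless of which code path delivered the stale value.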

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems like the check was using a stale revocable CPU value 
> of {{26}} while the new value had been updated to 25, so the check failed.
> So far, the two verified occurrences of this bug were both observed near an 
> {{UNRESERVE}} operation (see lines above in the log).





[jira] [Commented] (MESOS-4210) Investigate increasing protobuf protocol message size limit.

2017-05-26 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026526#comment-16026526
 ] 

Anand Mazumdar commented on MESOS-4210:
---

This shouldn't be a concern anymore after the upgrade to proto3 as part of 
MESOS-7228 that supports message sizes up to 2GB. Marking it as resolved.
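
For reference, the arithmetic behind the two limits can be written out. The old limit is the 67108864-byte figure quoted in the ticket's error log; treating "2GB" as the signed 32-bit maximum is an assumption on my part, and the `fits` helper is hypothetical.

```python
# Limit quoted in the libprotobuf error message: 64 MiB.
OLD_LIMIT_BYTES = 64 * 1024 * 1024

# Assumption: "2GB" refers to the largest signed 32-bit value.
NEW_LIMIT_BYTES = 2**31 - 1


def fits(serialized_size, limit):
    """True if a serialized message of this size is within the limit."""
    return serialized_size <= limit


print(OLD_LIMIT_BYTES)                             # 67108864
print(fits(OLD_LIMIT_BYTES + 1, OLD_LIMIT_BYTES))  # False
print(fits(OLD_LIMIT_BYTES + 1, NEW_LIMIT_BYTES))  # True
```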

> Investigate increasing protobuf protocol message size limit.
> 
>
> Key: MESOS-4210
> URL: https://issues.apache.org/jira/browse/MESOS-4210
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
> Fix For: 1.4.0
>
>
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:171] A protocol message 
> was rejected because it was too big (more than 67108864 bytes). To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h. 
> F20151217 16:33:44.832834 4076 construct.cpp:48] Check failed: parsed 
> Unexpected failure while parsing protobuf
> Check failure stack trace: ***
> @ 0x2b9bab353b68 (unknown) 
> @ 0x2b9bab353ac4 (unknown) 
> @ 0x2b9bab3534ba (unknown) 
> @ 0x2b9bab356274 (unknown) 
> @ 0x2b9bab339d09 (unknown) 
> @ 0x2b9bab338917 (unknown) 
> @ 0x2b9bab33f404 (unknown) 
> @ 0x2b9b68350e18 (unknown) 
> {noformat}
> The error is presumably caused by a "user sending a very large command line".





[jira] [Commented] (MESOS-7017) HTTP API responses can crash the master.

2017-05-26 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026525#comment-16026525
 ] 

Anand Mazumdar commented on MESOS-7017:
---

This should be fixed via MESOS-7228. Marking it as resolved.

> HTTP API responses can crash the master.
> 
>
> Key: MESOS-7017
> URL: https://issues.apache.org/jira/browse/MESOS-7017
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: James Peach
>Priority: Critical
> Fix For: 1.4.0
>
>
> The master can crash when generating large responses to small API requests. 
> One manifestation of this is querying the tasks.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
> F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
> t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
> evolving from mesos.master.Response
> {noformat}





[jira] [Commented] (MESOS-7573) Fix /profiler endpoint to use perf

2017-05-26 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026429#comment-16026429
 ] 

Zhitao Li commented on MESOS-7573:
--

[~bmahler], do you have time to shepherd this? I need to get some perf 
endpoints working anyway so I'm willing to take this work.

> Fix /profiler endpoint to use perf
> --
>
> Key: MESOS-7573
> URL: https://issues.apache.org/jira/browse/MESOS-7573
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>
> Right now, the [ profiler | 
> http://mesos.apache.org/documentation/latest/endpoints/profiler/start/ ] 
> endpoints seem pretty broken (I can't even generate a working build from 
> master).
> Based on a Slack conversation with [~bmahler], that endpoint was added when [ 
> linux perf | https://perf.wiki.kernel.org/index.php/Main_Page ] was not 
> available yet in old CentOS. [~bmahler] suggests that we replace gperftools 
> with linux perf, and probably fix this endpoint to automatically generate 
> flame graphs.





[jira] [Created] (MESOS-7573) Fix /profiler endpoint to use perf

2017-05-26 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7573:


 Summary: Fix /profiler endpoint to use perf
 Key: MESOS-7573
 URL: https://issues.apache.org/jira/browse/MESOS-7573
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


Right now, the [ profiler | 
http://mesos.apache.org/documentation/latest/endpoints/profiler/start/ ] 
endpoints seem pretty broken (I can't even generate a working build from 
master).

Based on a Slack conversation with [~bmahler], that endpoint was added when [ 
linux perf | https://perf.wiki.kernel.org/index.php/Main_Page ] was not 
available yet in old CentOS. [~bmahler] suggests that we replace gperftools 
with linux perf, and probably fix this endpoint to automatically generate 
flame graphs.





[jira] [Created] (MESOS-7572) Follow symlinks in the various master/agent endpoints

2017-05-26 Thread Aaron Wood (JIRA)
Aaron Wood created MESOS-7572:
-

 Summary: Follow symlinks in the various master/agent endpoints
 Key: MESOS-7572
 URL: https://issues.apache.org/jira/browse/MESOS-7572
 Project: Mesos
  Issue Type: Improvement
  Components: agent, HTTP API, master
Reporter: Aaron Wood
Assignee: Aaron Wood


The main benefit of following symlinks in endpoints such as {code}/files{code} 
is that frameworks will be able to construct a path to the sandbox much more 
easily. This will help framework developers build features that need to provide 
a path when hitting various operator API endpoints. Currently, using a path 
ending in {code}runs/latest{code} returns a 404.
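
As a rough illustration of what following the symlink means, the sketch below resolves a path ending in runs/latest to the directory it points at. It is not how the Mesos endpoints are implemented; `resolve_sandbox_path`, the directory names, and the check that the result stays under the root are all assumptions made for the example.

```python
import os
import tempfile


def resolve_sandbox_path(root, relative_path):
    """Resolve symlinks in relative_path, refusing results outside root."""
    root = os.path.realpath(root)
    resolved = os.path.realpath(os.path.join(root, relative_path))
    # Extra precaution (an assumption, not from the ticket): stay under root.
    if os.path.commonpath([root, resolved]) != root:
        raise ValueError("path escapes the root directory")
    return resolved


with tempfile.TemporaryDirectory() as root:
    run_dir = os.path.join(root, "runs", "run-0001")  # Hypothetical run directory.
    os.makedirs(run_dir)
    os.symlink(run_dir, os.path.join(root, "runs", "latest"))

    resolved = resolve_sandbox_path(root, "runs/latest")
    # Prints the real run directory relative to the root.
    print(os.path.relpath(resolved, os.path.realpath(root)))
```

A caller would then only need to know the stable runs/latest name rather than the identifier of the current run.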

One such application could be a scheduler providing the ability for users to 
work with their task's sandbox directly without going to the Mesos UI, 
endpoints, or the actual system themselves.





[jira] [Assigned] (MESOS-6961) Executors don't use glog for logging.

2017-05-26 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-6961:


Assignee: Andrei Budnik

> Executors don't use glog for logging.
> -
>
> Key: MESOS-6961
> URL: https://issues.apache.org/jira/browse/MESOS-6961
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: log, mesosphere, newbie++
>
> Built-in Mesos executors use {{cout}}/{{cerr}} for logging. This is not only 
> inconsistent with the rest of the codebase, it also complicates debugging, 
> since, e.g., a stack trace is not printed on an abort. Having timestamps would 
> also be a huge plus.
> Consider migrating logging in all built-in executors to glog.
> There have been reported issues related to glog internal state races when a 
> process that has glog initialized {{fork-exec}}s another process that also 
> initializes glog. We should investigate how this issue is related to this 
> ticket, cc [~tillt], [~vinodkone], [~bmahler].


