[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Fix Version/s: 1.1.3 > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Fix Version/s: 1.1.3 > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3 > > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for more than 5 days > (default conntrack TCP timeout). At this point, the connection is timed out > and no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568).
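The retry idea described in MESOS-7569 can be illustrated with a small model. This is only a sketch under stated assumptions, not the agent's actual code: the names `plan_reconnects`, `simulate_old_executor`, `timeout`, and `retry_interval` are hypothetical, and the "old" executor is modeled as simply losing its half-open connection on the first reconnect message and linking correctly on the next one, as the ticket describes.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReconnectPlan:
    """Times (seconds after agent restart) at which a reconnect message is sent."""
    send_times: List[float]


def plan_reconnects(timeout: float, retry_interval: Optional[float]) -> ReconnectPlan:
    """Build the send schedule within the re-registration timeout.

    `retry_interval=None` models the behavior without retries: a single
    reconnect message is sent at time 0. Both parameters are hypothetical
    stand-ins for whatever the agent is configured with.
    """
    if retry_interval is None:
        return ReconnectPlan([0.0])
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    count = max(1, math.ceil(timeout / retry_interval))
    return ReconnectPlan([i * retry_interval for i in range(count)])


def simulate_old_executor(plan: ReconnectPlan) -> Optional[float]:
    """Return the send time at which an "old" executor links, or None.

    Assumption: the first reconnect message only causes the stale
    half-open connection to close; a later message is handled normally.
    """
    half_open = True
    for send_time in plan.send_times:
        if half_open:
            half_open = False  # stale socket discovered and closed
            continue
        return send_time
    return None


if __name__ == "__main__":
    print(simulate_old_executor(plan_reconnects(2.0, None)))
    print(simulate_old_executor(plan_reconnects(2.0, 0.5)))
```

In this model a single reconnect message never lets the old executor link, while any retry interval shorter than the timeout does, which is why the ticket proposes making retries optional rather than changing the timeout alone.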
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Fix Version/s: 1.2.2 > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for more than 5 days > (default conntrack TCP timeout). At this point, the connection is timed out > and no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568).
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Fix Version/s: 1.2.2 > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value.
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Fix Version/s: 1.3.1 > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.3.1, 1.4.0 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value.
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Fix Version/s: 1.3.1 > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 1.3.1, 1.4.0 > > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for more than 5 days > (default conntrack TCP timeout). At this point, the connection is timed out > and no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568).
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Summary: Add an agent flag for executor re-registration timeout. (was: Add an agent flag for executor re-register timeout) > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.4.0 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value.
[jira] [Created] (MESOS-7579) Deprecate GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`
Kevin Klues created MESOS-7579: -- Summary: Deprecate GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}` Key: MESOS-7579 URL: https://issues.apache.org/jira/browse/MESOS-7579 Project: Mesos Issue Type: Task Components: allocation, gpu Reporter: Kevin Klues Once we reach Mesos 2.0, we should completely remove the GPU_RESOURCES capability and the corresponding {{--filter-gpu-resources}} flag that controls whether the allocator honors this capability or not. It will have been deprecated once support for {{dynamic reservations}}, {{hierarchical roles}}, and {{support for reservations to multiple roles}} has landed. The JIRA tickets tracking these features as blockers to this ticket are linked below.
[jira] [Updated] (MESOS-7577) Remove GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`
[ https://issues.apache.org/jira/browse/MESOS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-7577: --- Target Version/s: 2.1.0 Description: Once we reach Mesos 2.0, we should completely remove the GPU_RESOURCES capability and the corresponding {{--filter-gpu-resources}} flag that controls whether the allocator honors this capability or not. It will have been deprecated once support for {{dynamic reservations}}, {{hierarchical roles}}, and {{support for reservations to multiple roles}} has landed. The JIRA tickets tracking these features as blockers to this ticket are linked below. was: This flag was added as a temporary way to enable / disable honoring the GPU_RESOURCES framework capability. We should remove it once we have better support for achieving the same functionality that the GPU_RESOURCES capability gives you. This support relies on dynamic reservations, hierarchical roles, and support for reservations to multiple roles (a not yet implemented feature). The JIRA tickets tracking these features as blockers to this ticket are linked below. > Remove GPU_RESOURCES capability and master flag > `--filter-gpu-resources={true|false}` > - > > Key: MESOS-7577 > URL: https://issues.apache.org/jira/browse/MESOS-7577 > Project: Mesos > Issue Type: Task > Components: allocation, gpu >Reporter: Kevin Klues > > Once we reach Mesos 2.0, we should completely remove the GPU_RESOURCES > capability and the corresponding {{--filter-gpu-resources}} flag that controls > whether the allocator honors this capability or not. > It will have been deprecated once support for {{dynamic reservations}}, > {{hierarchical roles}}, and {{support for reservations to multiple roles}} > has landed. The JIRA tickets tracking these features as blockers to this ticket are > linked below.
[jira] [Updated] (MESOS-7500) Command checks via agent lead to flaky tests.
[ https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-7500: -- Sprint: Mesosphere Sprint 56 (was: Mesosphere Sprint 56, Mesosphere Sprint 57) > Command checks via agent lead to flaky tests. > - > > Key: MESOS-7500 > URL: https://issues.apache.org/jira/browse/MESOS-7500 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Gastón Kleiman > Labels: check, flaky-test, health-check, mesosphere > > Tests that rely on command checks via agent are flaky on Apache CI. Here is > an example from one of the failed runs: https://pastebin.com/g2mPgYzu
[jira] [Updated] (MESOS-7577) Remove GPU_RESOURCES capability and remove master flag `--filter-gpu-resources={true|false}`
[ https://issues.apache.org/jira/browse/MESOS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-7577: --- Summary: Remove GPU_RESOURCES capability and remove master flag `--filter-gpu-resources={true|false}` (was: Remove master flag `--filter-gpu-resources={true|false}`) > Remove GPU_RESOURCES capability and remove master flag > `--filter-gpu-resources={true|false}` > > > Key: MESOS-7577 > URL: https://issues.apache.org/jira/browse/MESOS-7577 > Project: Mesos > Issue Type: Task > Components: allocation, gpu >Reporter: Kevin Klues > > This flag was added as a temporary way to enable / disable honoring the > GPU_RESOURCES framework capability. We should remove it once we have better > support for achieving the same functionality that the GPU_RESOURCES > capability gives you. > This support relies on dynamic reservations, hierarchical roles, and support > for reservations to multiple roles (a not yet implemented feature). The JIRA tickets > tracking these features as blockers to this ticket are linked below.
[jira] [Updated] (MESOS-7578) Write a proposal to make the I/O Switchboards optional
[ https://issues.apache.org/jira/browse/MESOS-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-7578: -- Labels: check containerizer health-check mesosphere (was: mesosphere) > Write a proposal to make the I/O Switchboards optional > -- > > Key: MESOS-7578 > URL: https://issues.apache.org/jira/browse/MESOS-7578 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman > Labels: check, containerizer, health-check, mesosphere > > Right now DEBUG containers can only be started using the > LaunchNestedContainerSession API call. They will enter their parent’s > namespaces, inherit environment variables, stream their I/O, and Mesos will tie > their life-cycle to the lifetime of the HTTP connection. > Streaming the I/O of a container requires an I/O Switchboard and adds some > overhead and complexity: > - Mesos will launch an extra process, called an I/O Switchboard, for each > nested container. These processes aren’t free; they take some time to > create/destroy and consume resources. > - I/O Switchboards are managed by a complex isolator. > - I/O Switchboards introduce new race conditions, and have been a source of > deadlocks in the past. > Some use cases require some of the features provided by DEBUG containers, but > don’t need the functionality provided by the I/O Switchboard. For instance, > the Default Executor uses DEBUG containers to perform (health)checks, but it > doesn’t need to stream anything to/from the container.
[jira] [Updated] (MESOS-7577) Remove GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`
[ https://issues.apache.org/jira/browse/MESOS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-7577: --- Summary: Remove GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}` (was: Remove GPU_RESOURCES capability and remove master flag `--filter-gpu-resources={true|false}`) > Remove GPU_RESOURCES capability and master flag > `--filter-gpu-resources={true|false}` > - > > Key: MESOS-7577 > URL: https://issues.apache.org/jira/browse/MESOS-7577 > Project: Mesos > Issue Type: Task > Components: allocation, gpu >Reporter: Kevin Klues > > This flag was added as a temporary way to enable / disable honoring the > GPU_RESOURCES framework capability. We should remove it once we have better > support for achieving the same functionality that the GPU_RESOURCES > capability gives you. > This support relies on dynamic reservations, hierarchical roles, and support > for reservations to multiple roles (a not yet implemented feature). The JIRA tickets > tracking these features as blockers to this ticket are linked below.
[jira] [Created] (MESOS-7578) Write a proposal to make the I/O Switchboards optional
Gastón Kleiman created MESOS-7578: - Summary: Write a proposal to make the I/O Switchboards optional Key: MESOS-7578 URL: https://issues.apache.org/jira/browse/MESOS-7578 Project: Mesos Issue Type: Task Components: containerization Reporter: Gastón Kleiman Assignee: Gastón Kleiman Right now DEBUG containers can only be started using the LaunchNestedContainerSession API call. They will enter their parent’s namespaces, inherit environment variables, stream their I/O, and Mesos will tie their life-cycle to the lifetime of the HTTP connection. Streaming the I/O of a container requires an I/O Switchboard and adds some overhead and complexity: - Mesos will launch an extra process, called an I/O Switchboard, for each nested container. These processes aren’t free; they take some time to create/destroy and consume resources. - I/O Switchboards are managed by a complex isolator. - I/O Switchboards introduce new race conditions, and have been a source of deadlocks in the past. Some use cases require some of the features provided by DEBUG containers, but don’t need the functionality provided by the I/O Switchboard. For instance, the Default Executor uses DEBUG containers to perform (health)checks, but it doesn’t need to stream anything to/from the container.
[jira] [Created] (MESOS-7577) Remove master flag `--filter-gpu-resources={true|false}`
Kevin Klues created MESOS-7577: -- Summary: Remove master flag `--filter-gpu-resources={true|false}` Key: MESOS-7577 URL: https://issues.apache.org/jira/browse/MESOS-7577 Project: Mesos Issue Type: Task Components: allocation, gpu Reporter: Kevin Klues This flag was added as a temporary way to enable / disable honoring the GPU_RESOURCES framework capability. We should remove it once we have better support for achieving the same functionality that the GPU_RESOURCES capability gives you. This support relies on dynamic reservations, hierarchical roles, and support for reservations to multiple roles (a not yet implemented feature). The JIRA tickets tracking these features as blockers to this ticket are linked below.
[jira] [Updated] (MESOS-7576) Add master flag `--filter-gpu-resources={true|false}`
[ https://issues.apache.org/jira/browse/MESOS-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-7576: --- Description: Per the email thread below, we are adding a new flag on the master called {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} framework capability. https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html When set to {{true}}, this flag will cause the mesos master to continue to function as it does today. That is, it will filter offers containing GPU resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will cause the master to *not* filter offers containing GPU resources, and indiscriminately send them to all frameworks whether they set the {{GPU_RESOURCES}} capability or not. This is a temporary flag that will eventually be removed. We will remove it once we have better support for achieving the same functionality that the {{GPU_RESOURCES}} capability gives you. As described in the email, this support relies on {{dynamic reservations}}, {{hierarchical roles}}, and support for {{reservations to multiple roles}} (an unyet implemented feature). The JIRA tracking these features are linked below. was: Per the email thread below, we are adding a new flag on the master called {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} framework capability. https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html When set to {{true}}, this flag will cause the mesos master to continue to function as it does today. That is, it will filter offers containing GPU resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will cause the master to *not* filter offers containing GPU resources, and indiscriminately send them to all frameworks whether they set the {{GPU_RESOURCES}} capability or not. 
This is a temporary flag that will eventually be removed. We will remove it once we have better support for achieving the same functionality that the {{GPU_RESOURCES}} capability gives you. As described in the email, this support relies {{dynamic reservations}}, {{hierarchical roles}}, and support for {{reservations to multiple roles}} (an unyet implemented feature). The JIRA tracking these features are linked below. > Add master flag `--filter-gpu-resources={true|false}` > - > > Key: MESOS-7576 > URL: https://issues.apache.org/jira/browse/MESOS-7576 > Project: Mesos > Issue Type: Task > Components: gpu >Affects Versions: 1.2.0 >Reporter: Kevin Klues >Assignee: Kevin Klues > > Per the email thread below, we are adding a new flag on the master called > {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} > framework capability. > https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html > When set to {{true}}, this flag will cause the mesos master to continue to > function as it does today. That is, it will filter offers containing GPU > resources and only send them to frameworks that opt into the > {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will > cause the master to *not* filter offers containing GPU resources, and > indiscriminately send them to all frameworks whether they set the > {{GPU_RESOURCES}} capability or not. > This is a temporary flag that will eventually be removed. We will remove it > once we have better support for achieving the same functionality that the > {{GPU_RESOURCES}} capability gives you. > As described in the email, this support relies on {{dynamic reservations}}, > {{hierarchical roles}}, and support for {{reservations to multiple roles}} > (an unyet implemented feature). The JIRA tracking these features are linked > below. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7576) Add master flag `--filter-gpu-resources={true|false}`
[ https://issues.apache.org/jira/browse/MESOS-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-7576: --- Description: Per the email thread below, we are adding a new flag on the master called {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} framework capability. https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html When set to {{true}}, this flag will cause the mesos master to continue to function as it does today. That is, it will filter offers containing GPU resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will cause the master to *not* filter offers containing GPU resources, and indiscriminately send them to all frameworks whether they set the {{GPU_RESOURCES}} capability or not. This is a temporary flag that will eventually be removed. We will remove it once we have better support for achieving the same functionality that the {{GPU_RESOURCES}} capability gives you. As described in the email, this support relies {{dynamic reservations}}, {{hierarchical roles}}, and support for {{reservations to multiple roles}} (an unyet implemented feature). The JIRA tracking these features are linked below. was: Per the email thread below, we are adding a new flag on the master called {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} framework capability. https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html When set to {{true}}, this flag will cause the mesos master to continue to function as it does today. That is, it will filter offers containing GPU resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will cause the master to *not* filter offers containing GPU resources, and indiscriminately send them to all frameworks whether they set the {{GPU_RESOURCES}} capability or not. 
This is a temporary flag that will eventually be removed. We will remove it once we have better support for achieving the same functionality that the {{GPU_RESOURCES}} capability gives you. As described in the email, this support relies {{reservations}}, {{hierarchical roles}}, and support for {{reservations to multiple roles}} (an unyet implemented feature). The JIRA tracking these features are linked below. > Add master flag `--filter-gpu-resources={true|false}` > - > > Key: MESOS-7576 > URL: https://issues.apache.org/jira/browse/MESOS-7576 > Project: Mesos > Issue Type: Task > Components: gpu >Affects Versions: 1.2.0 >Reporter: Kevin Klues >Assignee: Kevin Klues > > Per the email thread below, we are adding a new flag on the master called > {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} > framework capability. > https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html > When set to {{true}}, this flag will cause the mesos master to continue to > function as it does today. That is, it will filter offers containing GPU > resources and only send them to frameworks that opt into the > {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will > cause the master to *not* filter offers containing GPU resources, and > indiscriminately send them to all frameworks whether they set the > {{GPU_RESOURCES}} capability or not. > This is a temporary flag that will eventually be removed. We will remove it > once we have better support for achieving the same functionality that the > {{GPU_RESOURCES}} capability gives you. > As described in the email, this support relies {{dynamic reservations}}, > {{hierarchical roles}}, and support for {{reservations to multiple roles}} > (an unyet implemented feature). The JIRA tracking these features are linked > below. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7575) Support hierarchical reservations
[ https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026861#comment-16026861 ] Michael Park commented on MESOS-7575: - Design doc: https://docs.google.com/document/d/1Di6drHrBs3FWYJXKQjCTqQMi2PfdtCrf4OcuP3RZqmk/ > Support hierarchical reservations > - > > Key: MESOS-7575 > URL: https://issues.apache.org/jira/browse/MESOS-7575 > Project: Mesos > Issue Type: Task >Reporter: Michael Park > > With the introduction of hierarchical roles, Mesos provides a mechanism to > delegate resources down a hierarchy. To complement this, we need to introduce > a notion of hierarchical reservations so that we can *refine* the > reservations down the hierarchy.
[jira] [Created] (MESOS-7575) Support hierarchical reservations
Michael Park created MESOS-7575: --- Summary: Support hierarchical reservations Key: MESOS-7575 URL: https://issues.apache.org/jira/browse/MESOS-7575 Project: Mesos Issue Type: Bug Reporter: Michael Park With the introduction of hierarchical roles, Mesos provides a mechanism to delegate resources down a hierarchy. To complement this, we need to introduce a notion of hierarchical reservations so that we can *refine* the reservations down the hierarchy.
[jira] [Created] (MESOS-7576) Add master flag `--filter-gpu-resources={true|false}`
Kevin Klues created MESOS-7576: -- Summary: Add master flag `--filter-gpu-resources={true|false}` Key: MESOS-7576 URL: https://issues.apache.org/jira/browse/MESOS-7576 Project: Mesos Issue Type: Task Components: gpu Affects Versions: 1.2.0 Reporter: Kevin Klues Assignee: Kevin Klues Per the email thread below, we are adding a new flag on the master called {{--filter-gpu-resources}} to enable / disable honoring the {{GPU_RESOURCES}} framework capability. https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html When set to {{true}}, this flag will cause the Mesos master to continue to function as it does today. That is, it will filter offers containing GPU resources and only send them to frameworks that opt into the {{GPU_RESOURCES}} framework capability. When set to {{false}}, this flag will cause the master to *not* filter offers containing GPU resources, and indiscriminately send them to all frameworks whether they set the {{GPU_RESOURCES}} capability or not. This is a temporary flag that will eventually be removed. We will remove it once we have better support for achieving the same functionality that the {{GPU_RESOURCES}} capability gives you. As described in the email, this support relies on {{reservations}}, {{hierarchical roles}}, and support for {{reservations to multiple roles}} (a not yet implemented feature). The JIRA tickets tracking these features are linked below.
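The true/false semantics described for {{--filter-gpu-resources}} in MESOS-7576 can be sketched as a small filter. This is a hedged illustration only, not the master's or allocator's real code: the `Framework` and `Offer` types, the `eligible_frameworks` function, and the `"gpus"` resource key are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

GPU_RESOURCES = "GPU_RESOURCES"  # framework capability named in the ticket


@dataclass
class Framework:
    name: str
    capabilities: Set[str] = field(default_factory=set)


@dataclass
class Offer:
    agent: str
    resources: Dict[str, float]  # e.g. {"cpus": 4, "gpus": 2}; keys are illustrative


def eligible_frameworks(
    offer: Offer,
    frameworks: List[Framework],
    filter_gpu_resources: bool = True,
) -> List[Framework]:
    """Return the frameworks that may receive `offer`.

    With filtering on, offers containing GPUs go only to frameworks that
    opted into GPU_RESOURCES. With filtering off, every framework is
    eligible regardless of capability.
    """
    has_gpus = offer.resources.get("gpus", 0) > 0
    if not (filter_gpu_resources and has_gpus):
        return list(frameworks)
    return [f for f in frameworks if GPU_RESOURCES in f.capabilities]


if __name__ == "__main__":
    frameworks = [Framework("ml", {GPU_RESOURCES}), Framework("web")]
    gpu_offer = Offer("agent-1", {"cpus": 8, "gpus": 2})
    print([f.name for f in eligible_frameworks(gpu_offer, frameworks, True)])
    print([f.name for f in eligible_frameworks(gpu_offer, frameworks, False)])
```

Offers without GPUs are unaffected by the setting in this sketch, which matches the ticket's statement that only offers containing GPU resources are filtered.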
[jira] [Updated] (MESOS-7575) Support hierarchical reservations
[ https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-7575: Sprint: Mesosphere Sprint 57 Issue Type: Task (was: Bug) > Support hierarchical reservations > - > > Key: MESOS-7575 > URL: https://issues.apache.org/jira/browse/MESOS-7575 > Project: Mesos > Issue Type: Task >Reporter: Michael Park > > With the introduction of hierarchical roles, Mesos provides a mechanism to > delegate resources down a hierarchy. To complement this, we need to introduce > a notion of hierarchical reservations so that we can *refine* the > reservations down the hierarchy.
[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.
[ https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026856#comment-16026856 ] Gastón Kleiman commented on MESOS-7500: --- The failures seem to be related to the agent not being able to attach to the DEBUG container launched by the health checker. This is, however, not really necessary for checks, so I created a [design document|https://docs.google.com/document/d/1YCMtH8i2-ovTVtKDsCTrXdygS7ieaSrJLVnFbR66qfA/] with two proposals that'd make it possible to start DEBUG containers without an I/O switchboard. > Command checks via agent lead to flaky tests. > - > > Key: MESOS-7500 > URL: https://issues.apache.org/jira/browse/MESOS-7500 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Gastón Kleiman > Labels: check, flaky-test, health-check, mesosphere > > Tests that rely on command checks via agent are flaky on Apache CI. Here is > an example from one of the failed runs: https://pastebin.com/g2mPgYzu
[jira] [Created] (MESOS-7574) Allow reservations to multiple roles.
Benjamin Mahler created MESOS-7574: -- Summary: Allow reservations to multiple roles. Key: MESOS-7574 URL: https://issues.apache.org/jira/browse/MESOS-7574 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler There have been some discussions about allowing reservations to multiple roles (or more generally, role expressions). E.g. all resources on GPU agents are reserved for "eng/machine-learning" or "finance/forecasting" or "data-science/modeling" to use, because these are the roles in my organization that make use of GPUs, and I want to guarantee that none of the non-GPU workloads tie up the GPU machines' cpus/mem/disk. This GPU-related example would allow us to deprecate and remove the GPU_RESOURCES capability, which is a hack implementation of reservations to multiple roles: Mesos will only offer GPU machine resources to GPU-capable schedulers. Having the ability to make reservations to multiple roles obviates this hack. With hierarchical roles, we have a restricted version of reservations to multiple roles, where the roles are restricted to the descendant roles. For example, a reservation for "gpu-workloads" can be allocated to "gpu-workloads/eng/image-processing", "gpu-workloads/data-science/modeling", "gpu-workloads/finance/forecasting", etc. What isn't achievable is a reservation to multiple roles across the tree, e.g. "eng/image-processing" OR "finance/forecasting" OR "data-science/modeling". This can get clumsy because if "eng/ML" wants to get in on the reserved gpus, the user would have to place a related role underneath the "gpu-workloads" role, e.g. "gpu-workloads/eng/ML". A similar use case has been that some agents are "public" and there are disparate roles in the organization that need access to these hosts, so we want to ensure that only these roles get access and no other roles can tie up the resources on these hosts. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
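The difference between the hierarchical restriction and the proposed multi-role reservation in MESOS-7574 can be sketched in Python as follows. Both function names are hypothetical, roles are treated as plain "/"-separated strings, and the assumption that a multi-role reservation also covers descendants of each listed role is mine, not something the issue states.

```python
def allocatable_hierarchical(reserved_role, role):
    """With hierarchical roles, a reservation can be allocated to the
    reserved role itself or to any of its descendant roles."""
    return role == reserved_role or role.startswith(reserved_role + "/")


def allocatable_multi(reserved_roles, role):
    """Hypothetical reservation to multiple roles: allocatable if the role
    matches any listed role (descendants included, by assumption)."""
    return any(allocatable_hierarchical(r, role) for r in reserved_roles)
```

Under this sketch, a reservation for "gpu-workloads" reaches "gpu-workloads/eng/image-processing" but not "eng/image-processing", whereas a reservation to the set {"eng/image-processing", "finance/forecasting", "data-science/modeling"} reaches roles in different subtrees without requiring a common ancestor.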
[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove
[ https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026712#comment-16026712 ] Zhitao Li commented on MESOS-7566: -- I suspect this is another manifestation of the root cause in MESOS-4553. A couple of observations: 1. there is always a revocable resource decrease as well as an UNRESERVE operation before the crash; 2. DRFSorter somehow gets updated with the newer (and smaller) value in its total_ but is somehow still asked to remove an older value, so the check fails and the master crashes; 3. the reason for 2 is possibly a race condition between master and hierarchical process queue (unfortunately, without a coredump or verbose logging, this is still pretty hard to diagnose further based on my knowledge of the codebase, as there are still multiple code paths leading to the crash) > Master crash due to failed check in DRFSorter::remove > - > > Key: MESOS-7566 > URL: https://issues.apache.org/jira/browse/MESOS-7566 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.1, 1.1.2 >Reporter: Zhitao Li >Priority: Critical > > A check in [sorter.cpp#L355 in 1.1.2 | > https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355] > is triggered occasionally in our cluster and crashes the master leader. > I manually modified that check to print out the related variables, and the > following is a master log. > https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt > From the log, it seems like the check was using a stale revocable CPU value of > {{26}} while the new value was updated to 25, so the check failed. > So far, the two verified occurrences of this bug were both observed near an > {{UNRESERVE}} operation (see lines above in the log). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
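The failure mode suspected in MESOS-7566 (totals already updated to a smaller value, followed by a removal that still uses the older, larger value) can be reproduced with a toy Python model. This is not the DRFSorter implementation; `ResourceTotals`, `replay_stale_removal`, and the resource name string are hypothetical, and only the values 26 and 25 come from the issue.

```python
class ResourceTotals:
    """Toy stand-in for a sorter's per-resource totals (not Mesos code)."""

    def __init__(self):
        self.total = {}

    def add(self, name, amount):
        self.total[name] = self.total.get(name, 0) + amount

    def remove(self, name, amount):
        # Analogue of the failed check: never remove more than is tracked.
        current = self.total.get(name, 0)
        if current < amount:
            raise AssertionError(
                f"cannot remove {amount} of {name}; total is {current}")
        self.total[name] = current - amount

    def update(self, name, old, new):
        # Replace a previously added amount with a new amount.
        self.remove(name, old)
        self.add(name, new)


def replay_stale_removal():
    """Replay the suspected ordering; return the check message, or None."""
    totals = ResourceTotals()
    totals.add("cpus(revocable)", 26)
    totals.update("cpus(revocable)", 26, 25)  # total is now the newer value
    try:
        totals.remove("cpus(revocable)", 26)  # caller still holds stale value
    except AssertionError as error:
        return str(error)
    return None
```

In this model the final removal trips the check because the caller's view of the resources is older than the totals. Whether the real crash follows this exact ordering is, as the comment above notes, still unconfirmed.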
[jira] [Commented] (MESOS-4210) Investigate increasing protobuf protocol message size limit.
[ https://issues.apache.org/jira/browse/MESOS-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026526#comment-16026526 ] Anand Mazumdar commented on MESOS-4210: --- This shouldn't be a concern anymore after the upgrade to proto3 as part of MESOS-7228 that supports message sizes up to 2GB. Marking it as resolved. > Investigate increasing protobuf protocol message size limit. > > > Key: MESOS-4210 > URL: https://issues.apache.org/jira/browse/MESOS-4210 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan > Fix For: 1.4.0 > > > {noformat} > [libprotobuf ERROR google/protobuf/io/coded_stream.cc:171] A protocol message > was rejected because it was too big (more than 67108864 bytes). To increase > the limit (or to disable these warnings), see > CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h. > F20151217 16:33:44.832834 4076 construct.cpp:48] Check failed: parsed > Unexpected failure while parsing protobuf > Check failure stack trace: *** > @ 0x2b9bab353b68 (unknown) > @ 0x2b9bab353ac4 (unknown) > @ 0x2b9bab3534ba (unknown) > @ 0x2b9bab356274 (unknown) > @ 0x2b9bab339d09 (unknown) > @ 0x2b9bab338917 (unknown) > @ 0x2b9bab33f404 (unknown) > @ 0x2b9b68350e18 (unknown) > {noformat} > The error is presumably caused by a "user sending a very large command line". -- This message was sent by Atlassian JIRA (v6.3.15#6346)
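For reference, the 67108864-byte limit in the log above is exactly 64 MiB. The Python sketch below only illustrates that arithmetic and a size pre-check; it does not call protobuf, and the function name and the `2**31 - 1` ceiling (my reading of "up to 2GB") are assumptions.

```python
# 64 MiB: the limit reported in the libprotobuf error message.
DEFAULT_LIMIT_BYTES = 64 * 1024 * 1024

# Assumed ceiling corresponding to "message sizes up to 2GB".
MAX_LIMIT_BYTES = 2**31 - 1


def exceeds_limit(payload, limit=DEFAULT_LIMIT_BYTES):
    """Return True if a serialized payload is larger than the limit,
    mirroring the "more than N bytes" wording of the error."""
    return len(payload) > limit
```

A sender could use a check like this to reject or split an oversized message before a receiver aborts while parsing it.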
[jira] [Commented] (MESOS-7017) HTTP API responses can crash the master.
[ https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026525#comment-16026525 ] Anand Mazumdar commented on MESOS-7017: --- This should be fixed via MESOS-7228. Marking it as resolved. > HTTP API responses can crash the master. > > > Key: MESOS-7017 > URL: https://issues.apache.org/jira/browse/MESOS-7017 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: James Peach >Priority: Critical > Fix For: 1.4.0 > > > The master can crash when generating large responses to small API requests. > One manifestation of this is querying the tasks. > {noformat} > [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message > was rejected because it was too big (more than 67108864 bytes). To increase > the limit (or to disable these warnings), see > CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h. > F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: > t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while > evolving from mesos.master.Response > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7573) Fix /profiler endpoint to use perf
[ https://issues.apache.org/jira/browse/MESOS-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026429#comment-16026429 ] Zhitao Li commented on MESOS-7573: -- [~bmahler], do you have time to shepherd this? I need to get some perf endpoints working anyway so I'm willing to take this work. > Fix /profiler endpoint to use perf > -- > > Key: MESOS-7573 > URL: https://issues.apache.org/jira/browse/MESOS-7573 > Project: Mesos > Issue Type: Bug >Reporter: Zhitao Li > > Right now, the [ profiler | > http://mesos.apache.org/documentation/latest/endpoints/profiler/start/ ] > endpoints seem pretty broken (I can't even generate a working build from > master). > Based on a Slack conversation with [~bmahler], that endpoint was added when [ > linux perf | https://perf.wiki.kernel.org/index.php/Main_Page ] was not > available yet in old CentOS. [~bmahler] suggests that we replace gperftools > with linux perf, and probably fix this endpoint to automatically generate > flamegraphs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7573) Fix /profiler endpoint to use perf
Zhitao Li created MESOS-7573: Summary: Fix /profiler endpoint to use perf Key: MESOS-7573 URL: https://issues.apache.org/jira/browse/MESOS-7573 Project: Mesos Issue Type: Bug Reporter: Zhitao Li Right now, the [ profiler | http://mesos.apache.org/documentation/latest/endpoints/profiler/start/ ] endpoints seem pretty broken (I can't even generate a working build from master). Based on a Slack conversation with [~bmahler], that endpoint was added when [ linux perf | https://perf.wiki.kernel.org/index.php/Main_Page ] was not available yet in old CentOS. [~bmahler] suggests that we replace gperftools with linux perf, and probably fix this endpoint to automatically generate flamegraphs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7572) Follow symlinks in the various master/agent endpoints
Aaron Wood created MESOS-7572: - Summary: Follow symlinks in the various master/agent endpoints Key: MESOS-7572 URL: https://issues.apache.org/jira/browse/MESOS-7572 Project: Mesos Issue Type: Improvement Components: agent, HTTP API, master Reporter: Aaron Wood Assignee: Aaron Wood The main benefit of following symlinks in endpoints such as {code}/files{code} is that frameworks will be able to construct a path to the sandbox much more easily. This will assist framework developers in building features that need to provide a path when hitting various operator API endpoints. Currently, a request for a path ending in {code}runs/latest{code} returns a 404. One such application could be a scheduler that lets users work with their task's sandbox directly, without going to the Mesos UI, the endpoints, or the actual system themselves. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
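One way an endpoint could follow a symlink such as {code}runs/latest{code} while still refusing paths that escape the served directory is sketched below in Python. This is an illustration only, not how the Mesos {code}/files{code} endpoint is implemented; `resolve_under_root` and its containment rule are assumptions.

```python
import os


def resolve_under_root(root, requested):
    """Resolve `requested` (relative to `root`) through any symlinks and
    return the real path, refusing results that land outside `root`."""
    root_real = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_real, requested))
    # Containment check: the resolved path must stay under the real root.
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise PermissionError(f"{requested!r} resolves outside {root!r}")
    return candidate
```

With a layout where {code}runs/latest{code} is a symlink to a concrete run directory, a request for {code}runs/latest{code} resolves to that directory instead of failing, while a symlink pointing outside the root is rejected.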
[jira] [Assigned] (MESOS-6961) Executors don't use glog for logging.
[ https://issues.apache.org/jira/browse/MESOS-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-6961: Assignee: Andrei Budnik > Executors don't use glog for logging. > - > > Key: MESOS-6961 > URL: https://issues.apache.org/jira/browse/MESOS-6961 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik > Labels: log, mesosphere, newbie++ > > Built-in Mesos executors use {{cout}}/{{cerr}} for logging. This is not only > inconsistent with the rest of the codebase, it also complicates debugging, > since, e.g., a stack trace is not printed on an abort. Having timestamps would > also be a huge plus. > Consider migrating logging in all built-in executors to glog. > There have been reported issues related to glog internal state races when a > process that has glog initialized {{fork-exec}}s another process that also > initializes glog. We should investigate how this issue is related to this > ticket, cc [~tillt], [~vinodkone], [~bmahler]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)