[jira] [Updated] (MESOS-7882) Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves
[ https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-7882:
--
Sprint: Mesosphere Sprint 73, Mesosphere Sprint 74 (was: Mesosphere Sprint 73)

> Mesos master rescinds all the in-flight offers from all the registered agents
> when a new maintenance schedule is posted for a subset of slaves
> --
>
> Key: MESOS-7882
> URL: https://issues.apache.org/jira/browse/MESOS-7882
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.0
> Environment: Ubuntu 14.04 (trusty)
>              Mesos master branch.
>              SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
> Reporter: Sagar Sadashiv Patwardhan
> Assignee: Joseph Wu
> Priority: Minor
> Labels: maintenance, mesosphere
>
> We are running Mesos 1.1.0 in production. We use a custom autoscaler to
> scale our Mesos cluster up and down. While scaling down the cluster, the
> autoscaler makes a POST request to the Mesos master's /maintenance/schedule
> endpoint with a set of slaves to move to maintenance mode. This forces the
> Mesos master to rescind all the in-flight offers from *all the slaves* in the
> cluster. If our scheduler accepts one of these offers, we get a TASK_LOST
> status update back for that task. We also see log lines like these
> (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the
> Mesos master logs.
>
> After reading the code (ref:
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it
> appears that offers are rescinded for all the slaves. I am not sure what the
> expected behavior is here, but it would make more sense to reclaim resources
> only from the slaves marked for maintenance.
>
> *Experiment:*
> To verify that this is actually happening, I checked out the master branch
> (SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines
> (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built
> the binary and started a Mesos master and 2 agent processes. I used a basic
> Python framework that launches Docker containers on these slaves. I verified
> that there was no existing schedule for any slave using
> `curl 10.40.19.239:5050/maintenance/status`. I posted a maintenance schedule
> for one of the slaves
> (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) after
> starting the Mesos framework.
>
> *Logs:*
> mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
> mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
> mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
> Mesos framework: https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a
>
> I think Mesos should rescind offers and inverse offers only for those slaves
> that are marked for maintenance (draining mode).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
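The behavior requested in the description (rescind offers only for the slaves named in the posted schedule) can be illustrated with a small Python sketch. This is not the Mesos master code, which is C++ in src/master/master.cpp. The `Offer` class, the `offers_to_rescind` function, and the agent hostnames are all hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Offer:
    """Hypothetical stand-in for an outstanding offer held by the master."""
    offer_id: str
    agent_hostname: str


def offers_to_rescind(outstanding: Iterable[Offer],
                      scheduled_hostnames: Iterable[str]) -> List[Offer]:
    """Select only the offers whose agent appears in the posted schedule.

    The behavior reported in the issue corresponds to returning every
    outstanding offer regardless of which agents were scheduled.
    """
    scheduled = set(scheduled_hostnames)
    return [o for o in outstanding if o.agent_hostname in scheduled]


# Two agents, as in the reporter's experiment; only agent-1 is scheduled.
offers = [
    Offer("o1", "agent-1"),
    Offer("o2", "agent-2"),
    Offer("o3", "agent-1"),
]
rescinded = offers_to_rescind(offers, ["agent-1"])
print([o.offer_id for o in rescinded])  # → ['o1', 'o3']
```

With this filter the offer from agent-2 stays in flight, so a scheduler accepting it would presumably not receive TASK_LOST for that task.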
[jira] [Updated] (MESOS-7882) Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves
[ https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Wu updated MESOS-7882:
--
Shepherd: Benjamin Mahler
Story Points: 3
Labels: maintenance mesosphere (was: )
Sprint: Mesosphere Sprint 73
Target Version/s: 1.6.0
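For readers following the thread, the request body sent to the master's /maintenance/schedule endpoint can be sketched in Python as below. The JSON layout (`windows`, `machine_ids`, `unavailability`) follows the Mesos maintenance documentation as I understand it. The reporter's actual payload is in the gist linked from the issue and is not reproduced here. The `build_schedule` helper, the hostname, the IP, and the timestamps are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def build_schedule(machines: List[Tuple[str, str]],
                   start_ns: int,
                   duration_ns: int) -> Dict:
    """Build a one-window maintenance schedule for the given machines.

    machines is a list of (hostname, ip) pairs; times are in nanoseconds.
    """
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": hostname, "ip": ip}
                    for hostname, ip in machines
                ],
                "unavailability": {
                    "start": {"nanoseconds": start_ns},
                    "duration": {"nanoseconds": duration_ns},
                },
            }
        ]
    }


# Hypothetical values: one agent, a one-hour window.
payload = build_schedule(
    [("agent-1.example.com", "10.0.0.1")],
    start_ns=1_500_000_000 * 10**9,
    duration_ns=3600 * 10**9,
)
body = json.dumps(payload)
# body would then be sent as the POST data to <master>:5050/maintenance/schedule.
```

Only the machines listed under `machine_ids` are part of the schedule, which is why the reporter expected offers from the other agent to be left alone.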
[jira] [Updated] (MESOS-7882) Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves
[ https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sadashiv Patwardhan updated MESOS-7882:
--
Description:
We are running Mesos 1.1.0 in production. We use a custom autoscaler to scale
our Mesos cluster up and down. While scaling down the cluster, the autoscaler
makes a POST request to the Mesos master's /maintenance/schedule endpoint with
a set of slaves to move to maintenance mode. This forces the Mesos master to
rescind all the in-flight offers from *all the slaves* in the cluster. If our
scheduler accepts one of these offers, we get a TASK_LOST status update back
for that task. We also see log lines like these
(https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the
Mesos master logs.

After reading the code (ref:
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it
appears that offers are rescinded for all the slaves. I am not sure what the
expected behavior is here, but it would make more sense to reclaim resources
only from the slaves marked for maintenance.

*Experiment:*
To verify that this is actually happening, I checked out the master branch
(SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines
(https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built
the binary and started a Mesos master and 2 agent processes. I used a basic
Python framework that launches Docker containers on these slaves. I verified
that there was no existing schedule for any slave using
`curl 10.40.19.239:5050/maintenance/status`. I posted a maintenance schedule
for one of the slaves
(https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) after
starting the Mesos framework.

*Logs:*
mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
Mesos framework: https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a

I think Mesos should rescind offers and inverse offers only for those slaves
that are marked for maintenance (draining mode).

was:
We are running Mesos 1.1.0 in production. We use a custom autoscaler to scale
our Mesos cluster up and down. While scaling down the cluster, the autoscaler
makes a POST request to the Mesos master's /maintenance/schedule endpoint with
a set of slaves to move to maintenance mode. This forces the Mesos master to
rescind all the in-flight offers from *all the slaves* in the cluster. If our
scheduler accepts one of these offers, we get a TASK_LOST status update back
for that task. We also see log lines like these
(https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the
Mesos master logs.

After reading the code (ref:
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it
appears that offers are rescinded for all the slaves. I am not sure what the
expected behavior is here, but it would make more sense to reclaim resources
only from the slaves marked for maintenance.

Experiment:
To verify that this is actually happening, I checked out the master branch
(SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines
(https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built
the binary and started a Mesos master and 2 agent processes. I used a basic
Python framework that launches Docker containers on these slaves. I verified
that there was no existing schedule for any slave using
`curl 10.40.19.239:5050/maintenance/status`. I posted a maintenance schedule
for one of the slaves
(https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) after
starting the Mesos framework.

Logs:
mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
Mesos framework: https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a

I think Mesos should rescind offers and inverse offers only for those slaves
that are marked for maintenance (draining mode).

> Mesos master rescinds all the in-flight offers from all the registered agents
> when a new maintenance schedule is posted for a subset of slaves
> --
>
> Key: MESOS-7882
> URL: https://issues.apache.org/jira/browse/MESOS-7882
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.0
> Environment: Ubuntu 14.04 (trusty)
>              Mesos master branch.
>              SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
> Reporter: Sagar Sadashiv Patwardhan