Found some issues in the JIRA:
- MESOS-5368 <https://issues.apache.org/jira/browse/MESOS-5368>
- MESOS-6223 <https://issues.apache.org/jira/browse/MESOS-6223>
- MESOS-3545 <https://issues.apache.org/jira/browse/MESOS-3545>

Not sure whether the boot_id issue will be fixed in the next release, but you
could back up the boot_id file (in your $work_dir/meta/) after the slave
starts and restore it from the backup when restarting. This works well for
our cluster with persistent volumes.

2016-11-09 0:43 GMT+08:00 Hendrik Haddorp <[email protected]>:

> I have a framework that starts multiple docker containers. The
> configuration (hosts and ports) of my setup needs to stay constant. So in a
> first step my framework claims resources on the slaves. Once all
> required resources are acquired I start the containers using the docker
> containerizer. When one fails I restart it on the same slave with the same
> config. So far I'm tracking the Mesos slave ID and only restart the
> task if I get an offer for that slave again. As the ID changes now, I'm no
> longer restarting the task.
>
> My assumption was that the slave ID would stay constant so that I could,
> for example, change the host name and still recognize the instance, or
> start multiple slaves on the same server and easily distinguish them. If
> the slave ID changes I would have expected all resources connected to
> it to be lost, but that doesn't seem to be the case, which is good in
> my case but rather odd in my opinion.
>
> On 08.11.2016 17:26, Vinod Kone wrote:
>
>> @Hendrik: When maintenance APIs are used, the typical expectation is that
>> the tasks on the machine are stopped (and rescheduled elsewhere in the
>> cluster). That is the reason that the agent gets a new ID. What is the
>> exact problem you are facing?
>>
>> @Justin: This is a known issue that is actively being worked on.
>> https://issues.apache.org/jira/browse/MESOS-5396
>>
>>
>> On Tue, Nov 8, 2016 at 8:12 AM, Hendrik Haddorp <[email protected]> wrote:
>>
>>     Interesting, in one case we also had a reboot but not in the
>>     simple restart with a pause test. Losing the ID on restart sounds
>>     odd to me.
>>     Do you have some further details on that?
>>
>>     On 08.11.2016 17:08, Justin Pinkul wrote:
>>
>>
>>         Hello,
>>
>>
>>         I also hit a very similar problem recently; perhaps it is
>>         related. There is special logic inside of the Mesos agent that
>>         checks if the machine has rebooted; if it has rebooted it will
>>         short circuit the recovery and register with a new agent ID.
>>         This is especially problematic with the new
>>         --agent_removal_rate_limit and --recovery_agent_removal_limit
>>         flags. We hit a power outage, which caused this to happen
>>         on every machine in our lab at once. Since every agent had a
>>         new ID, 50% of the IDs were considered lost, and these
>>         safeguards caused our master to kill itself every 15 minutes even
>>         after all of the agents were back up and running. Is there any
>>         advantage to throwing out the agent ID when rebooting?
>>
>>
>>         Thanks,
>>
>>         Justin
>>
>>
>>         ------------------------------------------------------------------------
>>         *From:* Hendrik Haddorp <[email protected]>
>>         *Sent:* Tuesday, November 8, 2016 12:59 PM
>>         *To:* user
>>         *Subject:* Slave gets new ID
>>         Hi,
>>
>>         when we take slaves down for maintenance, as described in
>>         http://mesos.apache.org/documentation/latest/maintenance/,
>>         the slave gets a new ID on start up. Why is that and can it be
>>         changed? We are using Mesos 0.28.2.
>>         I'm so far only aware of the slave_reregister_timeout. Our
>>         restart was within that time frame. When we restart a slave it
>>         keeps its ID. However, when we wait a few minutes, less than the
>>         reregistration timeout, before we restart the slave, the ID
>>         also changes.
>>
>>         regards,
>>         Hendrik
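
P.S. A minimal sketch of the boot_id backup/restore workaround from the top of
this mail, in Python. It assumes the agent keeps the file at
$work_dir/meta/boot_id; the helper names and the backup location are made up
for illustration, so adjust them to your setup. This is only the file copy,
not a statement about how the agent itself treats the file.

```python
import shutil
from pathlib import Path


def boot_id_path(work_dir):
    # Assumption: the boot ID lives in <work_dir>/meta/boot_id.
    return Path(work_dir) / "meta" / "boot_id"


def backup_boot_id(work_dir, backup_file):
    """Copy the boot_id file to backup_file (run after the slave has started)."""
    shutil.copy2(boot_id_path(work_dir), backup_file)


def restore_boot_id(work_dir, backup_file):
    """Copy backup_file back over boot_id (run before restarting the slave)."""
    shutil.copy2(backup_file, boot_id_path(work_dir))
```

We call the backup once after the slave is up and the restore right before
starting the slave again; the backup file should live outside anything that
gets wiped during maintenance.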

