Hi all, can anyone give me some hints so I can quickly pin down the root cause? I'd like to work on it.
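
For reference, here is a rough sketch of the backup/restore workaround mentioned in the quoted thread, in case it helps anyone reproduce it. It is only an illustration under assumptions: I am assuming the file in question is `$work_dir/meta/boot_id`, and the function names and the backup directory are made up for this example, not anything Mesos provides. The agent should be stopped while restoring.

```python
import shutil
from pathlib import Path

# Assumption: the checkpointed boot id lives at <work_dir>/meta/boot_id.
BOOT_ID_NAME = "boot_id"


def boot_id_path(work_dir):
    """Return the assumed location of the checkpointed boot id file."""
    return Path(work_dir) / "meta" / BOOT_ID_NAME


def backup_boot_id(work_dir, backup_dir):
    """Copy the boot id file to backup_dir (run this after the slave has started).

    Returns the path of the backup copy, or None if there is nothing to back up.
    """
    src = boot_id_path(work_dir)
    if not src.is_file():
        return None
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / BOOT_ID_NAME
    shutil.copy2(src, dest)
    return dest


def restore_boot_id(work_dir, backup_dir):
    """Put the backed-up boot id file back (run this before restarting the slave).

    Returns True if a backup was found and restored, False otherwise.
    """
    backup = Path(backup_dir) / BOOT_ID_NAME
    if not backup.is_file():
        return False
    dest = boot_id_path(work_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(backup, dest)
    return True
```

The idea would be to call `backup_boot_id(work_dir, backup_dir)` once the slave is up, and `restore_boot_id(work_dir, backup_dir)` before starting it again. I have not verified which side effects this has on recovery, so please treat it as a starting point for investigation only.
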

2016-11-22 20:01 GMT+08:00 tommy xiao <[email protected]>:

> Interesting case. +1
>
> 2016-11-22 12:27 GMT+08:00 X Brick <[email protected]>:
>
>> Found some issues in the JIRA:
>>
>> - MESOS-5368 <https://issues.apache.org/jira/browse/MESOS-5368>
>> - MESOS-6223 <https://issues.apache.org/jira/browse/MESOS-6223>
>> - MESOS-3545 <https://issues.apache.org/jira/browse/MESOS-3545>
>>
>> Not quite sure the boot_id would be fixed in the next release, but you
>> could back up the boot_file file (in your $work_dir/meta/) after the
>> slave starts, and restore it from the backup when restarting. It works
>> well for our cluster with persistent volumes.
>>
>> 2016-11-09 0:43 GMT+08:00 Hendrik Haddorp <[email protected]>:
>>
>>> I have a framework that starts multiple docker containers. The
>>> configuration (hosts and ports) of my setup needs to stay constant. So in a
>>> first step my framework is claiming resources on the slaves. Once all
>>> required resources are acquired I start the containers using the docker
>>> containerizer. When one fails I restart it on the same slave with the same
>>> config. So far I'm tracking the Mesos slave ID and would only restart the
>>> task if I get an offer for that slave again. As the ID changes now I'm not
>>> restarting the task anymore.
>>>
>>> My assumption was that the slave ID would stay constant so that I could
>>> for example change the host name and would still recognize the instance or
>>> start multiple slaves on the same server and easily distinguish them. If
>>> the slave ID changes I would have expected that all resources connected to
>>> that would be lost but that doesn't seem to be the case, which is good in
>>> my case, but rather odd in my opinion.
>>>
>>> On 08.11.2016 17:26, Vinod Kone wrote:
>>>
>>>> @Hendrik: When maintenance APIs are used, the typical expectation is
>>>> that the tasks on the machine are stopped (and rescheduled elsewhere in the
>>>> cluster). That is the reason that the agent gets a new ID.
>>>> What is the exact problem you are facing?
>>>>
>>>> @Justin: This is a known issue that is actively being worked on.
>>>> https://issues.apache.org/jira/browse/MESOS-5396
>>>>
>>>>
>>>> On Tue, Nov 8, 2016 at 8:12 AM, Hendrik Haddorp <[email protected]> wrote:
>>>>
>>>>     Interesting, in one case we also had a reboot but not in the
>>>>     simple restart with a pause test. Losing the ID on restart sounds
>>>>     odd to me. Do you have some further details on that?
>>>>
>>>>     On 08.11.2016 17:08, Justin Pinkul wrote:
>>>>
>>>>
>>>>         Hello,
>>>>
>>>>
>>>>         I also hit a very similar problem recently, perhaps it is
>>>>         related. There is special logic inside of the Mesos agent that
>>>>         checks if the machine has rebooted; if it has rebooted it will
>>>>         short circuit the recovery and register with a new agent ID.
>>>>         This is especially problematic with the new
>>>>         --agent_removal_rate_limit and --recovery_agent_removal_limit
>>>>         flags. We hit a power outage, which caused this to happen
>>>>         on every machine in our lab at once. Since every agent had a
>>>>         new ID, 50% of the IDs were considered lost and these
>>>>         safeguards caused our master to kill itself every 15 minutes even
>>>>         after all of the agents were back up and running. Is there any
>>>>         advantage to throwing out the agent ID when rebooting?
>>>>
>>>>
>>>>         Thanks,
>>>>
>>>>         Justin
>>>>
>>>>
>>>>         ------------------------------------------------------------------------
>>>>         *From:* Hendrik Haddorp <[email protected]>
>>>>         *Sent:* Tuesday, November 8, 2016 12:59 PM
>>>>         *To:* user
>>>>         *Subject:* Slave gets new ID
>>>>         Hi,
>>>>
>>>>         when we take slaves down for maintenance, as described in
>>>>         http://mesos.apache.org/documentation/latest/maintenance/,
>>>>         the slave gets a new ID on start up. Why is that and can it be
>>>>         changed? We are using Mesos 0.28.2. I'm so far only aware of the
>>>>         slave_reregister_timeout. Our restart was within that time
>>>>         frame. When we restart a slave it keeps its ID. However when we
>>>>         wait a few minutes, less than the reregistration timeout, before
>>>>         we restart the slave the ID also changes.
>>>>
>>>>         regards,
>>>>         Hendrik
>>>>
>>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com

