@Hendrik: When maintenance APIs are used, the typical expectation is that
the tasks on the machine are stopped (and rescheduled elsewhere in the
cluster). That is the reason that the agent gets a new ID. What is the
exact problem you are facing?

@Justin: This is a known issue that is actively being worked on.
https://issues.apache.org/jira/browse/MESOS-5396
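
For reference, the maintenance flow mentioned above could be sketched roughly as follows. This is only an illustration of the request bodies described in the maintenance documentation for the master's /maintenance/schedule and /machine/down endpoints; the hostname, IP address, and timestamps are made-up placeholders, and the helper function names are hypothetical, not part of Mesos.

```python
import json


def machine_id(hostname, ip):
    # A machine is identified by hostname and IP (placeholder values below).
    return {"hostname": hostname, "ip": ip}


def maintenance_schedule(machines, start_ns, duration_ns):
    # One maintenance window covering the given machines.
    # Times are expressed in nanoseconds, as in the maintenance docs.
    return {
        "windows": [
            {
                "machine_ids": machines,
                "unavailability": {
                    "start": {"nanoseconds": start_ns},
                    "duration": {"nanoseconds": duration_ns},
                },
            }
        ]
    }


# Placeholder machine and a one-hour window (assumed values).
machines = [machine_id("agent1.example.com", "10.0.0.1")]
schedule = maintenance_schedule(machines, 1478620800 * 10**9, 3600 * 10**9)

# Body to POST to /maintenance/schedule on the master.
schedule_body = json.dumps(schedule, indent=2)
# Body to POST to /machine/down (and later /machine/up) on the master.
down_body = json.dumps(machines)

print(schedule_body)
print(down_body)
```

Once a machine is marked down, its tasks are expected to be gone, which is why the agent comes back with a new ID after the machine is brought up again.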


On Tue, Nov 8, 2016 at 8:12 AM, Hendrik Haddorp <[email protected]>
wrote:

> Interesting, in one case we also had a reboot but not in the simple
> restart with a pause test. Losing the ID on restart sounds odd to me. Do
> you have some further details on that?
>
> On 08.11.2016 17:08, Justin Pinkul wrote:
>
>>
>> Hello,
>>
>>
>> I also hit a very similar problem recently; perhaps it is related. There
>> is special logic inside the Mesos agent that checks whether the machine has
>> rebooted; if it has, the agent short-circuits recovery and registers with
>> a new agent ID. This is especially problematic with the new
>> --agent_removal_rate_limit and --recovery_agent_removal_limit flags. We hit
>> a power outage, which caused this to happen on every machine in our lab
>> at once. Since every agent had a new ID, 50% of the IDs were considered
>> lost, and these safeguards caused our master to kill itself every 15
>> minutes, even after all of the agents were back up and running. Is there
>> any advantage to throwing out the agent ID when rebooting?
>>
>>
>> Thanks,
>>
>> Justin
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Hendrik Haddorp <[email protected]>
>> *Sent:* Tuesday, November 8, 2016 12:59 PM
>> *To:* user
>> *Subject:* Slave gets new ID
>> Hi,
>>
>> when we take slaves down for maintenance, as described in
>> http://mesos.apache.org/documentation/latest/maintenance/, the slave
>> gets a new ID on startup. Why is that, and can it be changed? We are
>> using Mesos 0.28.2. So far I'm only aware of the
>> slave_reregister_timeout. Our restart was within that time frame. When
>> we restart a slave immediately, it keeps its ID. However, when we wait a
>> few minutes (less than the reregistration timeout) before we restart the
>> slave, the ID also changes.
>>
>> regards,
>> Hendrik
>>
>
>
