Hi All,

Can anyone give me a hint so I can quickly pin down the root cause? I would
like to work on it.

2016-11-22 20:01 GMT+08:00 tommy xiao <[email protected]>:

> This is an interesting case. +1
>
> 2016-11-22 12:27 GMT+08:00 X Brick <[email protected]>:
>
>> Found some issues in the JIRA:
>>
>>    - MESOS-5368 <https://issues.apache.org/jira/browse/MESOS-5368>
>>    - MESOS-6223 <https://issues.apache.org/jira/browse/MESOS-6223>
>>    - MESOS-3545 <https://issues.apache.org/jira/browse/MESOS-3545>
>>
>> Not quite sure whether the boot_id issue will be fixed in the next release,
>> but you could back up the boot_id file (in your $work_dir/meta/) after the
>> slave starts and restore it from the backup when restarting. It works well
>> for our cluster with persistent volumes.
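
The backup-and-restore workaround described above could be scripted roughly
like this. It is only a sketch: it assumes the checkpointed file is named
boot_id and sits directly under $work_dir/meta/, and the function names and
the backup directory are made up for illustration. Restoring the file hides
the reboot from the agent, so treat it as a workaround to test on one
machine first, not as a supported procedure.

```python
import shutil
from pathlib import Path


def backup_boot_id(work_dir: str, backup_dir: str) -> Path:
    """Copy <work_dir>/meta/boot_id into backup_dir (run after the slave has started)."""
    src = Path(work_dir) / "meta" / "boot_id"  # assumed file location
    dst_dir = Path(backup_dir)                 # hypothetical backup location
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / "boot_id"
    shutil.copy2(src, dst)
    return dst


def restore_boot_id(work_dir: str, backup_dir: str) -> bool:
    """Copy the saved boot_id back (run before restarting the slave).

    Returns False and changes nothing when no backup exists.
    """
    src = Path(backup_dir) / "boot_id"
    if not src.is_file():
        return False
    dst = Path(work_dir) / "meta" / "boot_id"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True
```

Call backup_boot_id() once the slave is up, and restore_boot_id() after the
machine has rebooted but before the slave process is started again.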
>>
>> 2016-11-09 0:43 GMT+08:00 Hendrik Haddorp <[email protected]>:
>>
>>> I have a framework that starts multiple docker containers. The
>>> configuration (hosts and ports) of my setup needs to stay constant. So in a
>>> first step my framework claims resources on the slaves. Once all
>>> required resources are acquired I start the containers using the docker
>>> containerizer. When a container fails I restart it on the same slave with
>>> the same config. So far I have been tracking the Mesos slave ID and would
>>> only restart the task if I got an offer for that slave again. Because the
>>> ID now changes, the task is no longer restarted.
>>>
>>> My assumption was that the slave ID would stay constant, so that I could,
>>> for example, change the host name and still recognize the instance, or
>>> start multiple slaves on the same server and easily distinguish them. If
>>> the slave ID changes, I would have expected all resources connected to
>>> it to be lost, but that doesn't seem to be the case, which is good in
>>> my case but rather odd in my opinion.
>>>
>>> On 08.11.2016 17:26, Vinod Kone wrote:
>>>
>>>> @Hendrik: When maintenance APIs are used, the typical expectation is
>>>> that the tasks on the machine are stopped (and rescheduled elsewhere in the
>>>> cluster). That is the reason that the agent gets a new ID. What is the
>>>> exact problem you are facing?
>>>>
>>>> @Justin: This is a known issue that is actively being worked on.
>>>> https://issues.apache.org/jira/browse/MESOS-5396
>>>>
>>>>
>>>> On Tue, Nov 8, 2016 at 8:12 AM, Hendrik Haddorp <[email protected]> wrote:
>>>>
>>>>     Interesting, in one case we also had a reboot but not in the
>>>>     simple restart with a pause test. Losing the ID on restart sounds
>>>>     odd to me. Do you have some further details on that?
>>>>
>>>>     On 08.11.2016 17:08, Justin Pinkul wrote:
>>>>
>>>>
>>>>         Hello,
>>>>
>>>>
>>>>         I also hit a very similar problem recently; perhaps it is
>>>>         related. There is special logic inside the Mesos agent that
>>>>         checks whether the machine has rebooted; if it has, the agent
>>>>         short-circuits recovery and registers with a new agent ID.
>>>>         This is especially problematic with the new
>>>>         --agent_removal_rate_limit and --recovery_agent_removal_limit
>>>>         flags. We hit a power outage, which caused this to happen on
>>>>         every machine in our lab at once. Since every agent had a new
>>>>         ID, 50% of the IDs were considered lost, and these safeguards
>>>>         caused our master to kill itself every 15 minutes, even after
>>>>         all of the agents were back up and running. Is there any
>>>>         advantage to throwing out the agent ID when rebooting?
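
To make the reboot check described above concrete, here is a minimal sketch
of the idea in Python. It is not the agent's actual code: the function names
are invented for illustration, and the treatment of a missing checkpoint is
my own simplification. The one thing it relies on is that Linux exposes a
per-boot UUID at /proc/sys/kernel/random/boot_id, which changes on every
reboot.

```python
from pathlib import Path

# Linux exposes a per-boot UUID here; it changes on every reboot.
KERNEL_BOOT_ID = Path("/proc/sys/kernel/random/boot_id")


def read_current_boot_id() -> str:
    """Return the boot ID of the running kernel (Linux only)."""
    return KERNEL_BOOT_ID.read_text().strip()


def rebooted_since_checkpoint(checkpoint_file: Path, current_boot_id: str) -> bool:
    """True when a checkpointed boot ID exists and differs from the current one."""
    if not checkpoint_file.is_file():
        # Simplification: with nothing checkpointed there is nothing to compare.
        return False
    return checkpoint_file.read_text().strip() != current_boot_id.strip()
```

When rebooted_since_checkpoint() returns True, an agent following the
behaviour described above would skip recovery and register as a new agent;
when it returns False, it would try to recover and keep its ID.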
>>>>
>>>>
>>>>         Thanks,
>>>>
>>>>         Justin
>>>>
>>>>
>>>>
>>>>
>>>>         ------------------------------------------------------------------------
>>>>         *From:* Hendrik Haddorp <[email protected]>
>>>>         *Sent:* Tuesday, November 8, 2016 12:59 PM
>>>>         *To:* user
>>>>         *Subject:* Slave gets new ID
>>>>         Hi,
>>>>
>>>>         when we take slaves down for maintenance, as described in
>>>>         <http://mesos.apache.org/documentation/latest/maintenance/>,
>>>>         the slave gets a new ID on start up. Why is that and can it
>>>>         be changed? We are using Mesos 0.28.2. So far I'm only aware
>>>>         of the slave_reregister_timeout. Our restart was within that
>>>>         time frame. When we restart a slave it keeps its ID. However,
>>>>         when we wait a few minutes, less than the reregistration
>>>>         timeout, before we restart the slave, the ID also changes.
>>>>
>>>>         regards,
>>>>         Hendrik
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>



-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
