I have a framework that starts multiple Docker containers. The configuration (hosts and ports) of my setup needs to stay constant, so in a first step my framework claims resources on the slaves. Once all required resources are acquired, I start the containers using the Docker containerizer. When a container fails, I restart it on the same slave with the same config. So far I have been tracking the Mesos slave ID and would only restart the task once I got an offer for that slave again. Since the ID now changes, I am not restarting the task anymore.
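The slave-pinning logic described above can be sketched roughly as follows (a plain-Python illustration; the names `pinned_tasks`, `launch_task`, and `decline_offer` are hypothetical, not the actual framework code or the Mesos scheduler API):

```python
# Hypothetical sketch: relaunch a failed task only when an offer arrives
# from the slave it was originally placed on.

pinned_tasks = {
    # task_id -> slave ID the task must run on (recorded at first launch)
    "web-1": "a9c5b2e4-S3",
}

def resource_offers(offers, launch_task, decline_offer):
    """For each offer, launch any task pinned to that slave ID;
    decline offers from slaves we are not waiting for."""
    for offer in offers:
        matched = [task_id for task_id, slave_id in pinned_tasks.items()
                   if slave_id == offer["slave_id"]]
        if matched:
            for task_id in matched:
                launch_task(offer, task_id)
        else:
            decline_offer(offer)
```

If the slave re-registers with a new ID after maintenance, no offer will ever match the pinned ID, which is exactly the stuck state described above.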

My assumption was that the slave ID would stay constant, so that I could, for example, change the host name and still recognize the instance, or start multiple slaves on the same server and easily distinguish them. If the slave ID changes, I would have expected all resources connected to it to be lost, but that does not seem to be the case, which is good in my situation but rather odd in my opinion.

On 08.11.2016 17:26, Vinod Kone wrote:
@Hendrik: When maintenance APIs are used, the typical expectation is that the tasks on the machine are stopped (and rescheduled elsewhere in the cluster). That is the reason that the agent gets a new ID. What is the exact problem you are facing?

@Justin: This is a known issue that is actively being worked on. https://issues.apache.org/jira/browse/MESOS-5396


On Tue, Nov 8, 2016 at 8:12 AM, Hendrik Haddorp <[email protected]> wrote:

    Interesting, in one case we also had a reboot, but not in the
    simple restart-with-a-pause test. Losing the ID on restart sounds
    odd to me. Do you have some further details on that?

    On 08.11.2016 17:08, Justin Pinkul wrote:


        Hello,


        I also hit a very similar problem recently; perhaps it is
        related. There is special logic inside the Mesos agent that
        checks whether the machine has rebooted; if it has, the agent
        short-circuits recovery and registers with a new agent ID.
        This is especially problematic with the new
        --agent_removal_rate_limit and --recovery_agent_removal_limit
        flags. We hit a power outage, which caused this to happen on
        every machine in our lab at once. Since every agent had a new
        ID, 50% of the IDs were considered lost, and these safeguards
        caused our master to kill itself every 15 minutes even after
        all of the agents were back up and running. Is there any
        advantage to throwing out the agent ID when rebooting?


        Thanks,

        Justin
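The safeguard Justin describes can be illustrated with a simplified sketch of a recovery removal-limit check (an illustration of the general mechanism only, not the actual Mesos master code; the function name and threshold handling are assumptions):

```python
def recovery_removal_limit_exceeded(registered_agents,
                                    reregistered_agents,
                                    limit_percent):
    """Return True if too many agents from the registry failed to
    re-register after master recovery, i.e. the master should refuse
    to remove them and abort instead of marking them all lost."""
    lost = registered_agents - reregistered_agents
    lost_percent = 100.0 * lost / registered_agents
    return lost_percent > limit_percent

# After a cluster-wide power outage every agent reboots and comes back
# with a NEW ID, so none of the old registered IDs ever re-register:
recovery_removal_limit_exceeded(100, 0, 20)  # all 100 old IDs "lost"
```

With every one of the old IDs unaccounted for, the lost fraction (100%) exceeds any sensible limit, so the safeguard trips even though the machines themselves are back up and healthy.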




        ------------------------------------------------------------------------
        *From:* Hendrik Haddorp <[email protected]>
        *Sent:* Tuesday, November 8, 2016 12:59 PM
        *To:* user
        *Subject:* Slave gets new ID
        Hi,

        when we take slaves down for maintenance, as described in
        http://mesos.apache.org/documentation/latest/maintenance/, the
        slave gets a new ID on start up. Why is that and can it be
        changed? We are using Mesos 0.28.2. So far I am only aware of
        the slave_reregister_timeout, and our restart was within that
        time frame. When we restart a slave it keeps its ID; however,
        when we wait a few minutes before restarting it, still less
        than the reregistration timeout, the ID changes as well.
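For reference, the reregistration window mentioned above is configured on the master; a sketch of the relevant flag is below (the surrounding flags and values are illustrative, and the timeout value shown is an assumption, not a recommendation):

```shell
# Illustrative master invocation; --slave_reregister_timeout controls how
# long a restarted agent may take to re-register with its old ID before
# the master removes it from the registry (default is 10mins).
mesos-master \
  --zk=zk://zk1:2181/mesos \
  --quorum=1 \
  --work_dir=/var/lib/mesos \
  --slave_reregister_timeout=15mins
```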

        regards,
        Hendrik



