Here you are:

https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

You can see in the mesos-master.INFO log that it re-registers the slave
using port :5050 (line 9) and fails the health checks on port :5051 (line
10). So could it be the slave that is reusing the old configuration?

Thanks,
Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vinodk...@gmail.com> wrote:

> Can you paste some logs?
>
> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <phili...@hopper.com>
> wrote:
>
>> Ok, that's reasonable, but I'm not sure why it would successfully
>> re-register with the master if it's not supposed to in the first place. I
>> think changing the resources (for example) will dump the old configuration
>> in the logs and tell you why recovery is bailing out. It's not doing that
>> in this case.
>>
>> It looks as though this doesn't work only because the master can't ping
>> the slave on the old port, because the whole recovery process was
>> successful otherwise.
>>
>> I'm not sure if the slave could have picked up its configuration change
>> and failed the recovery early, but that would definitely be a better
>> experience.
>>
>> Philippe
>>
>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>
>>> For slave recovery to work, it is expected to not change its config.
>>>
>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <phili...@hopper.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>> configured with checkpointing and with "reconnect" recovery.
>>>>
>>>> I was investigating why the slaves would successfully re-register with
>>>> the master and recover, but would subsequently be asked to shut down
>>>> ("health check timeout").
>>>>
>>>> It turns out that our slaves had been unintentionally configured to use
>>>> port 5050 in the previous configuration. We decided to fix that during the
>>>> upgrade and have them use the default 5051 port.
>>>>
>>>> This change seems to make the health checks fail and eventually kills
>>>> the slave due to inactivity.
>>>>
>>>> I've confirmed that leaving the port as it was in the previous
>>>> configuration lets the slave re-register successfully, and it is not
>>>> asked to shut down later on.
>>>>
>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>
>>>> Thanks,
>>>> Philippe
>>>>
>>>
>>>
>>
>
