Looks like this is due to a bug in versions < 0.23.0, where slave recovery
didn't check for changes in 'port' when considering compatibility
<https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137>.
It has since been fixed in the upcoming 0.23.0 release.

On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme <[email protected]>
wrote:

> Checkpointing has been enabled since 0.18 on these slaves. The only other
> setting that changed during the upgrade was that we added --gc_delay=1days.
> Otherwise, it's an in-place upgrade without any changes to the work
> directory...
>
> Philippe
>
> On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <[email protected]> wrote:
>
>> It is surprising that the slave didn't bail out during the initial phase
>> of recovery when the port changed. I'm assuming you enabled checkpointing
>> in 0.20.0 and that you didn't wipe the meta data directory or anything when
>> upgrading to 0.21.0?
>>
>> On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <[email protected]>
>> wrote:
>>
>>> Here you are:
>>>
>>> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>>>
>>> You can see in the mesos-master.INFO log that it re-registers the slave
>>> using port :5050 (line 9) and fails the health checks on port :5051 (line
>>> 10). So it might be the slave that re-uses the old configuration?
>>>
>>> Thanks,
>>> Philippe
>>>
>>> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <[email protected]> wrote:
>>>
>>>> Can you paste some logs?
>>>>
>>>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <[email protected]>
>>>> wrote:
>>>>
>>>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>>>> re-register with the master if it's not supposed to in the first place. I
>>>>> think changing the resources (for example) will dump the old configuration
>>>>> in the logs and tell you why recovery is bailing out. It's not doing that
>>>>> in this case.
>>>>>
>>>>> It looks as though this fails only because the master can't ping
>>>>> the slave on the old port; the whole recovery process was
>>>>> successful otherwise.
>>>>>
>>>>> I'm not sure if the slave could have picked up its configuration
>>>>> change and failed the recovery early, but that would definitely be a 
>>>>> better
>>>>> experience.
>>>>>
>>>>> Philippe
>>>>>
>>>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> For slave recovery to work, it is expected to not change its config.
>>>>>>
>>>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>>>
>>>>>>> I was investigating why the slaves would successfully re-register
>>>>>>> with the master and recover, but would subsequently be asked to shutdown
>>>>>>> ("health check timeout").
>>>>>>>
>>>>>>> It turns out that our slaves had been unintentionally configured to
>>>>>>> use port 5050 in the previous configuration. We decided to fix that 
>>>>>>> during
>>>>>>> the upgrade and have them use the default 5051 port.
>>>>>>>
>>>>>>> This change seems to make the health checks fail and eventually
>>>>>>> kills the slave due to inactivity.
>>>>>>>
>>>>>>> I've confirmed that leaving the port at what it was in the previous
>>>>>>> configuration lets the slave successfully re-register without being
>>>>>>> asked to shut down later on.
>>>>>>>
>>>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Philippe
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
