Ok, that's reasonable, but I'm not sure why it would successfully
re-register with the master if it's not supposed to in the first place. I
think changing the resources (for example) will dump the old configuration
in the logs and tell you why recovery is bailing out. It's not doing that
in this case.

It looks as though this fails only because the master can't ping the
slave on the old port; the rest of the recovery process was successful.

I'm not sure if the slave could have picked up its configuration change and
failed the recovery early, but that would definitely be a better experience.
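
As a rough illustration of what "failing early" could look like, here is a
small Python sketch. It is not Mesos code: SlaveConfig, RecoveryError and
check_recoverable are made-up names, and the set of compared fields is an
assumption. The idea is simply to compare the checkpointed configuration
with the one the slave was just started with, and to refuse recovery with a
message that names every changed field.

```python
# Hypothetical sketch: none of these names come from Mesos itself.
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveConfig:
    """The subset of slave settings assumed to matter for recovery."""
    ip: str
    port: int
    resources: str


class RecoveryError(Exception):
    """Raised when the current config is incompatible with the checkpoint."""


def check_recoverable(checkpointed: SlaveConfig, current: SlaveConfig) -> None:
    """Raise RecoveryError listing every field that differs from the checkpoint."""
    changed = [
        f"{name}: {getattr(checkpointed, name)!r} -> {getattr(current, name)!r}"
        for name in ("ip", "port", "resources")
        if getattr(checkpointed, name) != getattr(current, name)
    ]
    if changed:
        raise RecoveryError(
            "config changed since checkpoint: " + "; ".join(changed)
        )
```

With a check like this, the port change from 5050 to 5051 would be reported
at startup rather than showing up later as a health check timeout.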

Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <[email protected]> wrote:

> For slave recovery to work, the slave is expected to not change its config.
>
> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <[email protected]>
> wrote:
>
>> Hi,
>>
>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>> configured with checkpointing and with "reconnect" recovery.
>>
>> I was investigating why the slaves would successfully re-register with
>> the master and recover, but would subsequently be asked to shut down
>> ("health check timeout").
>>
>> It turns out that our slaves had been unintentionally configured to use
>> port 5050 in the previous configuration. We decided to fix that during the
>> upgrade and have them use the default 5051 port.
>>
>> This change seems to make the health checks fail and eventually kills the
>> slave due to inactivity.
>>
>> I've confirmed that leaving the port as it was in the previous
>> configuration lets the slave successfully re-register, and it is not
>> asked to shut down later on.
>>
>> Is this a known issue? I haven't been able to find a JIRA ticket for
>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>
>> Thanks,
>> Philippe
>>
>
>
