Looks like this is due to a bug in versions < 0.23.0, where slave recovery didn't check for changes in 'port' when considering compatibility <https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137>. It has since been fixed in the upcoming 0.23.0 release.
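
To illustrate the kind of check involved, here is a minimal Python sketch. It is not the actual Mesos code: the SlaveInfo class, its fields, and both function names are hypothetical and only show how a compatibility comparison that leaves out the port lets a changed port slip through recovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveInfo:
    """Hypothetical, simplified stand-in for the checkpointed slave info."""
    hostname: str
    port: int
    resources: str


def compatible_without_port(old: SlaveInfo, new: SlaveInfo) -> bool:
    # Assumed shape of the buggy check: the port is never compared.
    return old.hostname == new.hostname and old.resources == new.resources


def compatible_with_port(old: SlaveInfo, new: SlaveInfo) -> bool:
    # Assumed shape of the fixed check: a port change is an incompatibility.
    return compatible_without_port(old, new) and old.port == new.port


# The scenario from this thread: same host and resources, port 5050 -> 5051.
checkpointed = SlaveInfo("slave1.example.com", 5050, "cpus:4;mem:8192")
restarted = SlaveInfo("slave1.example.com", 5051, "cpus:4;mem:8192")

print(compatible_without_port(checkpointed, restarted))
print(compatible_with_port(checkpointed, restarted))
```

With the port left out of the comparison, the restarted slave is treated as compatible and recovery proceeds; once the port is compared, the mismatch can be caught up front.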

On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme <[email protected]> wrote:

> Checkpointing has been enabled since 0.18 on these slaves. The only other
> setting that changed during the upgrade was that we added --gc_delay=1days.
> Otherwise, it's an in-place upgrade without any changes to the work
> directory...
>
> Philippe
>
> On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <[email protected]> wrote:
>
>> It is surprising that the slave didn't bail out during the initial phase
>> of recovery when the port changed. I'm assuming you enabled checkpointing
>> in 0.20.0 and that you didn't wipe the meta data directory or anything when
>> upgrading to 0.21.0?
>>
>> On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <[email protected]>
>> wrote:
>>
>>> Here you are:
>>>
>>> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>>>
>>> You can see in the mesos-master.INFO log that it re-registers the slave
>>> using port :5050 (line 9) and fails the health checks on port :5051 (line
>>> 10). So it might be the slave that re-uses the old configuration?
>>>
>>> Thanks,
>>> Philippe
>>>
>>> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <[email protected]> wrote:
>>>
>>>> Can you paste some logs?
>>>>
>>>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <[email protected]>
>>>> wrote:
>>>>
>>>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>>>> re-register with the master if it's not supposed to in the first place. I
>>>>> think changing the resources (for example) will dump the old configuration
>>>>> in the logs and tell you why recovery is bailing out. It's not doing that
>>>>> in this case.
>>>>>
>>>>> It looks as though this doesn't work only because the master can't ping
>>>>> the slave on the old port, because the whole recovery process was
>>>>> successful otherwise.
>>>>>
>>>>> I'm not sure if the slave could have picked up its configuration
>>>>> change and failed the recovery early, but that would definitely be a
>>>>> better experience.
>>>>>
>>>>> Philippe
>>>>>
>>>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> For slave recovery to work, it is expected to not change its config.
>>>>>>
>>>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>>>
>>>>>>> I was investigating why the slaves would successfully re-register
>>>>>>> with the master and recover, but would subsequently be asked to shut
>>>>>>> down ("health check timeout").
>>>>>>>
>>>>>>> It turns out that our slaves had been unintentionally configured to
>>>>>>> use port 5050 in the previous configuration. We decided to fix that
>>>>>>> during the upgrade and have them use the default 5051 port.
>>>>>>>
>>>>>>> This change seems to make the health checks fail and eventually
>>>>>>> kills the slave due to inactivity.
>>>>>>>
>>>>>>> I've confirmed that leaving the port to what it was in the previous
>>>>>>> configuration makes the slave successfully re-register and is not
>>>>>>> asked to shut down later on.
>>>>>>>
>>>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Philippe

