Checkpointing has been enabled since 0.18 on these slaves. The only other setting that changed during the upgrade was that we added --gc_delay=1days. Otherwise, it's an in-place upgrade without any changes to the work directory...

Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <[email protected]> wrote:

> It is surprising that the slave didn't bail out during the initial phase
> of recovery when the port changed. I'm assuming you enabled checkpointing
> in 0.20.0 and that you didn't wipe the meta data directory or anything when
> upgrading to 0.21.0?
>
> On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <[email protected]>
> wrote:
>
>> Here you are:
>>
>> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>>
>> You can see in the mesos-master.INFO log that it re-registers the slave
>> using port :5050 (line 9) and fails the health checks on port :5051 (line
>> 10). So it might be the slave that re-uses the old configuration?
>>
>> Thanks,
>> Philippe
>>
>> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <[email protected]> wrote:
>>
>>> Can you paste some logs?
>>>
>>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <[email protected]>
>>> wrote:
>>>
>>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>>> re-register with the master if it's not supposed to in the first place. I
>>>> think changing the resources (for example) will dump the old configuration
>>>> in the logs and tell you why recovery is bailing out. It's not doing that
>>>> in this case.
>>>>
>>>> It looks as though this doesn't work only because the master can't ping
>>>> the slave on the old port, because the whole recovery process was
>>>> successful otherwise.
>>>>
>>>> I'm not sure if the slave could have picked up its configuration change
>>>> and failed the recovery early, but that would definitely be a better
>>>> experience.
>>>>
>>>> Philippe
>>>>
>>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <[email protected]> wrote:
>>>>
>>>>> For slave recovery to work, it is expected to not change its config.
>>>>>
>>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>>
>>>>>> I was investigating why the slaves would successfully re-register
>>>>>> with the master and recover, but would subsequently be asked to shutdown
>>>>>> ("health check timeout").
>>>>>>
>>>>>> It turns out that our slaves had been unintentionally configured to
>>>>>> use port 5050 in the previous configuration. We decided to fix that
>>>>>> during the upgrade and have them use the default 5051 port.
>>>>>>
>>>>>> This change seems to make the health checks fail and eventually kills
>>>>>> the slave due to inactivity.
>>>>>>
>>>>>> I've confirmed that leaving the port to what it was in the previous
>>>>>> configuration makes the slave successfully re-register and is not asked
>>>>>> to shutdown later on.
>>>>>>
>>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>>
>>>>>> Thanks,
>>>>>> Philippe
>>>>>>
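
Since recovery expects the slave's configuration to stay the same across a restart, one way to catch a change like the 5050 -> 5051 port switch before rolling out is to diff the old and new flag sets by hand. The following is only a minimal sketch of that idea: the `changed_flags` helper and the two dictionaries are hypothetical, the flag values are illustrative, and nothing here reads Mesos's checkpointed state or uses any Mesos API.

```python
# Hypothetical pre-restart check. The flag sets are written out by hand
# for illustration; they are not read from the slave's work directory.
def changed_flags(old, new):
    """Return {flag: (old_value, new_value)} for every flag whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Illustrative values modelled on the configuration described in this thread.
previous = {"port": "5050", "recover": "reconnect"}
upcoming = {"port": "5051", "recover": "reconnect", "gc_delay": "1days"}

diff = changed_flags(previous, upcoming)
for flag, (before, after) in diff.items():
    print(f"{flag}: {before!r} -> {after!r}")
```

A flag that appears on only one side shows up with None on the other, so additions such as --gc_delay are listed alongside changed values such as the port.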

