Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration?

Thanks,
Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vinodk...@gmail.com> wrote:

> Can you paste some logs?
>
> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <phili...@hopper.com>
> wrote:
>
>> Ok, that's reasonable, but I'm not sure why it would successfully
>> re-register with the master if it's not supposed to in the first place. I
>> think changing the resources (for example) will dump the old configuration
>> in the logs and tell you why recovery is bailing out. It's not doing that
>> in this case.
>>
>> It looks as though this doesn't work only because the master can't ping
>> the slave on the old port, because the whole recovery process was
>> successful otherwise.
>>
>> I'm not sure if the slave could have picked up its configuration change
>> and failed the recovery early, but that would definitely be a better
>> experience.
>>
>> Philippe
>>
>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>
>>> For slave recovery to work, it is expected to not change its config.
>>>
>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <phili...@hopper.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>> configured with checkpointing and with "reconnect" recovery.
>>>>
>>>> I was investigating why the slaves would successfully re-register with
>>>> the master and recover, but would subsequently be asked to shut down
>>>> ("health check timeout").
>>>>
>>>> It turns out that our slaves had been unintentionally configured to use
>>>> port 5050 in the previous configuration. We decided to fix that during the
>>>> upgrade and have them use the default 5051 port.
>>>>
>>>> This change seems to make the health checks fail and eventually kills
>>>> the slave due to inactivity.
>>>>
>>>> I've confirmed that leaving the port to what it was in the previous
>>>> configuration makes the slave successfully re-register and is not asked to
>>>> shutdown later on.
>>>>
>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>
>>>> Thanks,
>>>> Philippe
>>>>
>>>
>>>
>>
>