Re: Mesos Slave Port Change Fails Recovery

2015-07-03 Thread Vinod Kone
Looks like this is due to a bug in versions < 0.23.0, where slave recovery
didn't check for changes in 'port' when considering compatibility:
https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137.
It has since been fixed in the upcoming 0.23.0 release.
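
For illustration only, here is a simplified C++ sketch of the kind of
compatibility check being described; the struct and function names are
stand-ins (not the actual code behind that link), and resources/attributes
are omitted for brevity:

// Hypothetical, simplified model of the slave recovery compatibility check.
// Before 0.23.0 the comparison reportedly did not cover 'port', so a slave
// whose port changed still matched its checkpointed SlaveInfo and recovery
// proceeded, only to fail the master's health checks later.
#include <string>

struct SlaveInfo {            // stand-in for the mesos::SlaveInfo protobuf
  std::string hostname;
  std::string id;
  bool checkpoint;
  int port;                   // the field that was not compared pre-0.23.0
};

// Pre-0.23.0 style check: 'port' is missing from the comparison.
bool compatiblePre023(const SlaveInfo& a, const SlaveInfo& b) {
  return a.hostname == b.hostname &&
         a.id == b.id &&
         a.checkpoint == b.checkpoint;
}

// Post-fix style check: a changed port now fails recovery up front instead
// of surviving re-registration and then timing out on health checks.
bool compatible(const SlaveInfo& a, const SlaveInfo& b) {
  return compatiblePre023(a, b) && a.port == b.port;
}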

On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme phili...@hopper.com
wrote:

 Checkpointing has been enabled since 0.18 on these slaves. The only other
 setting that changed during the upgrade was that we added --gc_delay=1days.
 Otherwise, it's an in-place upgrade without any changes to the work
 directory...

 Philippe


Re: Mesos Slave Port Change Fails Recovery

2015-07-03 Thread Philippe Laflamme
Awesome!

We've reverted to the previous port and all our slaves have recovered
nicely.

Thanks for looking into this,
Philippe

On Fri, Jul 3, 2015 at 3:27 PM, Vinod Kone vinodk...@gmail.com wrote:

 Looks like this is due to a bug in versions < 0.23.0, where slave recovery
 didn't check for changes in 'port' when considering compatibility:
 https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137.
 It has since been fixed in the upcoming 0.23.0 release.


Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Philippe Laflamme
Checkpointing has been enabled since 0.18 on these slaves. The only other
setting that changed during the upgrade was that we added --gc_delay=1days.
Otherwise, it's an in-place upgrade without any changes to the work
directory...

Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone vinodk...@gmail.com wrote:

 It is surprising that the slave didn't bail out during the initial phase
 of recovery when the port changed. I'm assuming you enabled checkpointing
 in 0.20.0 and that you didn't wipe the metadata directory or anything when
 upgrading to 0.21.0?


Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Vinod Kone
It is surprising that the slave didn't bail out during the initial phase of
recovery when the port changed. I'm assuming you enabled checkpointing in
0.20.0 and that you didn't wipe the metadata directory or anything when
upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com
wrote:

 Here you are:

 https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

 You can see in the mesos-master.INFO log that it re-registers the slave
 using port :5050 (line 9) and fails the health checks on port :5051 (line
 10). So it might be the slave that re-uses the old configuration?

 Thanks,
 Philippe


Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Vinod Kone
For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com
wrote:

 Hi,

 I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
 configured with checkpointing and with reconnect recovery.

 I was investigating why the slaves would successfully re-register
 with the master and recover, but would subsequently be asked to shut down
 (health check timeout).

 It turns out that our slaves had been unintentionally configured to use
 port 5050 in the previous configuration. We decided to fix that during the
 upgrade and have them use the default 5051 port.

 This change seems to make the health checks fail, and the slave is
 eventually killed due to inactivity.

 I've confirmed that leaving the port at what it was in the previous
 configuration lets the slave successfully re-register without being asked
 to shut down later on.

 Is this a known issue? I haven't been able to find a JIRA ticket for this.
 Maybe it's the expected behaviour? Should I create a ticket?

 Thanks,
 Philippe



Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Philippe Laflamme
Hi,

I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
configured with checkpointing and with reconnect recovery.

I was investigating why the slaves would successfully re-register with the
master and recover, but would subsequently be asked to shut down (health
check timeout).

It turns out that our slaves had been unintentionally configured to use
port 5050 in the previous configuration. We decided to fix that during the
upgrade and have them use the default 5051 port.

This change seems to make the health checks fail, and the slave is
eventually killed due to inactivity.

I've confirmed that leaving the port at what it was in the previous
configuration lets the slave successfully re-register without being asked
to shut down later on.

Is this a known issue? I haven't been able to find a JIRA ticket for this.
Maybe it's the expected behaviour? Should I create a ticket?

Thanks,
Philippe


Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Philippe Laflamme
Here you are:

https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

You can see in the mesos-master.INFO log that it re-registers the slave
using port :5050 (line 9) and fails the health checks on port :5051 (line
10). So it might be the slave that re-uses the old configuration?

Thanks,
Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

 Can you paste some logs?

 On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Ok, that's reasonable, but I'm not sure why it would successfully
 re-register with the master if it's not supposed to in the first place. I
 think changing the resources (for example) will dump the old configuration
 in the logs and tell you why recovery is bailing out. It's not doing that
 in this case.

 It looks as though this fails only because the master can't ping the
 slave on the old port; the whole recovery process was successful
 otherwise.

 I'm not sure if the slave could have picked up its configuration change
 and failed the recovery early, but that would definitely be a better
 experience.

 Philippe
