Re: Mesos Slave Port Change Fails Recovery
Looks like this is due to a bug in versions prior to 0.23.0, where slave recovery didn't check for changes in 'port' when considering compatibility: https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137. It has since been fixed in the upcoming 0.23.0 release. (A simplified sketch of the comparison in question follows the quoted thread below.)

On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme phili...@hopper.com wrote:

Checkpointing has been enabled since 0.18 on these slaves. The only other setting that changed during the upgrade was that we added --gc_delay=1days. Otherwise, it's an in-place upgrade without any changes to the work directory... Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone vinodk...@gmail.com wrote:

It is surprising that the slave didn't bail out during the initial phase of recovery when the port changed. I'm assuming you enabled checkpointing in 0.20.0 and that you didn't wipe the metadata directory or anything when upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com wrote:

Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though the only reason this doesn't work is that the master can't ping the slave on the old port, since the whole recovery process was otherwise successful. I'm not sure if the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
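To make the diagnosis above concrete: during recovery the slave compares its checkpointed SlaveInfo against the one built from its current flags. Below is a minimal C++ sketch of the pre-0.23.0 shape of that comparison; the stand-in types and field names are simplified for illustration and this is not the actual Mesos source linked above.

#include <iostream>
#include <string>

// Simplified stand-in for the mesos::SlaveInfo protobuf; the real type
// carries richer Resources/Attributes types than plain strings.
struct SlaveInfo
{
  std::string hostname;
  std::string resources;
  std::string attributes;
  std::string id;
  bool checkpoint;
  int port;
};

// Pre-0.23.0 behaviour: 'port' is never compared, so a slave restarted
// on a different port still looks "compatible" and recovery proceeds.
bool operator==(const SlaveInfo& left, const SlaveInfo& right)
{
  return left.hostname == right.hostname &&
         left.resources == right.resources &&
         left.attributes == right.attributes &&
         left.id == right.id &&
         left.checkpoint == right.checkpoint;
  // The 0.23.0 fix adds the missing '&& left.port == right.port'.
}

int main()
{
  SlaveInfo before{"slave1.example.com", "cpus:8;mem:16384", "rack:r1", "S1", true, 5050};
  SlaveInfo after{"slave1.example.com", "cpus:8;mem:16384", "rack:r1", "S1", true, 5051};

  // Prints "compatible" even though the port changed from 5050 to 5051.
  std::cout << (before == after ? "compatible" : "incompatible") << std::endl;
  return 0;
}

Because the comparison never looks at 'port', recovery and re-registration succeed, but the master keeps pinging the checkpointed :5050 address, which would explain the health-check failures in the gist earlier in the thread.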
Re: Mesos Slave Port Change Fails Recovery
Awesome! We've reverted to the previous port and all our slaves have recovered nicely. Thanks for looking into this, Philippe

On Fri, Jul 3, 2015 at 3:27 PM, Vinod Kone vinodk...@gmail.com wrote:

Looks like this is due to a bug in versions prior to 0.23.0, where slave recovery didn't check for changes in 'port' when considering compatibility: https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137. It has since been fixed in the upcoming 0.23.0 release.

On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme phili...@hopper.com wrote:

Checkpointing has been enabled since 0.18 on these slaves. The only other setting that changed during the upgrade was that we added --gc_delay=1days. Otherwise, it's an in-place upgrade without any changes to the work directory... Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone vinodk...@gmail.com wrote:

It is surprising that the slave didn't bail out during the initial phase of recovery when the port changed. I'm assuming you enabled checkpointing in 0.20.0 and that you didn't wipe the metadata directory or anything when upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com wrote:

Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though the only reason this doesn't work is that the master can't ping the slave on the old port, since the whole recovery process was otherwise successful. I'm not sure if the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
Re: Mesos Slave Port Change Fails Recovery
Checkpointing has been enabled since 0.18 on these slaves. The only other setting that changed during the upgrade was that we added --gc_delay=1days. Otherwise, it's an in-place upgrade without any changes to the work directory... Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone vinodk...@gmail.com wrote:

It is surprising that the slave didn't bail out during the initial phase of recovery when the port changed. I'm assuming you enabled checkpointing in 0.20.0 and that you didn't wipe the metadata directory or anything when upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com wrote:

Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though the only reason this doesn't work is that the master can't ping the slave on the old port, since the whole recovery process was otherwise successful. I'm not sure if the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
Re: Mesos Slave Port Change Fails Recovery
It is surprising that the slave didn't bail out during the initial phase of recovery when the port changed. I'm assuming you enabled checkpointing in 0.20.0 and that you didn't wipe the metadata directory or anything when upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com wrote:

Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though the only reason this doesn't work is that the master can't ping the slave on the old port, since the whole recovery process was otherwise successful. I'm not sure if the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
Re: Mesos Slave Port Change Fails Recovery
For slave recovery to work, the slave is expected not to change its config. (A rough illustration of this check follows the quoted message below.)

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
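To illustrate the rule above: on startup, a recovering slave checks its current configuration against the SlaveInfo checkpointed in its work directory and refuses to recover on a mismatch. The sketch below is a rough, self-contained illustration with hypothetical names, not the actual Mesos recovery code; its comparison deliberately omits 'port' to mirror the pre-0.23.0 behaviour diagnosed at the top of this thread.

#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-in for the checkpointed slave metadata.
struct SlaveInfo
{
  std::string hostname;
  std::string resources;
  int port;
};

// Illustrative compatibility check; 'port' is deliberately left out to
// mirror the pre-0.23.0 comparison discussed at the top of the thread.
bool compatible(const SlaveInfo& checkpointed, const SlaveInfo& current)
{
  return checkpointed.hostname == current.hostname &&
         checkpointed.resources == current.resources;
}

int main()
{
  SlaveInfo checkpointed{"slave1.example.com", "cpus:8;mem:16384", 5050};
  SlaveInfo current{"slave1.example.com", "cpus:8;mem:16384", 5051};

  // The gate: refuse to recover when the config has changed.
  if (!compatible(checkpointed, current)) {
    std::cerr << "Incompatible slave info detected; refusing to recover"
              << std::endl;
    return EXIT_FAILURE;
  }

  // With 'port' missing from the comparison, execution falls through to
  // here even though the slave moved from 5050 to 5051: recovery
  // "succeeds", and the master later fails health checks against the
  // stale port.
  std::cout << "Recovering from checkpointed state..." << std::endl;
  return EXIT_SUCCESS;
}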
Mesos Slave Port Change Fails Recovery
Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
Re: Mesos Slave Port Change Fails Recovery
Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though the only reason this doesn't work is that the master can't ping the slave on the old port, since the whole recovery process was otherwise successful. I'm not sure if the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0, with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default port, 5051. This change seems to make the health checks fail, eventually killing the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register without being asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe