Re: [openstack-dev] [ironic] Tooling for recovering nodes
Hi Jay, Dmitry, and all,

I submitted two patches to try to recover nodes stuck in the deploying state:

1. Fix the issue of the ironic-conductor being brought up with a different
   hostname: https://review.openstack.org/325026
2. Clear the lock on nodes in maintenance states:
   https://review.openstack.org/#/c/324269/

If the above solutions are promising, then we don't need a new tool to
recover nodes in the deploying state.

B.R.
Tan

-----Original Message-----
From: Jay Faulkner [mailto:j...@jvf.cc]
Sent: Thursday, June 2, 2016 7:45 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

Some comments inline.

On 5/31/16 12:26 PM, Devananda van der Veen wrote:
> On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
>> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>>> Hi,
>>>
>>> Recently, I have been working on a spec [1] to recover nodes that get
>>> stuck in the deploying state, and I would really appreciate some
>>> feedback from you.
>>>
>>> Ironic nodes can get stuck in
>>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the
>>> node is reserved by a dead conductor (i.e. the exclusive lock was never
>>> released). Any further requests will be denied by ironic because it
>>> thinks the node resource is under the control of another conductor.
>>>
>>> To be more clear, let's narrow the scope and focus on the deploying
>>> state first. Currently, operators have several choices for clearing
>>> the reserved lock:
>>> 1. Restart the dead conductor.
>>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear
>>> the lock.
>>> 3. Touch the DB to manually recover these nodes.
>>>
>>> Option two looks very promising, but it has some weaknesses:
>>> 2.1 It won't work if the dead conductor was renamed or deleted.
>>> 2.2 It won't work if the node's specific driver was not enabled on
>>> live conductors.
>>> 2.3 It won't work if the node is in maintenance (only a corner case).
>> We can and should fix all three cases.
> 2.1 and 2.2 appear to be a bug in the behavior of
> _check_deploying_status().
>
> The method claims to do exactly what you suggest in 2.1 and 2.2 -- it
> gathers a list of Nodes reserved by *any* offline conductor and tries to
> release the lock. However, it will always fail to update them, because
> objects.Node.release() raises a NodeLocked exception when called on a Node
> locked by a different conductor.
>
> Here's the relevant code path:
>
> ironic/conductor/manager.py:
> 1259     def _check_deploying_status(self, context):
>          ...
> 1269         offline_conductors = self.dbapi.get_offline_conductors()
>          ...
> 1273         node_iter = self.iter_nodes(
> 1274             fields=['id', 'reservation'],
> 1275             filters={'provision_state': states.DEPLOYING,
> 1276                      'maintenance': False,
> 1277                      'reserved_by_any_of': offline_conductors})
>          ...
> 1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
> 1285             try:
> 1286                 objects.Node.release(context, conductor_hostname, node_id)
>          ...
> 1292             except exception.NodeLocked:
> 1293                 LOG.warning(...)
> 1297                 continue
>
> As far as 2.3, I think we should change the query string at the start
> of this method so that it includes nodes in maintenance mode. I think
> it's both safe and reasonable (and, frankly, what an operator will
> expect) that a node which is in maintenance mode, and in DEPLOYING
> state, whose conductor is offline, should have that reservation cleared
> and be set to DEPLOYFAILED state.

This is an excellent idea -- and I'm going to extend it further. If I have
any nodes in a *ING state and they are put into maintenance, it should force
a failure. This is potentially a more API-friendly way of cleaning up nodes
in bad states -- an operator would put the node into maintenance, and once
it enters the *FAIL state, troubleshoot why it failed, take it out of
maintenance, and return it to production.
I obviously strongly desire an "override command" as an operator, but I
really think this could handle a large percentage of the use cases that made
me desire it in the first place.

> --devananda
>
>>> Definitely we should improve option 2, but there could be more issues
>>> I don't know about in a more complicated environment.
>>> So my question is: do we still need a new command to recover these
>>> nodes more easily without accessing the DB, like this PoC [2]:
>>> ironic-noderecover --node_uuids=UUID1,UUID2
>>> --config-file=/etc/ironic/ironic.conf
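The fix Devananda describes for cases 2.1 and 2.2 -- clearing a reservation
held by *any* offline conductor instead of calling objects.Node.release() as
a conductor that does not hold the lock -- can be sketched roughly as
follows. This is a hypothetical, simplified sketch: the dict-based node
layout and the function name are assumptions for illustration, not ironic's
real internals.

```python
# Hypothetical sketch (not ironic's real code): force-release reservations
# held by any offline conductor, then fail the stuck deploy, instead of
# calling objects.Node.release() as the wrong conductor and hitting
# NodeLocked.

DEPLOYING = 'deploying'
DEPLOYFAILED = 'deploy failed'

def recover_stuck_deploys(nodes, offline_conductors):
    """Force-release stale locks and fail the stuck deploys.

    nodes: iterable of dicts with 'uuid', 'reservation', 'provision_state'.
    Returns the UUIDs of the nodes that were recovered.
    """
    recovered = []
    for node in nodes:
        if node['provision_state'] != DEPLOYING:
            continue
        # Only clear locks whose holder is known to be dead; a live
        # conductor's lock must never be touched.
        if node['reservation'] not in offline_conductors:
            continue
        node['reservation'] = None
        node['provision_state'] = DEPLOYFAILED
        recovered.append(node['uuid'])
    return recovered
```

The key difference from the quoted code path is that the release happens on
behalf of the dead lock holder, so no NodeLocked-style check can reject it.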
Re: [openstack-dev] [ironic] Tooling for recovering nodes
Some comments inline.

On 5/31/16 12:26 PM, Devananda van der Veen wrote:
> On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
>> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>>> Hi,
>>>
>>> Recently, I have been working on a spec [1] to recover nodes that get
>>> stuck in the deploying state, and I would really appreciate some
>>> feedback from you.
>>>
>>> Ironic nodes can get stuck in
>>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the
>>> node is reserved by a dead conductor (i.e. the exclusive lock was never
>>> released). Any further requests will be denied by ironic because it
>>> thinks the node resource is under the control of another conductor.
>>>
>>> To be more clear, let's narrow the scope and focus on the deploying
>>> state first. Currently, operators have several choices for clearing
>>> the reserved lock:
>>> 1. Restart the dead conductor.
>>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear
>>> the lock.
>>> 3. Touch the DB to manually recover these nodes.
>>>
>>> Option two looks very promising, but it has some weaknesses:
>>> 2.1 It won't work if the dead conductor was renamed or deleted.
>>> 2.2 It won't work if the node's specific driver was not enabled on
>>> live conductors.
>>> 2.3 It won't work if the node is in maintenance (only a corner case).
>>
>> We can and should fix all three cases.
>
> 2.1 and 2.2 appear to be a bug in the behavior of
> _check_deploying_status(). The method claims to do exactly what you
> suggest in 2.1 and 2.2 -- it gathers a list of Nodes reserved by *any*
> offline conductor and tries to release the lock. However, it will always
> fail to update them, because objects.Node.release() raises a NodeLocked
> exception when called on a Node locked by a different conductor.
>
> Here's the relevant code path:
>
> ironic/conductor/manager.py:
> 1259     def _check_deploying_status(self, context):
>          ...
> 1269         offline_conductors = self.dbapi.get_offline_conductors()
>          ...
> 1273         node_iter = self.iter_nodes(
> 1274             fields=['id', 'reservation'],
> 1275             filters={'provision_state': states.DEPLOYING,
> 1276                      'maintenance': False,
> 1277                      'reserved_by_any_of': offline_conductors})
>          ...
> 1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
> 1285             try:
> 1286                 objects.Node.release(context, conductor_hostname, node_id)
>          ...
> 1292             except exception.NodeLocked:
> 1293                 LOG.warning(...)
> 1297                 continue
>
> As far as 2.3, I think we should change the query string at the start
> of this method so that it includes nodes in maintenance mode. I think
> it's both safe and reasonable (and, frankly, what an operator will
> expect) that a node which is in maintenance mode, and in DEPLOYING
> state, whose conductor is offline, should have that reservation cleared
> and be set to DEPLOYFAILED state.

This is an excellent idea -- and I'm going to extend it further. If I have
any nodes in a *ING state and they are put into maintenance, it should force
a failure. This is potentially a more API-friendly way of cleaning up nodes
in bad states -- an operator would put the node into maintenance, and once
it enters the *FAIL state, troubleshoot why it failed, take it out of
maintenance, and return it to production.

I obviously strongly desire an "override command" as an operator, but I
really think this could handle a large percentage of the use cases that made
me desire it in the first place.

> --devananda
>
>>> Definitely we should improve option 2, but there could be more issues
>>> I don't know about in a more complicated environment.
>>> So my question is: do we still need a new command to recover these
>>> nodes more easily without accessing the DB, like this PoC [2]:
>>> ironic-noderecover --node_uuids=UUID1,UUID2
>>> --config-file=/etc/ironic/ironic.conf
>>
>> I'm -1 to anything silently removing the lock until I see a clear use
>> case which is impossible to address within Ironic itself. Such a
>> utility may and will be abused.
>>
>> I'm fine with anything that does not forcibly remove the lock by
>> default.

I agree such a utility could be abused. I don't think that's a good argument
for not writing it for operators. However, I agree that any utility we write
that could or would modify a lock should not do so by default, and should
warn before doing so; still, there are cases where getting a lock cleared is
desirable and necessary. A good example of this would be an ironic-conductor
failing while a node is locked and being brought back up with a different
hostname. Today, there's no way to get that lock off that node again. Even
if you force operators to replace a conductor with one with an identical
hostname, any nodes locked while this replacement was occurring would remain
locked.

Thanks,
Jay Faulkner

>>> Best Regards,
>>>
>>> Tan
>>>
>>> [1] https://review.openstack.org/#/c/319812
>>> [2] https://review.openstack.org/#/c/311273/
Re: [openstack-dev] [ironic] Tooling for recovering nodes
Hey Tan, some comments inline.

On 5/31/16 1:25 AM, Tan, Lin wrote:
> Hi,
>
> Recently, I have been working on a spec [1] to recover nodes that get
> stuck in the deploying state, and I would really appreciate some feedback
> from you.
>
> Ironic nodes can get stuck in
> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node
> is reserved by a dead conductor (i.e. the exclusive lock was never
> released). Any further requests will be denied by ironic because it
> thinks the node resource is under the control of another conductor.
>
> To be more clear, let's narrow the scope and focus on the deploying state
> first. Currently, operators have several choices for clearing the
> reserved lock:
> 1. Restart the dead conductor.
> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
> lock.
> 3. Touch the DB to manually recover these nodes.

I actually like option #3 being optionally integrated into a tool to clear
nodes stuck in a *ING state. If specified, it would clear the lock on the
deploy as it moved the node from DEPLOYING -> DEPLOYFAILED. Obviously, this
could be dangerous for cleaning, and should be documented as such -- imagine
clearing a lock mid-firmware-flash and then having a power action brick the
node. Given this is tooling intended to handle many cases, I think it's
better to give the operator the choice to take more dramatic action if they
wish.

Thanks,
Jay Faulkner

> Option two looks very promising, but it has some weaknesses:
> 2.1 It won't work if the dead conductor was renamed or deleted.
> 2.2 It won't work if the node's specific driver was not enabled on live
> conductors.
> 2.3 It won't work if the node is in maintenance (only a corner case).
>
> Definitely we should improve option 2, but there could be more issues I
> don't know about in a more complicated environment.
> So my question is: do we still need a new command to recover these nodes
> more easily without accessing the DB, like this PoC [2]:
> ironic-noderecover --node_uuids=UUID1,UUID2
> --config-file=/etc/ironic/ironic.conf
>
> Best Regards,
>
> Tan
>
> [1] https://review.openstack.org/#/c/319812
> [2] https://review.openstack.org/#/c/311273/

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
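The behavior Jay sketches above -- optionally clearing the lock while moving
a stuck node to the matching *FAIL state, with cleaning gated behind an
explicit override -- could look roughly like this. This is a hypothetical
sketch: the state table, the `clear_lock` and `force` flags, and the
function name are illustrative assumptions, not a real ironic interface.

```python
# Hypothetical sketch of an operator recovery tool: move a stuck node to
# the matching *FAIL state and optionally clear its lock. State names
# mirror the thread's discussion; the interface itself is invented.

FAIL_STATE = {
    'deploying': 'deploy failed',
    'cleaning': 'clean failed',
}
# Clearing a lock mid-clean (e.g. during a firmware flash) can brick a
# node, so cleaning requires an explicit opt-in.
DANGEROUS_STATES = {'cleaning'}

def recover_node(node, clear_lock=False, force=False):
    """Move a stuck node to its *FAIL state, optionally clearing the lock."""
    state = node['provision_state']
    if state not in FAIL_STATE:
        raise ValueError('not a recoverable *ing state: %s' % state)
    if state in DANGEROUS_STATES and not force:
        raise RuntimeError('refusing to recover %s node without force' % state)
    if clear_lock:
        node['reservation'] = None
    node['provision_state'] = FAIL_STATE[state]
    return node
```

This matches the thread's consensus that the tool should warn (here: refuse)
before taking dramatic action, while still letting the operator override.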
Re: [openstack-dev] [ironic] Tooling for recovering nodes
Thanks, Devananda, for your suggestions. I opened a new bug for it.

I am asking because this is a task from the Newton summit, to create a new
command "for getting nodes out of stuck *ing states":
https://etherpad.openstack.org/p/ironic-newton-summit-ops
And we already have an RFE bug for this [1].

But as Dmitry said, there is a big risk in removing the lock on nodes and
marking them as deploy failed. On the other hand, if the tool doesn't remove
the lock, then users still cannot manipulate the node resource. So I want to
involve more people in discussing the spec [2]. Considering ironic already
has _check_deploying_status() to recover the deploying state, should I focus
on improving it, or is there still a need to create a new command?

B.R.
Tan

[1] https://bugs.launchpad.net/ironic/+bug/1580931
[2] https://review.openstack.org/#/c/319812

-----Original Message-----
From: Devananda van der Veen [mailto:devananda@gmail.com]
Sent: Wednesday, June 1, 2016 3:26 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>>
>> Recently, I have been working on a spec [1] to recover nodes that get
>> stuck in the deploying state, and I would really appreciate some
>> feedback from you.
>>
>> Ironic nodes can get stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the
>> node is reserved by a dead conductor (i.e. the exclusive lock was never
>> released). Any further requests will be denied by ironic because it
>> thinks the node resource is under the control of another conductor.
>>
>> To be more clear, let's narrow the scope and focus on the deploying
>> state first. Currently, operators have several choices for clearing the
>> reserved lock:
>> 1. Restart the dead conductor.
>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
>> lock.
>> 3. Touch the DB to manually recover these nodes.
>>
>> Option two looks very promising, but it has some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver was not enabled on
>> live conductors.
>> 2.3 It won't work if the node is in maintenance (only a corner case).
>
> We can and should fix all three cases.

2.1 and 2.2 appear to be a bug in the behavior of
_check_deploying_status(). The method claims to do exactly what you suggest
in 2.1 and 2.2 -- it gathers a list of Nodes reserved by *any* offline
conductor and tries to release the lock. However, it will always fail to
update them, because objects.Node.release() raises a NodeLocked exception
when called on a Node locked by a different conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259     def _check_deploying_status(self, context):
         ...
1269         offline_conductors = self.dbapi.get_offline_conductors()
         ...
1273         node_iter = self.iter_nodes(
1274             fields=['id', 'reservation'],
1275             filters={'provision_state': states.DEPLOYING,
1276                      'maintenance': False,
1277                      'reserved_by_any_of': offline_conductors})
         ...
1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285             try:
1286                 objects.Node.release(context, conductor_hostname, node_id)
         ...
1292             except exception.NodeLocked:
1293                 LOG.warning(...)
1297                 continue

As far as 2.3, I think we should change the query string at the start of
this method so that it includes nodes in maintenance mode. I think it's both
safe and reasonable (and, frankly, what an operator will expect) that a node
which is in maintenance mode, and in DEPLOYING state, whose conductor is
offline, should have that reservation cleared and be set to DEPLOYFAILED
state.

--devananda

>> Definitely we should improve option 2, but there could be more issues I
>> don't know about in a more complicated environment.
>> So my question is: do we still need a new command to recover these
>> nodes more easily without accessing the DB, like this PoC [2]:
>> ironic-noderecover --node_uuids=UUID1,UUID2
>> --config-file=/etc/ironic/ironic.conf
>
> I'm -1 to anything silently removing the lock until I see a clear use
> case which is impossible to address within Ironic itself. Such a utility
> may and will be abused.
>
> I'm fine with anything that does not forcibly remove the lock by
> default.
>
>> Best Regards,
>>
>> Tan
Re: [openstack-dev] [ironic] Tooling for recovering nodes
On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>>
>> Recently, I have been working on a spec [1] to recover nodes that get
>> stuck in the deploying state, and I would really appreciate some
>> feedback from you.
>>
>> Ironic nodes can get stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the
>> node is reserved by a dead conductor (i.e. the exclusive lock was never
>> released). Any further requests will be denied by ironic because it
>> thinks the node resource is under the control of another conductor.
>>
>> To be more clear, let's narrow the scope and focus on the deploying
>> state first. Currently, operators have several choices for clearing the
>> reserved lock:
>> 1. Restart the dead conductor.
>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
>> lock.
>> 3. Touch the DB to manually recover these nodes.
>>
>> Option two looks very promising, but it has some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver was not enabled on
>> live conductors.
>> 2.3 It won't work if the node is in maintenance (only a corner case).
>
> We can and should fix all three cases.

2.1 and 2.2 appear to be a bug in the behavior of
_check_deploying_status(). The method claims to do exactly what you suggest
in 2.1 and 2.2 -- it gathers a list of Nodes reserved by *any* offline
conductor and tries to release the lock. However, it will always fail to
update them, because objects.Node.release() raises a NodeLocked exception
when called on a Node locked by a different conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259     def _check_deploying_status(self, context):
         ...
1269         offline_conductors = self.dbapi.get_offline_conductors()
         ...
1273         node_iter = self.iter_nodes(
1274             fields=['id', 'reservation'],
1275             filters={'provision_state': states.DEPLOYING,
1276                      'maintenance': False,
1277                      'reserved_by_any_of': offline_conductors})
         ...
1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285             try:
1286                 objects.Node.release(context, conductor_hostname, node_id)
         ...
1292             except exception.NodeLocked:
1293                 LOG.warning(...)
1297                 continue

As far as 2.3, I think we should change the query string at the start of
this method so that it includes nodes in maintenance mode. I think it's both
safe and reasonable (and, frankly, what an operator will expect) that a node
which is in maintenance mode, and in DEPLOYING state, whose conductor is
offline, should have that reservation cleared and be set to DEPLOYFAILED
state.

--devananda

>> Definitely we should improve option 2, but there could be more issues I
>> don't know about in a more complicated environment.
>> So my question is: do we still need a new command to recover these
>> nodes more easily without accessing the DB, like this PoC [2]:
>> ironic-noderecover --node_uuids=UUID1,UUID2
>> --config-file=/etc/ironic/ironic.conf
>
> I'm -1 to anything silently removing the lock until I see a clear use
> case which is impossible to address within Ironic itself. Such a utility
> may and will be abused.
>
> I'm fine with anything that does not forcibly remove the lock by
> default.
>
>> Best Regards,
>>
>> Tan
>>
>> [1] https://review.openstack.org/#/c/319812
>> [2] https://review.openstack.org/#/c/311273/
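The change proposed for case 2.3 amounts to relaxing the filter so that
maintenance-mode nodes with stale locks are recovered too. The selection
below is a hypothetical, self-contained sketch of that filter change; the
function and the dict-based node layout are assumptions for illustration,
while ironic's real query runs DB-side through self.iter_nodes().

```python
# Hypothetical sketch of the 2.3 fix: stop excluding maintenance nodes
# when collecting DEPLOYING nodes whose lock holder is offline.

def select_stuck_deploying(nodes, offline_conductors, include_maintenance=True):
    """Pick DEPLOYING nodes whose lock is held by an offline conductor."""
    picked = []
    for node in nodes:
        if node['provision_state'] != 'deploying':
            continue
        if node['reservation'] not in offline_conductors:
            continue
        if node['maintenance'] and not include_maintenance:
            # Old behavior: maintenance nodes were skipped (case 2.3).
            continue
        picked.append(node['uuid'])
    return picked
```

With `include_maintenance=True`, a node that is both in maintenance and in
DEPLOYING with an offline lock holder is selected, matching the behavior an
operator would expect per the discussion above.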
Re: [openstack-dev] [ironic] Tooling for recovering nodes
On 05/31/2016 10:25 AM, Tan, Lin wrote:
> Hi,
>
> Recently, I have been working on a spec [1] to recover nodes that get
> stuck in the deploying state, and I would really appreciate some feedback
> from you.
>
> Ironic nodes can get stuck in
> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node
> is reserved by a dead conductor (i.e. the exclusive lock was never
> released). Any further requests will be denied by ironic because it
> thinks the node resource is under the control of another conductor.
>
> To be more clear, let's narrow the scope and focus on the deploying state
> first. Currently, operators have several choices for clearing the
> reserved lock:
> 1. Restart the dead conductor.
> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
> lock.
> 3. Touch the DB to manually recover these nodes.
>
> Option two looks very promising, but it has some weaknesses:
> 2.1 It won't work if the dead conductor was renamed or deleted.
> 2.2 It won't work if the node's specific driver was not enabled on live
> conductors.
> 2.3 It won't work if the node is in maintenance (only a corner case).

We can and should fix all three cases.

> Definitely we should improve option 2, but there could be more issues I
> don't know about in a more complicated environment.
> So my question is: do we still need a new command to recover these nodes
> more easily without accessing the DB, like this PoC [2]:
> ironic-noderecover --node_uuids=UUID1,UUID2
> --config-file=/etc/ironic/ironic.conf

I'm -1 to anything silently removing the lock until I see a clear use case
which is impossible to address within Ironic itself. Such a utility may and
will be abused.

I'm fine with anything that does not forcibly remove the lock by default.

> Best Regards,
>
> Tan
>
> [1] https://review.openstack.org/#/c/319812
> [2] https://review.openstack.org/#/c/311273/
[openstack-dev] [ironic] Tooling for recovering nodes
Hi,

Recently, I have been working on a spec [1] to recover nodes that get stuck
in the deploying state, and I would really appreciate some feedback from you.

Ironic nodes can get stuck in
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
reserved by a dead conductor (i.e. the exclusive lock was never released).
Any further requests will be denied by ironic because it thinks the node
resource is under the control of another conductor.

To be more clear, let's narrow the scope and focus on the deploying state
first. Currently, operators have several choices for clearing the reserved
lock:
1. Restart the dead conductor.
2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
lock.
3. Touch the DB to manually recover these nodes.

Option two looks very promising, but it has some weaknesses:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live
conductors.
2.3 It won't work if the node is in maintenance (only a corner case).

Definitely we should improve option 2, but there could be more issues I
don't know about in a more complicated environment. So my question is: do we
still need a new command to recover these nodes more easily without
accessing the DB, like this PoC [2]:

ironic-noderecover --node_uuids=UUID1,UUID2 --config-file=/etc/ironic/ironic.conf

Best Regards,

Tan

[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/
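For concreteness, option 3 ("touch the DB") boils down to a single UPDATE
that clears the stale reservation and fails the deploy. The sketch below is
illustrative only, run against an in-memory SQLite table: ironic's real
schema lives in MySQL or PostgreSQL and has many more columns, so the table
layout here is a simplified assumption.

```python
# Illustrative only: the manual DB recovery of option 3, against a toy
# SQLite table standing in for ironic's nodes table.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE nodes (uuid TEXT, reservation TEXT, provision_state TEXT)")
conn.execute(
    "INSERT INTO nodes VALUES ('node-1', 'dead-conductor', 'deploying')")

# The manual recovery: clear the stale lock and fail the deploy, scoped to
# the dead conductor's hostname so locks held by live conductors are left
# alone.
conn.execute(
    "UPDATE nodes SET reservation = NULL, provision_state = 'deploy failed' "
    "WHERE reservation = ? AND provision_state = 'deploying'",
    ('dead-conductor',))

row = conn.execute(
    "SELECT reservation, provision_state FROM nodes WHERE uuid = 'node-1'"
).fetchone()
```

This is exactly the kind of surgery the proposed ironic-noderecover tool
would wrap, which is why the thread debates how carefully it must be gated.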