Re: [openstack-dev] [ironic] Tooling for recovering nodes

2016-07-11 Thread Tan, Lin
Hi Jay, Dmitry, and all,

I submitted two patches that try to recover nodes stuck in the deploying state:

1. Fix the case where the ironic-conductor is brought back up with a different
hostname:
 https://review.openstack.org/325026
2. Clear the lock on nodes in the maintenance state:
 https://review.openstack.org/#/c/324269/

If the above solutions are promising, then we don't need a new tool to recover
nodes in the deploying state.

B.R

Tan



-----Original Message-----
From: Jay Faulkner [mailto:j...@jvf.cc] 
Sent: Thursday, June 2, 2016 7:45 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

Some comments inline.


On 5/31/16 12:26 PM, Devananda van der Veen wrote:
> On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
>> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>>> Hi,
>>>
>>> Recently, I have been working on a spec[1] to recover nodes that get
>>> stuck in the deploying state, so I would really appreciate some
>>> feedback from you.
>>>
>>> Ironic nodes can be stuck in
>>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the 
>>> node is reserved by a dead conductor (the exclusive lock was not released).
>>> Any further requests will be denied by ironic because it thinks the
>>> node resource is under the control of another conductor.
>>>
>>> To be clearer, let's narrow the scope and focus on the deploying
>>> state first. Currently, people have several choices to clear the
>>> reserved lock:
>>> 1. Restart the dead conductor.
>>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
>>> lock.
>>> 3. The operator touches the DB to manually recover these nodes.
>>>
>>> Option two looks very promising, but it has some weaknesses:
>>> 2.1 It won't work if the dead conductor was renamed or deleted.
>>> 2.2 It won't work if the node's specific driver was not enabled on
>>> live conductors.
>>> 2.3 It won't work if the node is in maintenance (only a corner case).
>> We can and should fix all three cases.
> 2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().
>
> The method claims to do exactly what you suggest in 2.1 and 2.2 -- it 
> gathers a list of Nodes reserved by *any* offline conductor and tries to 
> release the lock.
> However, it will always fail to update them, because 
> objects.Node.release() raises a NodeLocked exception when called on a Node 
> locked by a different conductor.
>
> Here's the relevant code path:
>
> ironic/conductor/manager.py:
> 1259 def _check_deploying_status(self, context):
> ...
> 1269 offline_conductors = self.dbapi.get_offline_conductors()
> ...
> 1273 node_iter = self.iter_nodes(
> 1274 fields=['id', 'reservation'],
> 1275 filters={'provision_state': states.DEPLOYING,
> 1276  'maintenance': False,
> 1277  'reserved_by_any_of': offline_conductors})
> ...
> 1281 for node_uuid, driver, node_id, conductor_hostname in node_iter:
> 1285 try:
> 1286 objects.Node.release(context, conductor_hostname, node_id)
> ...
> 1292 except exception.NodeLocked:
> 1293 LOG.warning(...)
> 1297 continue
>
>
> As far as 2.3, I think we should change the query string at the start 
> of this method so that it includes nodes in maintenance mode. I think 
> it's both safe and reasonable (and, frankly, what an operator will 
> expect) that a node which is in maintenance mode, and in DEPLOYING 
> state, whose conductor is offline, should have that reservation cleared and 
> be set to DEPLOYFAILED state.

This is an excellent idea -- and I'm going to extend it further. If I have any 
nodes in a *ING state, and they are put into maintenance, it should force a 
failure. This is potentially a more API-friendly way of cleaning up nodes in 
bad states -- an operator would need to maintenance the node, and once it 
enters the *FAIL state, troubleshoot why it failed, unmaintenance, and return 
to production.
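
As a rough illustration, here is a minimal sketch of that behavior. The
state mapping and the last_error text are assumptions on my part, not
merged ironic code:

    # Sketch: when a node in a transient *ING state is put into
    # maintenance, move it to the matching *FAILED state and drop the
    # stale reservation so the operator can troubleshoot it.
    TRANSIENT_TO_FAILED = {
        'deploying': 'deploy failed',
        'cleaning': 'clean failed',
        'inspecting': 'inspect failed',
    }

    def fail_on_maintenance(node):
        if node.maintenance and node.provision_state in TRANSIENT_TO_FAILED:
            node.reservation = None
            node.provision_state = TRANSIENT_TO_FAILED[node.provision_state]
            node.last_error = 'Forced to a failed state while in maintenance'
            node.save()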

I obviously strongly desire an "override command" as an operator, but I really 
think this could handle a large percentage of the use cases that made me desire 
it in the first place.

> --devananda
>
>>> Definitely we should improve option 2, but there could be more
>>> issues I don't know about in a more complicated environment.
>>> So my question is: do we still need a new command to recover these
>>> nodes more easily without accessing the DB, like this PoC [2]:
>>>    ironic-noderecover --node_uuids=UUID1,UUID2
>>> --config-file=/etc/ironic/ironic.conf

Re: [openstack-dev] [ironic] Tooling for recovering nodes

2016-06-01 Thread Jay Faulkner

Some comments inline.


On 5/31/16 12:26 PM, Devananda van der Veen wrote:

On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:

On 05/31/2016 10:25 AM, Tan, Lin wrote:

Hi,

Recently, I have been working on a spec[1] to recover nodes that get stuck
in the deploying state, so I would really appreciate some feedback from you.

Ironic nodes can be stuck in
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
reserved by a dead conductor (the exclusive lock was not released).
Any further requests will be denied by ironic because it thinks the node
resource is under the control of another conductor.

To be clearer, let's narrow the scope and focus on the deploying state
first. Currently, people have several choices to clear the reserved lock:
1. Restart the dead conductor.
2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the lock.
3. The operator touches the DB to manually recover these nodes.

Option two looks very promising, but it has some weaknesses:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live
conductors.
2.3 It won't work if the node is in maintenance (only a corner case).

We can and should fix all three cases.

2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().

The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers a
list of Nodes reserved by *any* offline conductor and tries to release the lock.
However, it will always fail to update them, because objects.Node.release()
raises a NodeLocked exception when called on a Node locked by a different 
conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259 def _check_deploying_status(self, context):
...
1269 offline_conductors = self.dbapi.get_offline_conductors()
...
1273 node_iter = self.iter_nodes(
1274 fields=['id', 'reservation'],
1275 filters={'provision_state': states.DEPLOYING,
1276  'maintenance': False,
1277  'reserved_by_any_of': offline_conductors})
...
1281 for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285 try:
1286 objects.Node.release(context, conductor_hostname, node_id)
...
1292 except exception.NodeLocked:
1293 LOG.warning(...)
1297 continue


As far as 2.3, I think we should change the query string at the start of this
method so that it includes nodes in maintenance mode. I think it's both safe and
reasonable (and, frankly, what an operator will expect) that a node which is in
maintenance mode, and in DEPLOYING state, whose conductor is offline, should
have that reservation cleared and be set to DEPLOYFAILED state.


This is an excellent idea -- and I'm going to extend it further. If I 
have any nodes in a *ING state, and they are put into maintenance, it 
should force a failure. This is potentially a more API-friendly way of 
cleaning up nodes in bad states -- an operator would need to maintenance 
the node, and once it enters the *FAIL state, troubleshoot why it 
failed, unmaintenance, and return to production.


I obviously strongly desire an "override command" as an operator, but I 
really think this could handle a large percentage of the use cases that 
made me desire it in the first place.



--devananda


Definitely we should improve option 2, but there could be more issues
I don't know about in a more complicated environment.
So my question is: do we still need a new command to recover these nodes
more easily without accessing the DB, like this PoC [2]:
   ironic-noderecover --node_uuids=UUID1,UUID2
--config-file=/etc/ironic/ironic.conf

I'm -1 to anything silently removing the lock until I see a clear use case
that is impossible to address within Ironic itself. Such a utility may and
will be abused.

I'm fine with anything that does not forcibly remove the lock by default.
I agree such a utility could be abused, but I don't think that's a good
argument for not writing it for operators. Any utility we write that could
or would modify a lock should not do so by default and should warn before
doing so; still, there are cases where getting a lock cleared is desirable
and necessary.
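
A minimal sketch of what that operator-facing flow could look like. The
--force-release-lock flag and the confirmation prompt are assumptions
about the tool's interface, not the PoC's actual options:

    import argparse
    import sys

    parser = argparse.ArgumentParser(prog='ironic-noderecover')
    parser.add_argument('--node_uuids', required=True,
                        help='Comma-separated node UUIDs to recover.')
    parser.add_argument('--config-file', help='Path to ironic.conf.')
    parser.add_argument('--force-release-lock', action='store_true',
                        help='Also clear the reservation (dangerous).')
    args = parser.parse_args()

    if args.force_release_lock:
        # Never remove a lock silently: warn and require confirmation.
        sys.stderr.write('WARNING: clearing a lock on a node that is '
                         'mid-operation can leave the hardware in a bad '
                         'state.\n')
        if input('Type "yes" to continue: ') != 'yes':
            sys.exit(1)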


A good example of this would be an ironic-conductor failing while a node 
is locked, and being brought up with a different hostname. Today, 
there's no way to get that lock off that node again.


Even if you force operators to replace a conductor with one with an
identical hostname, any locked nodes would remain locked for however long
the replacement takes.


Thanks,
Jay Faulkner

Best Regards,

Tan


[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/



Re: [openstack-dev] [ironic] Tooling for recovering nodes

2016-06-01 Thread Jay Faulkner

Hey Tan, some comments inline.


On 5/31/16 1:25 AM, Tan, Lin wrote:

Hi,

Recently, I have been working on a spec[1] to recover nodes that get stuck
in the deploying state, so I would really appreciate some feedback from you.

Ironic nodes can be stuck in 
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is 
reserved by a dead conductor (the exclusive lock was not released).
Any further requests will be denied by ironic because it thinks the node
resource is under the control of another conductor.

To be clearer, let's narrow the scope and focus on the deploying state
first. Currently, people have several choices to clear the reserved lock:
1. Restart the dead conductor.
2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the lock.
3. The operator touches the DB to manually recover these nodes.
I actually like option #3 being optionally integrated into a tool to
clear nodes stuck in an *ING state. If specified, it would clear the lock
on the deploy as it moved the node from DEPLOYING -> DEPLOYFAILED.
Obviously, for cleaning this could be dangerous, and it should be
documented as such -- imagine clearing a lock mid-firmware-flash and
having a power action taken that bricks the node.
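
To make that gating concrete, here is a minimal sketch. The state split
and the force_dangerous parameter are my assumptions, not the PoC's
actual design:

    # Sketch: clearing a DEPLOYING lock is routine; clearing a CLEANING
    # lock (e.g. mid-firmware-flash) requires explicit acknowledgement.
    FAILED_STATE = {'deploying': 'deploy failed',
                    'cleaning': 'clean failed'}

    def recover(node, force_dangerous=False):
        state = node.provision_state
        if state == 'cleaning' and not force_dangerous:
            raise RuntimeError('Refusing to clear a lock in %s: an '
                               'interrupted clean step may brick the '
                               'node.' % state)
        node.reservation = None
        node.provision_state = FAILED_STATE[state]
        node.save()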


Given this is tooling intended to handle many cases, I think it's better 
to give the operator the choice to take more dramatic action if they wish.



Thanks,
Jay Faulkner

Option two looks very promising, but it has some weaknesses:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live
conductors.
2.3 It won't work if the node is in maintenance (only a corner case).

Definitely we should improve option 2, but there could be more issues I
don't know about in a more complicated environment.
So my question is: do we still need a new command to recover these nodes
more easily without accessing the DB, like this PoC [2]:
   ironic-noderecover --node_uuids=UUID1,UUID2
--config-file=/etc/ironic/ironic.conf

Best Regards,

Tan


[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/




Re: [openstack-dev] [ironic] Tooling for recovering nodes

2016-05-31 Thread Tan, Lin
Thanks Devananda for your suggestions. I opened a new bug for it.

I am asking because this is a task from the Newton summit, to create a new
command "for getting nodes out of stuck *ing states":
https://etherpad.openstack.org/p/ironic-newton-summit-ops
We already have an RFE bug for this [1].

But as Dmitry said, there is a big risk in removing the lock on nodes and
marking them as deploy failed. On the other hand, if the tool doesn't remove
the lock, users still cannot manipulate the node resource. So I want to
involve more people in discussing the spec[2].

Considering ironic already has _check_deploying_status() to recover the
deploying state, should I focus on improving it? Or is there still a need
to create a new command?

B.R

Tan

[1]https://bugs.launchpad.net/ironic/+bug/1580931
[2]https://review.openstack.org/#/c/319812
-----Original Message-----
From: Devananda van der Veen [mailto:devananda@gmail.com] 
Sent: Wednesday, June 1, 2016 3:26 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>>
>> Recently, I have been working on a spec[1] to recover nodes that get
>> stuck in the deploying state, so I would really appreciate some feedback
>> from you.
>>
>> Ironic nodes can be stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the 
>> node is reserved by a dead conductor (the exclusive lock was not released).
>> Any further requests will be denied by ironic because it thinks the
>> node resource is under the control of another conductor.
>>
>> To be clearer, let's narrow the scope and focus on the deploying
>> state first. Currently, people have several choices to clear the
>> reserved lock:
>> 1. Restart the dead conductor.
>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
>> lock.
>> 3. The operator touches the DB to manually recover these nodes.
>>
>> Option two looks very promising, but it has some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver was not enabled on
>> live conductors.
>> 2.3 It won't work if the node is in maintenance (only a corner case).
> 
> We can and should fix all three cases.

2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().

The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers a 
list of Nodes reserved by *any* offline conductor and tries to release the lock.
However, it will always fail to update them, because objects.Node.release() 
raises a NodeLocked exception when called on a Node locked by a different 
conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259 def _check_deploying_status(self, context):
...
1269 offline_conductors = self.dbapi.get_offline_conductors()
...
1273 node_iter = self.iter_nodes(
1274 fields=['id', 'reservation'],
1275 filters={'provision_state': states.DEPLOYING,
1276  'maintenance': False,
1277  'reserved_by_any_of': offline_conductors})
...
1281 for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285 try:
1286 objects.Node.release(context, conductor_hostname, node_id)
...
1292 except exception.NodeLocked:
1293 LOG.warning(...)
1297 continue


As far as 2.3, I think we should change the query string at the start of this 
method so that it includes nodes in maintenance mode. I think it's both safe 
and reasonable (and, frankly, what an operator will expect) that a node which 
is in maintenance mode, and in DEPLOYING state, whose conductor is offline, 
should have that reservation cleared and be set to DEPLOYFAILED state.

--devananda

>>
>> Definitely we should improve option 2, but there could be more
>> issues I don't know about in a more complicated environment.
>> So my question is: do we still need a new command to recover these
>> nodes more easily without accessing the DB, like this PoC [2]:
>>   ironic-noderecover --node_uuids=UUID1,UUID2
>> --config-file=/etc/ironic/ironic.conf
> 
> I'm -1 to anything silently removing the lock until I see a clear use
> case that is impossible to address within Ironic itself. Such a utility
> may and will be abused.
> 
> I'm fine with anything that does not forcibly remove the lock by default.
> 
>>
>> Best Regards,
>>
>> Tan
>>
>>
>> [1] https://review.openstack.org/#/c/319812
>> [2] https://review.openstack.org/#/c/311273/
>>

Re: [openstack-dev] [ironic] Tooling for recovering nodes

2016-05-31 Thread Devananda van der Veen
On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>>
>> Recently, I have been working on a spec[1] to recover nodes that get stuck
>> in the deploying state, so I would really appreciate some feedback from you.
>>
>> Ironic nodes can be stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
>> reserved by a dead conductor (the exclusive lock was not released).
>> Any further requests will be denied by ironic because it thinks the node
>> resource is under the control of another conductor.
>>
>> To be clearer, let's narrow the scope and focus on the deploying state
>> first. Currently, people have several choices to clear the reserved lock:
>> 1. Restart the dead conductor.
>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
>> lock.
>> 3. The operator touches the DB to manually recover these nodes.
>>
>> Option two looks very promising, but it has some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver was not enabled on live
>> conductors.
>> 2.3 It won't work if the node is in maintenance (only a corner case).
> 
> We can and should fix all three cases.

2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().

The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers a
list of Nodes reserved by *any* offline conductor and tries to release the lock.
However, it will always fail to update them, because objects.Node.release()
raises a NodeLocked exception when called on a Node locked by a different 
conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259 def _check_deploying_status(self, context):
...
1269 offline_conductors = self.dbapi.get_offline_conductors()
...
1273 node_iter = self.iter_nodes(
1274 fields=['id', 'reservation'],
1275 filters={'provision_state': states.DEPLOYING,
1276  'maintenance': False,
1277  'reserved_by_any_of': offline_conductors})
...
1281 for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285 try:
1286 objects.Node.release(context, conductor_hostname, node_id)
...
1292 except exception.NodeLocked:
1293 LOG.warning(...)
1297 continue


As far as 2.3, I think we should change the query string at the start of this
method so that it includes nodes in maintenance mode. I think it's both safe and
reasonable (and, frankly, what an operator will expect) that a node which is in
maintenance mode, and in DEPLOYING state, whose conductor is offline, should
have that reservation cleared and be set to DEPLOYFAILED state.
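
A minimal sketch of that query change, based on the snippet above -- this
is my reading of the suggestion, not the merged patch. Dropping the
maintenance filter lets reservations held by offline conductors be cleared
for nodes in maintenance mode as well:

    node_iter = self.iter_nodes(
        fields=['id', 'reservation'],
        filters={'provision_state': states.DEPLOYING,
                 'reserved_by_any_of': offline_conductors})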

--devananda

>>
>> Definitely we should improve option 2, but there could be more issues
>> I don't know about in a more complicated environment.
>> So my question is: do we still need a new command to recover these nodes
>> more easily without accessing the DB, like this PoC [2]:
>>   ironic-noderecover --node_uuids=UUID1,UUID2
>> --config-file=/etc/ironic/ironic.conf
> 
> I'm -1 to anything silently removing the lock until I see a clear use case
> that is impossible to address within Ironic itself. Such a utility may and
> will be abused.
> 
> I'm fine with anything that does not forcibly remove the lock by default.
> 
>>
>> Best Regards,
>>
>> Tan
>>
>>
>> [1] https://review.openstack.org/#/c/319812
>> [2] https://review.openstack.org/#/c/311273/
>>



Re: [openstack-dev] [ironic] Tooling for recovering nodes

2016-05-31 Thread Dmitry Tantsur

On 05/31/2016 10:25 AM, Tan, Lin wrote:

Hi,

Recently, I have been working on a spec[1] to recover nodes that get stuck
in the deploying state, so I would really appreciate some feedback from you.

Ironic nodes can be stuck in 
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is 
reserved by a dead conductor (the exclusive lock was not released).
Any further requests will be denied by ironic because it thinks the node
resource is under the control of another conductor.

To be clearer, let's narrow the scope and focus on the deploying state
first. Currently, people have several choices to clear the reserved lock:
1. Restart the dead conductor.
2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the lock.
3. The operator touches the DB to manually recover these nodes.

Option two looks very promising, but it has some weaknesses:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live
conductors.
2.3 It won't work if the node is in maintenance (only a corner case).


We can and should fix all three cases.



Definitely we should improve option 2, but there could be more issues I
don't know about in a more complicated environment.
So my question is: do we still need a new command to recover these nodes
more easily without accessing the DB, like this PoC [2]:
  ironic-noderecover --node_uuids=UUID1,UUID2
--config-file=/etc/ironic/ironic.conf


I'm -1 to anything silently removing the lock until I see a clear use
case that is impossible to address within Ironic itself. Such a utility
may and will be abused.


I'm fine with anything that does not forcibly remove the lock by default.



Best Regards,

Tan


[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/




[openstack-dev] [ironic] Tooling for recovering nodes

2016-05-31 Thread Tan, Lin
Hi,

Recently, I have been working on a spec[1] to recover nodes that get stuck
in the deploying state, so I would really appreciate some feedback from you.

Ironic nodes can be stuck in 
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is 
reserved by a dead conductor (the exclusive lock was not released).
Any further requests will be denied by ironic because it thinks the node
resource is under the control of another conductor.

To be clearer, let's narrow the scope and focus on the deploying state
first. Currently, people have several choices to clear the reserved lock:
1. Restart the dead conductor.
2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the lock.
3. The operator touches the DB to manually recover these nodes (see the
sketch below).
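
For option 3, here is a minimal sketch of the manual DB touch. The
connection URL is a placeholder and the column/state names assume ironic's
default schema, so verify against your deployment before running anything
like this:

    from sqlalchemy import create_engine, text

    # Placeholder DSN; use the [database]/connection value from ironic.conf.
    engine = create_engine('mysql+pymysql://ironic:secret@localhost/ironic')
    with engine.begin() as conn:
        # Clear the stale reservation and mark the node deploy-failed.
        conn.execute(
            text("UPDATE nodes SET reservation = NULL, "
                 "provision_state = 'deploy failed' "
                 "WHERE uuid = :uuid AND provision_state = 'deploying'"),
            {'uuid': 'UUID1'},
        )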

Option two looks very promising, but it has some weaknesses:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live
conductors.
2.3 It won't work if the node is in maintenance (only a corner case).

Definitely we should improve option 2, but there could be more issues I
don't know about in a more complicated environment.
So my question is: do we still need a new command to recover these nodes
more easily without accessing the DB, like this PoC [2]:
  ironic-noderecover --node_uuids=UUID1,UUID2
--config-file=/etc/ironic/ironic.conf

Best Regards,

Tan


[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev