Re: [openstack-dev] [ironic] [nova] Ironic virt driver resources reporting
On Wed, Jan 4, 2017 at 5:45 PM, Vladyslav Drok wrote:

> Thanks all for replies!
>
> On Tue, Jan 3, 2017 at 5:16 PM, Jay Faulkner wrote:
>
>> Hey Vdrok, some comments inline.
>>
>> > On Dec 30, 2016, at 8:40 AM, Vladyslav Drok wrote:
>> >
>> > Hi all!
>> >
>> > There is a long standing problem of resources reporting in ironic virt
>> > driver. It's described in a couple of bugs I've found - [0], [1].
>> > Switching to placement API will make things better, but still there are
>> > some problems there. For example, there are cases when ironic needs to
>> > say "this node is not available", and it reports the
>> > vcpus=memory_mb=local_gb as 0 in this case. Placement API does not
>> > allow 0s, so in [2] it is proposed to remove inventory records in this
>> > case.
>> >
>> > But the whole logic here [3] seems not that obvious to me, so I'd like
>> > to discuss when do we need to report 0s to placement API. I'm thinking
>> > about the following (copy-pasted from my comment on [2]):
>> >
>> > • If there is an instance_uuid on the node, no matter what
>> >   provision/power state it's in, consider the resources as used. In
>> >   case it's an orphan, an admin will need to take some manual action
>> >   anyway.
>>
>> This won’t work, because of https://bugs.launchpad.net/nova/+bug/1503453
>> - basically the Nova resource tracker checks, decides we’re lying about
>> it being used for an instance because Nova’s records don’t show we do,
>> and it re-adds the capacity to the pool.
>
> Aha, I see, after looking at code a bit more and discussing with JayF,
> that happens during update_available_resource here
> https://github.com/openstack/nova/blob/372452a1f703115310ea3400f9f636829759b80f/nova/compute/resource_tracker.py#L921-L934,
> where "instances" are all instances assigned to current host and node.
> Though, I don't really like the fact that _used amount is greater than
> the amount that is possible here -
> https://github.com/openstack/nova/blob/372452a1f703115310ea3400f9f636829759b80f/nova/virt/ironic/driver.py#L301-L326,
> as it makes the free values reported to be negative (I can't find the
> place where they are set to 0 if negative). Maybe we could at least
> report 0 for both available and used amounts?

OK, I must be blind, it is set to 0 if negative here
https://github.com/openstack/nova/blob/372452a1f703115310ea3400f9f636829759b80f/nova/compute/resource_tracker.py#L938-L939,
so it should be fine, apart from the fact that the used value will be
greater than the available one.

>> Generally I agree with Jay Pipes’ comments - we should have available
>> resources for nodes that can be scheduled to, used resources for nodes
>> with a nova instance, and report no resources whatsoever for nodes in
>> an unschedulable state, such as cleaning, enroll, etc.
>>
>> -
>> Jay Faulkner
>> OSIC
>>
>> > • If there is no instance_uuid and a node is in cleaning/clean wait
>> >   after tear down, it is a part of normal node lifecycle, report all
>> >   resources as used. This means we need a way to determine if it's a
>> >   manual or automated clean.
>> > • If there is no instance_uuid, and a node:
>> >     • has a bad power state or
>> >     • is in maintenance
>> >     • or actually in any other case, consider it unavailable,
>> >       report available resources = used resources = 0. Provision
>> >       state does not matter in this logic, all cases that we wanted
>> >       to take into account are described in the first two bullets.
>> >
>> > Any thoughts?
>> >
>> > [0]. https://bugs.launchpad.net/nova/+bug/1402658
>> > [1]. https://bugs.launchpad.net/nova/+bug/1637449
>> > [2]. https://review.openstack.org/414214
>> > [3]. https://github.com/openstack/nova/blob/1506c36b4446f6ba1487a2d68e4b23cb3fca44cb/nova/virt/ironic/driver.py#L262
>> >
>> > Happy holidays to everyone!
>> > -Vlad

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
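To make the clamping point in the message above concrete, here is a minimal sketch. It is not the nova code behind the linked lines; `free_resources` and the sample numbers are hypothetical and only model the behaviour described in the message (free values are set to 0 if negative, while used may still exceed the total).

```python
def free_resources(total, used):
    """Return free capacity, clamped so it is never reported as negative.

    Hypothetical helper that only models the behaviour described in the
    thread; it is not taken from nova's resource tracker.
    """
    return max(0, total - used)


# A report where the used amount is greater than the total amount.
total_mb, used_mb = 4096, 8192
free_mb = free_resources(total_mb, used_mb)

# Free is clamped to zero, but used is still larger than total,
# which is the remaining oddity pointed out in the message.
print(free_mb, used_mb > total_mb)
```

The clamp hides the negative free value from consumers, but it does nothing about the used value itself being larger than the total.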
Re: [openstack-dev] [ironic] [nova] Ironic virt driver resources reporting
Thanks all for replies!

On Tue, Jan 3, 2017 at 5:16 PM, Jay Faulkner wrote:

> Hey Vdrok, some comments inline.
>
> > On Dec 30, 2016, at 8:40 AM, Vladyslav Drok wrote:
> >
> > Hi all!
> >
> > There is a long standing problem of resources reporting in ironic virt
> > driver. It's described in a couple of bugs I've found - [0], [1].
> > Switching to placement API will make things better, but still there are
> > some problems there. For example, there are cases when ironic needs to
> > say "this node is not available", and it reports the
> > vcpus=memory_mb=local_gb as 0 in this case. Placement API does not
> > allow 0s, so in [2] it is proposed to remove inventory records in this
> > case.
> >
> > But the whole logic here [3] seems not that obvious to me, so I'd like
> > to discuss when do we need to report 0s to placement API. I'm thinking
> > about the following (copy-pasted from my comment on [2]):
> >
> > • If there is an instance_uuid on the node, no matter what
> >   provision/power state it's in, consider the resources as used. In
> >   case it's an orphan, an admin will need to take some manual action
> >   anyway.
>
> This won’t work, because of https://bugs.launchpad.net/nova/+bug/1503453
> - basically the Nova resource tracker checks, decides we’re lying about
> it being used for an instance because Nova’s records don’t show we do,
> and it re-adds the capacity to the pool.

Aha, I see, after looking at code a bit more and discussing with JayF,
that happens during update_available_resource here
https://github.com/openstack/nova/blob/372452a1f703115310ea3400f9f636829759b80f/nova/compute/resource_tracker.py#L921-L934,
where "instances" are all instances assigned to current host and node.
Though, I don't really like the fact that _used amount is greater than
the amount that is possible here -
https://github.com/openstack/nova/blob/372452a1f703115310ea3400f9f636829759b80f/nova/virt/ironic/driver.py#L301-L326,
as it makes the free values reported to be negative (I can't find the
place where they are set to 0 if negative). Maybe we could at least
report 0 for both available and used amounts?

> Generally I agree with Jay Pipes’ comments - we should have available
> resources for nodes that can be scheduled to, used resources for nodes
> with a nova instance, and report no resources whatsoever for nodes in
> an unschedulable state, such as cleaning, enroll, etc.
>
> -
> Jay Faulkner
> OSIC
>
> > • If there is no instance_uuid and a node is in cleaning/clean wait
> >   after tear down, it is a part of normal node lifecycle, report all
> >   resources as used. This means we need a way to determine if it's a
> >   manual or automated clean.
> > • If there is no instance_uuid, and a node:
> >     • has a bad power state or
> >     • is in maintenance
> >     • or actually in any other case, consider it unavailable,
> >       report available resources = used resources = 0. Provision
> >       state does not matter in this logic, all cases that we wanted
> >       to take into account are described in the first two bullets.
> >
> > Any thoughts?
> >
> > [0]. https://bugs.launchpad.net/nova/+bug/1402658
> > [1]. https://bugs.launchpad.net/nova/+bug/1637449
> > [2]. https://review.openstack.org/414214
> > [3]. https://github.com/openstack/nova/blob/1506c36b4446f6ba1487a2d68e4b23cb3fca44cb/nova/virt/ironic/driver.py#L262
> >
> > Happy holidays to everyone!
> > -Vlad
Re: [openstack-dev] [ironic] [nova] Ironic virt driver resources reporting
Hey Vdrok, some comments inline.

> On Dec 30, 2016, at 8:40 AM, Vladyslav Drok wrote:
>
> Hi all!
>
> There is a long standing problem of resources reporting in ironic virt
> driver. It's described in a couple of bugs I've found - [0], [1].
> Switching to placement API will make things better, but still there are
> some problems there. For example, there are cases when ironic needs to
> say "this node is not available", and it reports the
> vcpus=memory_mb=local_gb as 0 in this case. Placement API does not allow
> 0s, so in [2] it is proposed to remove inventory records in this case.
>
> But the whole logic here [3] seems not that obvious to me, so I'd like to
> discuss when do we need to report 0s to placement API. I'm thinking about
> the following (copy-pasted from my comment on [2]):
>
> • If there is an instance_uuid on the node, no matter what
>   provision/power state it's in, consider the resources as used. In case
>   it's an orphan, an admin will need to take some manual action anyway.

This won’t work, because of https://bugs.launchpad.net/nova/+bug/1503453 -
basically the Nova resource tracker checks, decides we’re lying about it
being used for an instance because Nova’s records don’t show we do, and it
re-adds the capacity to the pool.

Generally I agree with Jay Pipes’ comments - we should have available
resources for nodes that can be scheduled to, used resources for nodes
with a nova instance, and report no resources whatsoever for nodes in an
unschedulable state, such as cleaning, enroll, etc.

-
Jay Faulkner
OSIC

> • If there is no instance_uuid and a node is in cleaning/clean wait
>   after tear down, it is a part of normal node lifecycle, report all
>   resources as used. This means we need a way to determine if it's a
>   manual or automated clean.
> • If there is no instance_uuid, and a node:
>     • has a bad power state or
>     • is in maintenance
>     • or actually in any other case, consider it unavailable,
>       report available resources = used resources = 0. Provision state
>       does not matter in this logic, all cases that we wanted to take
>       into account are described in the first two bullets.
>
> Any thoughts?
>
> [0]. https://bugs.launchpad.net/nova/+bug/1402658
> [1]. https://bugs.launchpad.net/nova/+bug/1637449
> [2]. https://review.openstack.org/414214
> [3]. https://github.com/openstack/nova/blob/1506c36b4446f6ba1487a2d68e4b23cb3fca44cb/nova/virt/ironic/driver.py#L262
>
> Happy holidays to everyone!
> -Vlad
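The objection in the message above (usage reported for a node is dropped when Nova has no matching instance record) can be pictured with a deliberately simplified model. This is not nova's resource tracker; `tracker_view` and the dictionary keys are hypothetical, and the only assumption taken from the thread is that usage is recomputed from Nova's own instance records rather than trusted from the driver.

```python
def tracker_view(driver_report, nova_instances):
    """Simplified, hypothetical model of the behaviour described above.

    The used amount claimed by the driver is ignored; usage is summed
    from the instances that Nova itself knows about on this node.
    """
    total = driver_report["vcpus"]
    used = sum(inst["vcpus"] for inst in nova_instances)
    return {
        "vcpus": total,
        "vcpus_used": used,
        "vcpus_free": max(0, total - used),
    }


# Orphan case: the node has an instance_uuid in ironic, so the driver
# reports it as fully used, but Nova has no record of that instance.
orphan_report = {"vcpus": 8, "vcpus_used": 8}
view = tracker_view(orphan_report, nova_instances=[])
print(view)
```

In this model the orphaned node ends up looking free again, which is the "capacity goes back to the pool" effect the message describes.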
Re: [openstack-dev] [ironic] [nova] Ironic virt driver resources reporting
Hi, a comment about 'report as full' vs 'remove from inventory'

On Mon, Jan 2, 2017 at 7:53 PM, Jay Pipes wrote:

> Great questions, Vlad. Comments inline.
>
> On 12/30/2016 11:40 AM, Vladyslav Drok wrote:
>
>> Hi all!
>>
>> There is a long standing problem of resources reporting in ironic virt
>> driver.
>
> That would be an understatement :)
>
>> It's described in a couple of bugs I've found - [0], [1].
>> Switching to placement API will make things better, but still there are
>> some problems there. For example, there are cases when ironic needs to
>> say "this node is not available", and it reports the
>> vcpus=memory_mb=local_gb as 0 in this case. Placement API does not allow
>> 0s, so in [2] it is proposed to remove inventory records in this case.
>
> Correct.
>
>> But the whole logic here [3] seems not that obvious to me, so I'd like
>> to discuss when do we need to report 0s to placement API. I'm thinking
>> about the following (copy-pasted from my comment on [2]):
>>
>> * If there is an instance_uuid on the node, no matter what
>>   provision/power state it's in, consider the resources as used. In
>>   case it's an orphan, an admin will need to take some manual action
>>   anyway.
>
> The single source of truth for Ironic instances is the Ironic database. If
> Ironic's database says that a node is consumed by an instance, then it
> should be considered by Nova to be consumed.

Well, it is nova that marks the instance as consumed by setting the
instance_uuid field on the node :) The question is when is the right time
to remove it... (see my next comment below). Currently it is removed
before teardown/undeploy, so the node in CLEANING state already has no
instance_uuid on itself.

>> * If there is no instance_uuid and a node is in cleaning/clean wait
>>   after tear down, it is a part of normal node lifecycle, report all
>>   resources as used. This means we need a way to determine if it's a
>>   manual or automated clean.
>
> I don't see a need to determine manual vs. automated clean. The node is in
> a clean state; therefore the inventory of resources on that node is not
> available for a consumer of those resources to consume. So, the inventory
> should be deleted in Nova. This inventory should be re-added if and when
> the node is in a state that a consumer can grab it.

There is a difference between "removing the resource from available" vs
"declaring the resource fully consumed" - the end result for scheduling is
the same (those resources are not being scheduled to), but I am worried
about any cloud-wide monitoring mechanisms that may start alerting about
hypervisors disappearing / total cloud capacity going down even though
everything is operating normally.

IMO during the happy path for a nova instance on an ironic node (node
available -> nova does deploy -> node active -> nova does undeploy -> node
is available, with all intermediate *ing / *_wait states) the node should
be reported as "fully consumed by instance", as cleaning in this case is a
standard part of a healthy node lifecycle. Only when something out of the
happy path happens (maintenance, deploy or cleaning error) should the node
be removed from overall cloud capacity. And this is why we might have to
differentiate between automated cleaning (happy path) vs manual cleaning
(usually some manual recovery from error).

Due to this I'd also suggest removing the instance_uuid from the ironic
node at the end of cleaning, which should make it clearer which stage the
node is in right now.

>> * If there is no instance_uuid, and a node:
>>     o has a bad power state or
>>     o is in maintenance
>>     o or actually in any other case, consider it unavailable, report
>>       available resources = used resources = 0. Provision state does
>>       not matter in this logic, all cases that we wanted to take into
>>       account are described in the first two bullets.
>
> Correct. If there is no instance UUID for the node, that means there's no
> allocation for it. If there's no allocation for the node, its inventory
> can and should be deleted if the node cannot be consumed by an instance
> (for whatever reason).
>
> Best,
> -jay
>
>> Any thoughts?
>>
>> [0]. https://bugs.launchpad.net/nova/+bug/1402658
>> [1]. https://bugs.launchpad.net/nova/+bug/1637449
>> [2]. https://review.openstack.org/414214
>> [3]. https://github.com/openstack/nova/blob/1506c36b4446f6ba1487a2d68e4b23cb3fca44cb/nova/virt/ironic/driver.py#L262
>>
>> Happy holidays to everyone!
>> -Vlad
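The monitoring concern raised in the message above comes down to simple arithmetic, sketched here under stated assumptions. The `capacity` function, the two policy names and the three-node cloud are all hypothetical; each node counts as one unit of capacity, and only automated cleaning on the happy path is modelled.

```python
def capacity(nodes, cleaning_policy):
    """Return (total, used) node counts as a monitoring system might see them.

    nodes: list of (provision_state, has_instance) tuples.
    cleaning_policy: "remove" drops cleaning nodes from the inventory,
    "consumed" keeps them and reports them as fully used.
    Both policy names are made up for this illustration.
    """
    total = used = 0
    for state, has_instance in nodes:
        if has_instance:
            total += 1
            used += 1
        elif state == "available":
            total += 1
        elif state in ("cleaning", "clean wait") and cleaning_policy == "consumed":
            total += 1
            used += 1
        # Anything else is treated as removed from the inventory.
    return total, used


cloud = [("active", True), ("cleaning", False), ("available", False)]
print(capacity(cloud, "remove"))
print(capacity(cloud, "consumed"))
```

Under both policies exactly one node is schedulable, but with the "remove" policy the total capacity shrinks while a node is cleaning, which is the kind of dip that could trigger capacity alerts.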
Re: [openstack-dev] [ironic] [nova] Ironic virt driver resources reporting
Great questions, Vlad. Comments inline.

On 12/30/2016 11:40 AM, Vladyslav Drok wrote:
> Hi all!
>
> There is a long standing problem of resources reporting in ironic virt
> driver.

That would be an understatement :)

> It's described in a couple of bugs I've found - [0], [1].
> Switching to placement API will make things better, but still there are
> some problems there. For example, there are cases when ironic needs to
> say "this node is not available", and it reports the
> vcpus=memory_mb=local_gb as 0 in this case. Placement API does not allow
> 0s, so in [2] it is proposed to remove inventory records in this case.

Correct.

> But the whole logic here [3] seems not that obvious to me, so I'd like
> to discuss when do we need to report 0s to placement API. I'm thinking
> about the following (copy-pasted from my comment on [2]):
>
> * If there is an instance_uuid on the node, no matter what
>   provision/power state it's in, consider the resources as used. In
>   case it's an orphan, an admin will need to take some manual action
>   anyway.

The single source of truth for Ironic instances is the Ironic database. If
Ironic's database says that a node is consumed by an instance, then it
should be considered by Nova to be consumed.

> * If there is no instance_uuid and a node is in cleaning/clean wait
>   after tear down, it is a part of normal node lifecycle, report all
>   resources as used. This means we need a way to determine if it's a
>   manual or automated clean.

I don't see a need to determine manual vs. automated clean. The node is in
a clean state; therefore the inventory of resources on that node is not
available for a consumer of those resources to consume. So, the inventory
should be deleted in Nova. This inventory should be re-added if and when
the node is in a state that a consumer can grab it.

> * If there is no instance_uuid, and a node:
>     o has a bad power state or
>     o is in maintenance
>     o or actually in any other case, consider it unavailable, report
>       available resources = used resources = 0. Provision state does
>       not matter in this logic, all cases that we wanted to take into
>       account are described in the first two bullets.

Correct. If there is no instance UUID for the node, that means there's no
allocation for it. If there's no allocation for the node, its inventory
can and should be deleted if the node cannot be consumed by an instance
(for whatever reason).

Best,
-jay

> Any thoughts?
>
> [0]. https://bugs.launchpad.net/nova/+bug/1402658
> [1]. https://bugs.launchpad.net/nova/+bug/1637449
> [2]. https://review.openstack.org/414214
> [3]. https://github.com/openstack/nova/blob/1506c36b4446f6ba1487a2d68e4b23cb3fca44cb/nova/virt/ironic/driver.py#L262
>
> Happy holidays to everyone!
> -Vlad
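The delete-and-re-add approach argued for in the message above might look roughly like this. It is only a sketch: `sync_inventory`, the in-memory `inventories` dict and the `consumable` flag are hypothetical stand-ins, and no real placement API calls are shown.

```python
def sync_inventory(inventories, node_uuid, resources, instance_uuid, consumable):
    """Keep or delete a node's inventory record (hypothetical helper).

    - A node with an instance keeps its inventory; the instance's
      allocation accounts for the usage.
    - A node without an instance keeps its inventory only while a
      consumer could actually grab it.
    - Otherwise the record is deleted instead of being reported as zeros.
    """
    if instance_uuid is not None or consumable:
        inventories[node_uuid] = dict(resources)
    else:
        inventories.pop(node_uuid, None)
    return inventories


inventories = {}
res = {"vcpus": 8, "memory_mb": 16384, "local_gb": 100}

# Node is available: inventory is present.
sync_inventory(inventories, "node-1", res, instance_uuid=None, consumable=True)
# Node goes into cleaning with no instance: inventory is deleted.
sync_inventory(inventories, "node-1", res, instance_uuid=None, consumable=False)
after_cleaning_started = dict(inventories)
# Node becomes available again: inventory is re-added.
sync_inventory(inventories, "node-1", res, instance_uuid=None, consumable=True)
```

Deleting the record sidesteps the "placement does not allow 0s" problem entirely, at the cost of the node disappearing from total capacity while it is not consumable.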
[openstack-dev] [ironic] [nova] Ironic virt driver resources reporting
Hi all!

There is a long standing problem of resources reporting in ironic virt
driver. It's described in a couple of bugs I've found - [0], [1]. Switching
to placement API will make things better, but still there are some problems
there. For example, there are cases when ironic needs to say "this node is
not available", and it reports the vcpus=memory_mb=local_gb as 0 in this
case. Placement API does not allow 0s, so in [2] it is proposed to remove
inventory records in this case.

But the whole logic here [3] seems not that obvious to me, so I'd like to
discuss when do we need to report 0s to placement API. I'm thinking about
the following (copy-pasted from my comment on [2]):

- If there is an instance_uuid on the node, no matter what provision/power
  state it's in, consider the resources as used. In case it's an orphan,
  an admin will need to take some manual action anyway.
- If there is no instance_uuid and a node is in cleaning/clean wait after
  tear down, it is a part of normal node lifecycle, report all resources
  as used. This means we need a way to determine if it's a manual or
  automated clean.
- If there is no instance_uuid, and a node:
    - has a bad power state or
    - is in maintenance
    - or actually in any other case, consider it unavailable, report
      available resources = used resources = 0. Provision state does not
      matter in this logic, all cases that we wanted to take into account
      are described in the first two bullets.

Any thoughts?

[0]. https://bugs.launchpad.net/nova/+bug/1402658
[1]. https://bugs.launchpad.net/nova/+bug/1637449
[2]. https://review.openstack.org/414214
[3]. https://github.com/openstack/nova/blob/1506c36b4446f6ba1487a2d68e4b23cb3fca44cb/nova/virt/ironic/driver.py#L262

Happy holidays to everyone!
-Vlad
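The three rules proposed in the message above can be sketched as a small decision function. This is an illustration, not the ironic virt driver: the `Node` class, `report_resources` and the `automated_clean` flag are hypothetical (the message itself notes that a way to tell manual from automated cleaning is still needed). The branch for a free, schedulable node is an added assumption, since the bullets do not spell that case out.

```python
from dataclasses import dataclass
from typing import Optional

CLEANING_STATES = {"cleaning", "clean wait"}


@dataclass
class Node:
    """Hypothetical, trimmed-down view of an ironic node."""
    instance_uuid: Optional[str]
    provision_state: str
    vcpus: int = 0
    memory_mb: int = 0
    local_gb: int = 0
    maintenance: bool = False
    bad_power_state: bool = False
    automated_clean: bool = True  # assumed flag; no such signal exists in the thread


def report_resources(node):
    """Return {'total': ..., 'used': ...} following the proposed rules."""
    total = {"vcpus": node.vcpus, "memory_mb": node.memory_mb,
             "local_gb": node.local_gb}
    zeros = dict.fromkeys(total, 0)

    # Rule 1: an instance_uuid means "used", whatever the state is.
    if node.instance_uuid is not None:
        return {"total": total, "used": dict(total)}
    # Rule 2: automated cleaning after tear down is part of the normal
    # lifecycle, so the node is still reported as fully used.
    if node.provision_state in CLEANING_STATES and node.automated_clean:
        return {"total": total, "used": dict(total)}
    # Assumption (not in the bullets): a healthy available node is free.
    if (node.provision_state == "available" and not node.maintenance
            and not node.bad_power_state):
        return {"total": total, "used": dict(zeros)}
    # Rule 3: anything else is unavailable, available = used = 0.
    return {"total": dict(zeros), "used": dict(zeros)}
```

For example, `report_resources(Node(None, "cleaning", vcpus=4, automated_clean=False))` falls through to rule 3 and reports zeros, which is where the manual-versus-automated distinction matters.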